Closed: @NamburiSrinath closed this issue 1 year ago.
Hi,
If you're using this implementation, you can train one model for each concept individually and then use the merge-embeddings script to combine them into a single embedding file. There are instructions for that in the README.
You can also change your placeholder token from * to 'ship' or 'plane'. Look at either the config file or main.py's run arguments for how to do this.
Thanks for your response @rinongal. I tried to invert for 2 classes "airplane" and "truck"
Inversion command (for airplane):
```
python main.py --base configs/latent-diffusion/txt2img-1p4B-finetune.yaml \
    -t \
    --actual_resume models/ldm/text2img-large/model.ckpt \
    --placeholder_string "airplane" \
    -n airplane_run_1 \
    --gpus 0, \
    --data_root train_data/airplane \
    --init_word "airplane"
```
And the inference command is:
```
python scripts/txt2img.py --ddim_eta 0.0 \
    --n_samples 8 \
    --n_iter 2 \
    --scale 10.0 \
    --ddim_steps 50 \
    --embedding_path /hdd2/srinath/textual_inversion/logs/airplane2022-10-25T23-13-42_truck_run_1/checkpoints/embeddings_gs-6099.pt \
    --ckpt_path models/ldm/text2img-large/model.ckpt \
    --prompt "a photo of airplane"
```
The images in train_data/airplane are: (training images omitted)

And the images generated in outputs/samples are: (generated samples omitted)
which I believe is a fair generation of airplane capturing the style/features present in the above 5 images.
But I have a few questions and would appreciate suggestions from your end:
How can I generalize this behaviour? That is, suppose I want airplanes in different colors: should I do that by prompt engineering ("a photo of airplane in blue"), and/or by including images with those variations in train_data/airplane (i.e. making sure a blue airplane is present in the training data)?
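For the prompt-engineering route, one simple pattern (purely illustrative; it assumes "airplane" is the placeholder_string used during training) is to generate the prompt variants programmatically and then run the inference script once per prompt:

```python
# Build prompt variants around the learned placeholder token.
# "airplane" is assumed to be the placeholder_string used during training.
placeholder = "airplane"
colors = ["blue", "red", "white"]
prompts = [f"a photo of a {color} {placeholder}" for color in colors]

for prompt in prompts:
    print(prompt)
```

Each prompt would then be passed to `scripts/txt2img.py --prompt "..."` as in the inference command above.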
I also tried the merge script that you suggested (the command I ran is)
```
python merge_embeddings.py \
    --manager_ckpts /hdd2/srinath/textual_inversion/logs/airplane2022-10-25T22-56-37_airplane_run_1/checkpoints/embeddings_gs-6099.pt \
    /hdd2/srinath/textual_inversion/logs/truck2022-10-25T23-13-42_truck_run_1/checkpoints/embeddings_gs-6099.pt \
    --output_path airplane_truck.pt
```
and it did generate the airplane_truck.pt file. Now my question is,
Can I safely assume that airplane_truck.pt is better than the individual .pt embeddings at generating images of airplanes and trucks from a prompt?
Thanks a lot for your time :) Srinath
You can find an explanation of the logged images here: https://github.com/rinongal/textual_inversion/issues/19 and here: https://github.com/rinongal/textual_inversion/issues/34
If you are using Stable Diffusion, its text encoder is significantly weaker and more prone to overfitting, which may make some modifications harder. In that case, you may have to use more complex prompts or some prompt-weighting method (for which I'd recommend the AUTOMATIC1111 WebUI).
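As a rough intuition for what prompt-weighting methods do (a toy sketch only, not the WebUI's actual implementation): the conditioning contribution of an emphasized token is scaled relative to the others, which you can picture as interpolating between a base embedding and the concept embedding:

```python
def weight_embedding(base, concept, w):
    """Toy prompt weighting: interpolate between a base embedding
    and a concept embedding by weight w (w > 1 over-emphasizes)."""
    return [(1 - w) * b + w * c for b, c in zip(base, concept)]

# With w = 0 you recover the base; with w = 1 the concept embedding.
print(weight_embedding([0.0, 0.0], [1.0, 2.0], 0.5))  # [0.5, 1.0]
```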
3(+4). The merged embedding file just puts both of your new words into one single file. It won't create better trucks than the truck file alone, and it won't create better airplanes than the airplane file alone. It just stores them in a way that lets you access both at the same time.
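Conceptually, the merge is just a union of the placeholder-to-embedding maps from each checkpoint, with a check that no two files claim the same word. A toy sketch of that idea (the real merge_embeddings.py operates on torch checkpoints via the embedding manager's internal structure, which this deliberately simplifies to plain dicts):

```python
def merge_embedding_dicts(*dicts):
    """Union several {placeholder: embedding} maps, refusing
    to silently overwrite a word that appears in two files."""
    merged = {}
    for d in dicts:
        for word, emb in d.items():
            if word in merged:
                raise ValueError(f"placeholder {word!r} appears in multiple files")
            merged[word] = emb
    return merged

airplane = {"airplane": [0.1, 0.2]}  # stand-in for one embeddings_gs-*.pt
truck = {"truck": [0.3, 0.4]}        # stand-in for the other
both = merge_embedding_dicts(airplane, truck)
print(sorted(both))  # ['airplane', 'truck']
```

This also makes the answer above concrete: the merged map contains exactly the same per-word embeddings as the inputs, so quality per concept is unchanged.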
Thank you so much. I have additional questions which are not part of this thread, so I'm closing it :)
@rinongal @NamburiSrinath Hi, I have a related question: I use Stable Diffusion and I find it very hard to make modifications. For example, below are images generated using the prompts "a photo of " and "a photo of in river" (images omitted). The embedding is at step 499 and should not be overfitting.
@NamburiSrinath hi, I'd like to ask about the merge of plane and truck: are the output images a mixture of airplanes and trucks, or a separate airplane and truck? Thank you.
Hi,
I tried playing with Stable Diffusion (https://github.com/huggingface/diffusers) to generate images, but didn't achieve good-quality ones.
Example prompt: "Aeroplane" (generated image omitted)
Note: Some generated images are of nice quality.
I came across your repo and found that I can use Textual Inversion to fine-tune and get the concept transferred. But I have multiple classes (ship, aeroplane, etc.) and would like to know how I can fine-tune on multiple classes.
In short, you mentioned that at inference time we need to prompt with "A photo of *", but after fine-tuning the model I would like to use prompts such as "A photo of an aeroplane" and "A photo of a ship". (I checked this issue and am curious whether this repo can work for my case: https://github.com/rinongal/textual_inversion/issues/8)
Thanks in advance Srinath