rinongal / textual_inversion

Finetuning for multiple classes #114

Closed NamburiSrinath closed 1 year ago

NamburiSrinath commented 1 year ago

Hi,

I tried playing with Stable Diffusion (https://github.com/huggingface/diffusers) to generate images, but didn't achieve good-quality ones.

Example prompt: "Aeroplane". Generated image: plane

Note: some generated images are of nice quality (e.g. plane2).

I came across your repo and found that I can use "Textual Inversion" and fine-tune to transfer the concept. But I have multiple classes (ship, aeroplane, etc.) and would like to know how I can fine-tune on multiple classes.

In short, at inference time you mentioned that we need to prompt it as "A photo of *", but I would like to do "A photo of an aeroplane", "A photo of a ship", etc. after fine-tuning the model. (I checked this issue and am curious to know if this repo can work for my case - https://github.com/rinongal/textual_inversion/issues/8)

Thanks in advance Srinath

rinongal commented 1 year ago

Hi,

If you're using this implementation, you can just train one model for each concept individually and then use the merge embedding script to combine them into a single model. There are instructions for that in the readme.
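
A rough sketch of the merge step (the flags mirror the usage that appears later in this thread; the checkpoint paths are placeholders for your own runs, and the readme has the exact invocation):

    python merge_embeddings.py \
        --manager_ckpts logs/<ship_run>/checkpoints/embeddings_gs-XXXX.pt \
                        logs/<plane_run>/checkpoints/embeddings_gs-XXXX.pt \
        --output_path ship_plane.pt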

You can also change your placeholder token from * to 'ship' or 'plane'. Look at either the config file or main.py's run arguments to see how to do this.
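
For example (a sketch only, reusing the training flags that appear later in this thread; adjust paths and run names to your setup):

    python main.py --base configs/latent-diffusion/txt2img-1p4B-finetune.yaml \
                   -t \
                   --actual_resume models/ldm/text2img-large/model.ckpt \
                   --placeholder_string "ship" \
                   --init_word "ship" \
                   --data_root train_data/ship \
                   -n ship_run_1 \
                   --gpus 0,

Afterwards you would prompt with "a photo of ship" instead of "a photo of *".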

NamburiSrinath commented 1 year ago

Thanks for your response @rinongal. I tried to invert for 2 classes: "airplane" and "truck".

Inversion command (for airplane):

python main.py --base configs/latent-diffusion/txt2img-1p4B-finetune.yaml \
               -t \
               --actual_resume models/ldm/text2img-large/model.ckpt \
               --placeholder_string "airplane" \
               -n airplane_run_1 \
               --gpus 0, \
               --data_root train_data/airplane \
               --init_word "airplane"

And the inference command is:

python scripts/txt2img.py --ddim_eta 0.0 \
                          --n_samples 8 \
                          --n_iter 2 \
                          --scale 10.0 \
                          --ddim_steps 50 \
                          --embedding_path /hdd2/srinath/textual_inversion/logs/airplane2022-10-25T23-13-42_truck_run_1/checkpoints/embeddings_gs-6099.pt \
                          --ckpt_path models/ldm/text2img-large/model.ckpt \
                          --prompt "a photo of airplane"

The images in train_data/airplane are: 1 2 3 4 5

And the images generated in outputs/samples are: a-photo-of-*

which I believe is a fair generation of an airplane, capturing the style/features present in the above 5 images.

But I have a few questions and need suggestions from your end:

  1. I am having some difficulty understanding the checkpoint structure. Please find attached a screenshot of the checkpoints for airplane.
  2. How can I generalize this behaviour? i.e. suppose I want different colors of airplanes, how can I do that? By prompt engineering ("a photo of airplane in blue") and/or by having images of the different styles present in train_data/airplane (i.e. making sure there is a blue airplane in the training data)?

  3. I also tried the merge script that you suggested. The command I ran is:

    python merge_embeddings.py \
    --manager_ckpts /hdd2/srinath/textual_inversion/logs/airplane2022-10-25T22-56-37_airplane_run_1/checkpoints/embeddings_gs-6099.pt \
    /hdd2/srinath/textual_inversion/logs/truck2022-10-25T23-13-42_truck_run_1/checkpoints/embeddings_gs-6099.pt \
    --output_path airplane_truck.pt

    and it did generate the airplane_truck.pt file. Now my question is:

  4. Can I safely assume that airplane_truck.pt is better than the individual .pt embeddings at generating images of airplane and truck when we pass a prompt?

Thanks a lot for your time :) Srinath

rinongal commented 1 year ago
  1. The number in the _gs-xxxx suffix of the checkpoint file name is the step number (the number of training iterations). This is not the number of epochs, but you can estimate the number of epochs by dividing this number by your number of training images. For example, embeddings_gs-6099.pt trained on your 5 airplane images corresponds to roughly 6099 / 5 ≈ 1220 epochs.

You can find an explanation of the logged images here: https://github.com/rinongal/textual_inversion/issues/19 and here: https://github.com/rinongal/textual_inversion/issues/34

  2. Generalizing: If you are using LDM, you can just change the text you use for generation, so "a photo of airplane in blue" should be fine. If you see that it fails to change, try using an earlier training checkpoint, since you may have overfit (typically ~5000 steps should be good enough). You do not need to have blue airplanes in your data.

If you are using Stable Diffusion, its text encoder is significantly weaker and more prone to overfitting, which may make some modifications harder. In that case, you may have to use more complex prompts or some prompt-weighting method (for which I'd recommend the AUTOMATIC1111 WebUI).
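
For the LDM case, a concrete example of such a prompt change, reusing your inference command from above (the earlier-checkpoint path below is hypothetical; substitute whichever embeddings_gs-*.pt file your run actually saved around that step):

    python scripts/txt2img.py --ddim_eta 0.0 \
                              --n_samples 8 \
                              --n_iter 2 \
                              --scale 10.0 \
                              --ddim_steps 50 \
                              --embedding_path logs/<airplane_run>/checkpoints/embeddings_gs-4999.pt \
                              --ckpt_path models/ldm/text2img-large/model.ckpt \
                              --prompt "a photo of airplane in blue"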

  3 (+4). The merged embedding file just puts both of your new words into a single file. It won't create better trucks than the truck file alone, and it won't create better airplanes than the airplane file alone. It just stores them in a way that lets you access both at the same time.
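
In other words, you can point --embedding_path at the merged file and switch between concepts purely through the prompt. A sketch based on your inference command above (it assumes the merged file is passed exactly like an individual embedding file):

    python scripts/txt2img.py --ddim_eta 0.0 \
                              --n_samples 8 \
                              --n_iter 2 \
                              --scale 10.0 \
                              --ddim_steps 50 \
                              --embedding_path airplane_truck.pt \
                              --ckpt_path models/ldm/text2img-large/model.ckpt \
                              --prompt "a photo of airplane"

and the same command with --prompt "a photo of truck" for the second concept.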

NamburiSrinath commented 1 year ago

Thank you so much. I have additional questions that are not related to this thread, so I am closing it :)

yuxu915 commented 1 year ago

@rinongal @NamburiSrinath Hi, I have a related question: I am using Stable Diffusion and I find it very hard to modify the generations. For example, below are images generated using the prompts "a photo of *" and "a photo of * in river". The embedding is at step 499 and should not be overfitted.

(attached: two generated sample images)

yuxu915 commented 1 year ago

@NamburiSrinath Hi, I'd like to ask about the merge of plane and truck: are the output images a mixture of airplanes and trucks, or a separate airplane and a separate truck? Thank you.