rinongal / textual_inversion

How to use two new tokens at the same time? #109

Closed: g-jing closed this issue 1 year ago

g-jing commented 1 year ago

Currently, the training script has an --init_word argument that lets us train one new word (a rough training invocation is sketched below, after the generation command). During the generation process, we use --prompt to invoke the new token:

python scripts/txt2img.py --ddim_eta 0.0 \
                          --n_samples 8 \
                          --n_iter 2 \
                          --scale 10.0 \
                          --ddim_steps 50 \
                          --embedding_path /path/to/logs/trained_model/checkpoints/embeddings_gs-5049.pt \
                          --ckpt_path /path/to/pretrained/model.ckpt \
                          --prompt "a photo of *"
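For reference, I train a single concept roughly like this, following the readme (paths, the run name, and the init word are placeholders; see the readme for the exact flags):

python main.py --base configs/latent-diffusion/txt2img-1p4B-finetune.yaml \
               -t \
               --actual_resume /path/to/pretrained/model.ckpt \
               -n run_name \
               --gpus 0, \
               --data_root /path/to/directory/with/images \
               --init_word init_word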

How to modify this script to use two new tokens at the same time?

rinongal commented 1 year ago

If you're talking about using two tokens at the same time for inference, you'll need to merge their embedding files first. Have a look at the "Merging Checkpoints" part of the readme.
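Roughly, that merge step looks like this (paths are placeholders; check the "Merging Checkpoints" section for the exact flags):

python merge_embeddings.py --manager_ckpts /path/to/first/embedding.pt /path/to/second/embedding.pt \
                           --output_path /path/to/merged/embedding.pt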

If both of your concepts use the default "*" placeholder, the script will prompt you to choose another placeholder for one of them (e.g. you can use "@"). At that point you can just use txt2img.py as normal, but run the prompt as: "a photo of * in the style of @" or things of that sort.
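For example, reusing the txt2img.py command from your question with the merged file:

python scripts/txt2img.py --ddim_eta 0.0 \
                          --n_samples 8 \
                          --n_iter 2 \
                          --scale 10.0 \
                          --ddim_steps 50 \
                          --embedding_path /path/to/merged/embedding.pt \
                          --ckpt_path /path/to/pretrained/model.ckpt \
                          --prompt "a photo of * in the style of @"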

g-jing commented 1 year ago

Regarding the embeddings: sometimes you mention SD (Stable Diffusion) and sometimes LDM (latent diffusion model). I thought they were the same model from the same paper (https://arxiv.org/abs/2112.10752). Could you clarify? I just want to make sure I understand the issue.

rinongal commented 1 year ago

Stable Diffusion and LDM are not the same model. They do not use the same text encoder, so the embeddings you learn with one are not transferable to the other, and you can't merge them into a single model (they don't even have the same dimension).
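A quick way to see this for yourself: load one embedding checkpoint trained against each model and compare the shapes. A minimal sketch, assuming this repo's checkpoint layout (a "string_to_param" dict mapping placeholder strings to learned vectors) and placeholder file paths:

import torch

# Load one embedding trained against SD and one against LDM (paths are placeholders).
sd_ckpt = torch.load("sd_embeddings.pt", map_location="cpu")
ldm_ckpt = torch.load("ldm_embeddings.pt", map_location="cpu")

# Print the shape of each learned placeholder vector. The widths differ
# between the two text encoders, so the vectors cannot be merged or swapped.
for name, ckpt in (("SD", sd_ckpt), ("LDM", ldm_ckpt)):
    for placeholder, param in ckpt["string_to_param"].items():
        print(f"{name}: '{placeholder}' -> {tuple(param.shape)}")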

g-jing commented 1 year ago

Thanks for your clarification. Do you know where we could find a detailed comparison between Stable Diffusion and the latent diffusion model? They come from the same paper, and I could not find any other description of the differences.

rinongal commented 1 year ago

The main difference between them is that Stable Diffusion uses a frozen CLIP text encoder, while the text-conditioned version of LDM used a BERT encoder which was trained along with the generator.

Other than that, there are differences in the training data and length of training, in the image resolution, and in some minor architectural parameters. They are from 'the same paper' in the sense that both build on the same underlying idea: training a diffusion model to predict latent codes in an encoder/decoder's latent space, rather than a full-resolution image.
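To make the text-encoder difference concrete, here is a minimal sketch of the frozen-CLIP side, assuming the HuggingFace transformers package and the openai/clip-vit-large-patch14 weights used by Stable Diffusion v1 (LDM instead trains its BERT-style encoder jointly with the generator):

from transformers import CLIPTokenizer, CLIPTextModel

# Stable Diffusion v1 conditions on this CLIP text encoder, kept frozen.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Freeze every parameter: only the diffusion model (and, in textual
# inversion, the new token embedding) receives gradients.
for p in text_encoder.parameters():
    p.requires_grad = False

tokens = tokenizer("a photo of *", padding="max_length",
                   max_length=tokenizer.model_max_length, return_tensors="pt")
context = text_encoder(tokens.input_ids).last_hidden_state
print(context.shape)  # (1, 77, 768): 768-dim CLIP token features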

g-jing commented 1 year ago

Thanks so much for this detailed clarification! Appreciated!