Train Textual Inversion in alternative models of Stable Diffusion

fdagostino commented 1 year ago

Hi @rinongal, how are you? I'm trying to train a person's face with Textual Inversion on two models: the standard 1.5 base model and Deliberate V2 (a DreamBooth fine-tuned model based on 1.5, very photorealistic).

When training on the 1.5 model, it converges successfully, but when training on Deliberate it doesn't converge, despite testing all kinds of configs (different LRs, etc.), including going to 10k steps (the only setting I've kept fixed is using two vectors to represent the token). Someone told me that it could be related to the EMA weights in the model, but that doesn't make much sense to me: when training, we are moving the vectors around trying to find a position that represents the face, and I don't see how the EMA weights could prevent us from finding that position.

Do you have any idea/insight/intuition on why I can find a vector that represents the face in the base model but can't find one in the other, given that the latter was trained from the former and is quite photorealistic?

I want to dig deeper into this, so any direction on how to debug it is very appreciated!

Thanks, Fran

rinongal commented 1 year ago

Hi @fdagostino ,

First of all, a possible workaround may be to train the face in the V1.5 model, and then initialize your DeliberateV2 training using this learned face embedding. The embeddings tend to transfer reasonably well between fine-tuned models, so it might serve as a good initialization. I haven't actively tried doing this, but these sorts of tricks typically work with GANs.
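As a rough illustration of the warm-start idea above, here is a minimal sketch that copies previously learned vectors into the rows of an embedding table reserved for the placeholder token. Everything in it is an assumption made for illustration: the table is a plain list of rows standing in for the text encoder's token embedding matrix, and `warm_start`, `placeholder_ids`, and `learned_vectors` are hypothetical names, not part of this repository's code. With PyTorch, the same copy would be done on the embedding weight inside `torch.no_grad()`.

```python
def warm_start(embedding_table, placeholder_ids, learned_vectors):
    """Overwrite the placeholder rows with vectors learned on another model.

    embedding_table: list of rows (stand-in for the token embedding matrix)
    placeholder_ids: row indices reserved for the new token (hypothetical)
    learned_vectors: one vector per placeholder id, e.g. two vectors
                     if the token is represented by two vectors
    """
    if len(placeholder_ids) != len(learned_vectors):
        raise ValueError("need exactly one learned vector per placeholder id")
    dim = len(embedding_table[0])
    for token_id, vector in zip(placeholder_ids, learned_vectors):
        if len(vector) != dim:
            # Transfer only makes sense if both models share the embedding size.
            raise ValueError(f"expected dimension {dim}, got {len(vector)}")
        embedding_table[token_id] = list(vector)
    return embedding_table


# Toy demo: 4 tokens, 3 dimensions, a token represented by two vectors.
table = [[0.0, 0.0, 0.0] for _ in range(4)]
learned = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
warm_start(table, placeholder_ids=[2, 3], learned_vectors=learned)
```

Only the placeholder rows change; every other row keeps its original values, so training on the second model starts from the transferred vectors rather than from the usual initializer word.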

On the broader question - I'm actually not sure why this would happen. I wouldn't expect EMA weights to have a large impact on embedding tuning. Two possible things that do come to mind: 1) If the weights are saved / loaded with different precision than the baseline model, this may have an impact. 2) If Deliberate V2 is DreamBooth trained, does it come with its own keyword? This might conflict with the inversion process (e.g. the DB keyword might take attention away from the new TI word). Do your training prompts include this keyword? Have you tried removing / adding it to the prompts?
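Both checks can be scripted before starting another training run. The sketch below is only an illustration under stated assumptions: the checkpoints are assumed to be available as mappings from parameter names to objects exposing a `.dtype` attribute (as tensors in a PyTorch state dict do), and `dtype_mismatches`, `prompts_with_keyword`, and the keyword `mykeyword` are hypothetical names chosen for the example.

```python
def dtype_mismatches(base_sd, tuned_sd):
    """Return {name: (base_dtype, tuned_dtype)} for shared parameters
    whose stored precision differs between the two checkpoints."""
    mismatches = {}
    for name in base_sd.keys() & tuned_sd.keys():
        a, b = base_sd[name], tuned_sd[name]
        if hasattr(a, "dtype") and hasattr(b, "dtype"):
            if str(a.dtype) != str(b.dtype):
                mismatches[name] = (str(a.dtype), str(b.dtype))
    return mismatches


def prompts_with_keyword(prompts, keyword):
    """Return the training prompts that contain the keyword (case-insensitive)."""
    return [p for p in prompts if keyword.lower() in p.lower()]


class FakeTensor:
    """Stand-in for a tensor; only the dtype attribute matters here."""
    def __init__(self, dtype):
        self.dtype = dtype


# Toy demo with stand-in data (not real checkpoints).
base = {"a.weight": FakeTensor("float32"), "b.weight": FakeTensor("float32")}
tuned = {"a.weight": FakeTensor("float16"), "b.weight": FakeTensor("float32")}
mismatches = dtype_mismatches(base, tuned)

prompts = ["a photo of *", "a mykeyword photo of *"]
flagged = prompts_with_keyword(prompts, "mykeyword")
```

If `mismatches` is non-empty, loading both models at the same precision is a cheap experiment; if `flagged` is non-empty (or empty), training once with and once without the keyword separates the two hypotheses.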

fdagostino commented 1 year ago

Thanks @rinongal! I will try to initialize with the learned embedding. I don't really know how the model was trained, only that it is a fine-tuned version of the 1.5 model.

RahulSajnani commented 11 months ago

How do I find characters that do not have multiple embeddings for OpenCLIP (SDv2)? I have changed the code to fit my needs, but no matter which placeholder character I try, it is always in the embedding. Can you help me, @rinongal?
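One way to search for such placeholders is to filter candidate strings by how many token ids the tokenizer produces for them. The sketch below assumes a hypothetical `encode` callable that maps a string to a list of token ids with no start, end, or padding tokens; if your tokenizer returns a padded sequence with start and end markers, strip those before counting. `toy_encode` and `TOY_VOCAB` are made-up stand-ins so the example runs on its own, not the OpenCLIP vocabulary.

```python
def single_token_candidates(candidates, encode):
    """Keep only the strings that `encode` maps to exactly one token id."""
    return [s for s in candidates if len(encode(s)) == 1]


# Toy tokenizer for demonstration only: known strings get one id,
# anything else is split into one id per character.
TOY_VOCAB = {"*": 10, "&": 11, "sks": 12}


def toy_encode(text):
    if text in TOY_VOCAB:
        return [TOY_VOCAB[text]]
    return [ord(ch) for ch in text]


usable = single_token_candidates(["*", "&", "sks", "<face>"], toy_encode)
```

With a real tokenizer, passing its encoding function as `encode` lets you scan a list of symbols or rare words and pick one that maps to a single id.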

rinongal / textual_inversion

Train Textual Inversion in alternative models of Stable Diffusion #153