kohya-ss / sd-scripts

Learn TI Embedding and LoRA both at the same time #635

Open aleksusklim opened 1 year ago

aleksusklim commented 1 year ago

Is it possible to train a LoRA together with an Embedding? Here are some thoughts that came to this, when training a LoRA for an object:

  1. Training the entire CLIP is wrong. It is best left frozen.
  2. Without learnable CLIP, we cannot change the meaning of words.
  3. With or without a learned CLIP, given a prompt like "a photo of sks in the forest" – why would LoRA learn sks but not also learn photo and forest?
  4. Generally, I do not want to learn anything except my token.
  5. You could say "just use TI then!", but Embeddings are weak at learning complex concepts.
  6. You could say "use regularization then!", but in this case there is no "class word" (and I don't want to introduce one); regularizing against "forest" and anything else I might have in the descriptions feels wrong.
  7. If it were possible to use a learnable embedding in place of a chosen token ("sks", possibly initialized with a class word), that would be more correct, because the object would be clearly stored inside this embedding and not in any other word (see the sketch at the end of this post).
  8. General LoRA training should help the embedding reach its target quicker. It's a compromise between training the entire CLIP and not training it at all.
  9. The learning rate for the embedding should be set separately from the learning rate for the U-Net (or for CLIP if needed), because the best speed is yet to be discovered.

What do you think? Otherwise, I'm not quite sure how to train a LoRA on something that is neither a character nor a style. For example, to train a LoRA for the "scar" concept: what descriptions should we choose? Should we say "sks over eye, 1boy, …"? If so, isn't it more logical to say directly "scar over eye, 1boy, …"? But then, how can we be sure that only the concept of "scar" would be changed, and not the concept of "1boy"?
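
A rough sketch of points 7–9, assuming a diffusers/transformers-style setup (this is not sd-scripts code; the "sks"/"scar" tokens and both learning rates are only example values): only the new token's embedding row receives gradients, the rest of CLIP stays frozen, and the U-Net LoRA gets its own learning rate.

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Add the placeholder token and initialize it from a class word.
tokenizer.add_tokens(["sks"])
text_encoder.resize_token_embeddings(len(tokenizer))
new_id = tokenizer.convert_tokens_to_ids("sks")
init_id = tokenizer("scar", add_special_tokens=False).input_ids[0]
embeddings = text_encoder.get_input_embeddings().weight
with torch.no_grad():
    embeddings[new_id] = embeddings[init_id].clone()

# Freeze all of CLIP, then let gradients flow only into the embedding matrix,
# and zero out every row except the new token's with a gradient hook.
text_encoder.requires_grad_(False)
embeddings.requires_grad_(True)
grad_mask = torch.zeros_like(embeddings)
grad_mask[new_id] = 1.0
embeddings.register_hook(lambda grad: grad * grad_mask)

# Separate learning rates for the embedding and the U-Net LoRA (point 9).
unet_lora_params = []  # placeholder: whatever LoRA parameters the trainer creates
optimizer = torch.optim.AdamW([
    {"params": [embeddings], "lr": 5e-3},      # embedding LR: just a guess
    {"params": unet_lora_params, "lr": 1e-4},  # typical LoRA LR
])
```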

aleksusklim commented 1 year ago

Related: https://huggingface.co/blog/dreambooth#epilogue-textual-inversion--dreambooth (last chapter)

AI-Casanova commented 1 year ago

At the very least it would be nice to add TI loading to train_network.py, such that a TI could be trained first and then a UNet LoRA trained afterwards.
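
The loading half could look roughly like this, assuming an A1111-style .pt embedding file (the function name and the multi-vector token naming are made up, and this is not the sd-scripts API):

```python
import torch

def load_embedding_into_text_encoder(path, token, tokenizer, text_encoder):
    """Register a previously trained TI so its trigger token maps to the learned
    vectors during LoRA training (A1111-style .pt layout assumed)."""
    data = torch.load(path, map_location="cpu")
    vectors = data["string_to_param"]["*"]  # shape: (num_vectors, embedding_dim)
    tokens = [token if i == 0 else f"{token}_{i}" for i in range(vectors.shape[0])]
    tokenizer.add_tokens(tokens)
    text_encoder.resize_token_embeddings(len(tokenizer))
    ids = tokenizer.convert_tokens_to_ids(tokens)
    with torch.no_grad():
        for tid, vec in zip(ids, vectors):
            text_encoder.get_input_embeddings().weight[tid] = vec
    # Captions then need the trigger word replaced with " ".join(tokens).
    return tokens
```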

spillerrec commented 10 months ago

HCP Diffusion supports this, but I have not yet been able to actually get it to work. I have seen others using it, however.

I have been thinking about this approach a lot as well, because I don't think the current method is that good. If you just train the text encoder, you can get decent results. If you train both the text encoder and the UNet, the results are better, but if you then disable the UNet part, the results are really poor. This indicates that the text encoder is not being fully taken advantage of.

I have two big motivations for looking for a better approach. First of all, I think better exploiting the existing capabilities of the base model will lead to better flexibility in the resulting LoRA (certain prompts, like a specific pose, that work fine without the LoRA can become unreliable or completely break with the LoRA). But what I really would like to see is better composability with other LoRAs and base models. With normal LoRA training, the entire text encoder is affected instead of just the trigger tag we are trying to add.

When I tried to test how much other tags in the text encoder were affected, I saw numbers around 20-40% compared to the main trigger tag. I haven't messed with dropout or anything like that, but for completely unrelated tokens to be so affected was quite surprising to me.

I also question the usefulness of actually trying to train the text encoder. Does it actually learn something about the interaction between the tokens? I didn't really see any indication of that in my testing. For some things like poses, the context could affect the learned tag; for example from_side, pov, from_above will affect the pose. But for many things I think a static TI is probably a good fit.

I also question how well the text encoder works in general. I mostly use anime models, which might be worse with tag relations (?), but I have run into several examples of tags which interfere with each other and do not work as intended. For example, hug from behind ends up being from behind + hug and shows the character's back. It fails to understand that this combination is a specific concept that is not just an addition of the underlying words.
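
A rough sketch of one way to quantify that kind of bleed (the names are assumptions, and this is not necessarily how the numbers above were measured): run each tag through the base and the LoRA-trained text encoder and compare its output change to the trigger tag's.

```python
import torch

def relative_drift(base_te, tuned_te, tokenizer, trigger, tags):
    """Change in text-encoder output per tag, relative to the trigger tag.
    tuned_te is assumed to be the text encoder with the trained LoRA applied."""
    def drift(word):
        ids = tokenizer(word, return_tensors="pt").input_ids
        with torch.no_grad():
            base = base_te(ids).last_hidden_state
            tuned = tuned_te(ids).last_hidden_state
        return (tuned - base).norm() / base.norm()
    reference = drift(trigger)
    return {tag: (drift(tag) / reference).item() for tag in tags}

# e.g. relative_drift(base_te, lora_te, tokenizer, "sks", ["forest", "1boy", "from_side"])
```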

In order to actually have "trigger words", I do think training the TI and the UNet together will be necessary, to create a link between the tag and the UNet doing something different. Pretraining the TI could still be useful, though, and it would be a nice first step. I train using anime screenshots as a base, and I wonder if you could train a base style TI to reduce the influence of the common style of the training images.

aleksusklim commented 9 months ago

This paper – https://omriavrahami.com/the-chosen-one/ – features training two textual inversion embeddings for SDXL along with a LoRA simultaneously:

We base our solution on a pre-trained Stable Diffusion XL (SDXL) [57] model, which utilizes two text encoders: CLIP [61] and OpenCLIP [34]. We perform textual inversion [20] to add a new pair of textual tokens τ, one for each of the two text encoders. However, we found that this parameter space is not expressive enough, as demonstrated in Section 4.3, hence we also update the model weights θ via a low-rank adaptation (LoRA) [33, 71] of the self- and cross-attention layers of the model.
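
In code, adding such a token pair to SDXL's two tokenizer/encoder pairs could look roughly like this (a sketch only, not the paper's implementation; the function and argument names are made up):

```python
import torch

def add_token_pair(token, init_word, tokenizers, text_encoders):
    """Register the same placeholder token in both SDXL tokenizer/encoder pairs,
    initialized from an existing word."""
    for tokenizer, text_encoder in zip(tokenizers, text_encoders):
        tokenizer.add_tokens([token])
        text_encoder.resize_token_embeddings(len(tokenizer))
        new_id = tokenizer.convert_tokens_to_ids(token)
        init_id = tokenizer(init_word, add_special_tokens=False).input_ids[0]
        emb = text_encoder.get_input_embeddings().weight
        with torch.no_grad():
            emb[new_id] = emb[init_id].clone()
```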

spillerrec commented 9 months ago

I'm reading up on how these models work and I still have only a very superficial understanding, but I noticed this section in the original LoRA paper:

E COMBINING LORA WITH PREFIX TUNING

LoRA can be naturally combined with existing prefix-based approaches. In this section, we evaluate two combinations of LoRA and variants of prefix-tuning on WikiSQL and MNLI. LoRA+PrefixEmbed (LoRA+PE) combines LoRA with prefix-embedding tuning, where we insert l_p + l_i special tokens whose embeddings are treated as trainable parameters. For more on prefix-embedding tuning, see Section 5.1. LoRA+PrefixLayer (LoRA+PL) combines LoRA with prefix-layer tuning. ...

In Table 15, we show the evaluation results of LoRA+PE and LoRA+PL on WikiSQL and MultiNLI. First of all, LoRA+PE significantly outperforms both LoRA and prefix-embedding tuning on WikiSQL, which indicates that LoRA is somewhat orthogonal to prefix-embedding tuning. On MultiNLI, the combination of LoRA+PE doesn't perform better than LoRA, possibly because LoRA on its own already achieves performance comparable to the human baseline. ...

https://arxiv.org/abs/2106.09685

Isn't this "prefix-embedding tuning" the same as textual inversion?

aleksusklim commented 9 months ago

Sigh:

https://civitai.com/articles/2494/making-better-loras-with-pivotal-tuning
https://github.com/AUTOMATIC1111/stable-diffusion-webui/pull/13568
https://github.com/IrisRainbowNeko/HCP-Diffusion/blob/main/doc/en/user_guides/train.md#prompt-template-usage-with-text_transforms

AI-Casanova commented 9 months ago

I'll clean up my code and PR it.

Doesn't train both at once, but loads TI into the LoRA trainer and works quite well.

feffy380 commented 7 months ago

I've been messing with Poiuytrezay1's PR and in my experience the TI overfits on style quite quickly, so you probably want to train them separately anyway. The quality difference between PTI and LoRA alone wasn't worth switching for, but the TI behaves as a trigger word without the need for dreambooth-style regularization images. I'm sure you'll get bleed if you train the unet long enough, but that takes longer than most single-concept LoRAs are trained for.

aleksusklim commented 7 months ago

I have an idea that I haven't had time to try.

  1. Overtrain the TI embedding
  2. Use my EmbeddingMerge to normalize it (divide it by its norm, or even slightly more; sketched below)
  3. Use it as a trigger word with LoRA training (assuming a pipeline that supports using embeddings during training)
  4. Set network_train_unet_only to freeze CLIP.

The learning rate at step 1 should be high; we don't care if the embedding breaks as-is. After step 2 the embedding will "stop working" but would still "mean something". Step 3 can be hacked by dumping the text latents to disk and patching them manually, adding the embedding vectors. Everything else in step 4 is as normal. The LoRA will work only with the embedding, obviously.
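
Step 2 outside of EmbeddingMerge would be roughly the following (a sketch assuming an A1111-style .pt embedding; dividing per vector is one interpretation of "divide by its norm", and extra > 1 covers the "or even slightly more" part):

```python
import torch

def normalize_embedding(in_path, out_path, extra=1.0):
    """Divide each vector of an A1111-style .pt embedding by its norm."""
    data = torch.load(in_path, map_location="cpu")
    vectors = data["string_to_param"]["*"]
    with torch.no_grad():
        norms = vectors.norm(dim=-1, keepdim=True)
        data["string_to_param"]["*"] = vectors / (norms * extra)  # extra > 1 shrinks further
    torch.save(data, out_path)
```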

feffy380 commented 7 months ago

I just used their other PR which ports cloneofsimo's code to normalize during training https://github.com/kohya-ss/sd-scripts/pull/993

A norm of 1 is probably already too high. IIRC the PTI authors found the embedding works best if it's at least somewhat close to real token embeddings. In this case that means initializing with an existing token (init_word) and keeping the norm close to 0.4.
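
As a per-step clamp, the "keep the norm close to 0.4" idea would be something like this sketch (not the code from that PR):

```python
import torch

def clamp_vector_norms(vectors: torch.Tensor, max_norm: float = 0.4) -> torch.Tensor:
    """Rescale any embedding vector whose norm exceeds max_norm back down to it."""
    norms = vectors.norm(dim=-1, keepdim=True)
    scale = (max_norm / norms).clamp(max=1.0)
    return vectors * scale
```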

aleksusklim commented 7 months ago

I thought normalization during training would compromise its speed. My idea is to overtrain the TI quickly and then start the LoRA!

feffy380 commented 7 months ago

Normalizing after training is not going to suddenly un-overfit it

aleksusklim commented 7 months ago

It will "disable" the embedding, as if it wasn't trained at all. I played with normalization of my trained TI with my EM, and the result looked like it wasn't trained at all.

Which is what I want to try for LoRA instead of training CLIP or using a trigger word.