aleksusklim opened 1 year ago
Related: https://huggingface.co/blog/dreambooth#epilogue-textual-inversion--dreambooth (last chapter)
At the very least it would be nice to add TI loading in `train_network.py`, so that a TI could be trained first and a UNet LoRA trained afterwards.
HCP Diffusion supports this, but I have not yet been able to actually get it to work. I have seen others using it, however.
I have been thinking about this approach a lot as well, because I don't think the current method is that good. If you just train the text encoder, you can get decent results. If you train both the text encoder and unet, the results are better, but if you try to disable the unet part of it, the results are really poor. This indicates that the text encoder is not fully taken advantage of.
I have two big motivations for looking for a better approach. First of all, I think better exploiting the existing capabilities of the base model will lead to better flexibility in the resulting LoRA (you can end up with certain prompts, like a specific pose, that work fine without the LoRA but become unreliable or completely break with the LoRA). What I really would like to see, though, is better composability with other LoRAs and base models. With normal LoRA training, the entire text encoder is affected instead of just the trigger tag we are trying to add.
When I tried to test how much other tags in the text encoder were affected, I saw numbers around 20-40% compared to the main trigger tag. I haven't messed with drop-out or anything like that, but for completely unrelated tokens to be so affected was quite surprising to me.
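One way to make that kind of measurement concrete is to compare each tag's text-encoder output before and after applying the LoRA and express its drift relative to the trigger tag's drift. This is only an illustrative sketch with toy vectors (the function name `relative_drift` and the data are mine, not from any of the scripts discussed here); the real measurement would encode each tag with the base and LoRA-patched text encoder:

```python
# Hypothetical sketch: quantify how much unrelated tokens drift when a
# text-encoder LoRA is applied, relative to the trigger token's drift.
# Rows = tokens, columns = hidden dimensions; toy values stand in for
# real text-encoder outputs.
import numpy as np

def relative_drift(base: np.ndarray, tuned: np.ndarray, trigger_idx: int) -> np.ndarray:
    """Per-token drift (1 - cosine similarity), scaled by the trigger token's drift."""
    cos = np.sum(base * tuned, axis=1) / (
        np.linalg.norm(base, axis=1) * np.linalg.norm(tuned, axis=1)
    )
    drift = 1.0 - cos
    return drift / drift[trigger_idx]

base = np.array([
    [1.0, 0.0, 0.0],   # trigger token
    [0.0, 1.0, 0.0],   # unrelated token A
    [0.0, 0.0, 1.0],   # unrelated token B
])
tuned = np.array([
    [1.0, 1.0, 0.0],   # trigger shifted strongly by the LoRA
    [0.0, 1.0, 0.2],   # unrelated token shifted slightly
    [0.0, 0.0, 1.0],   # untouched token
])
ratios = relative_drift(base, tuned, trigger_idx=0)
# ratios[0] == 1.0 by construction; the other entries are the 20-40%-style
# "bleed" numbers mentioned above.
```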
I also question the usefulness of actually trying to train the text encoder. Does it actually learn something about the interaction between the tokens? I didn't really see any indication of that in my testing. For some things like poses, the context could affect the learned tag; for example `from_side`, `pov`, and `from_above` will affect the pose. But for many things I think a static TI is probably a good fit.
I also question how well the text encoder works in general. I mostly use anime models, which might be worse with tag relations (?), but I have run into several examples of tags which interfere with each other and do not work as intended. For example, `hug from behind` ends up being `from behind` + `hug` and shows the character's back. It fails to understand that this combination is a specific concept, not just an addition of the underlying words.
In order to actually have "trigger words", I do think training the TI and UNet together will be necessary, to create a link between the tag and the UNet doing something different. But pretraining the TI could still be useful, and would be a nice first step. I train using anime screenshots as a base, and I wonder if you could potentially train a base style TI to reduce the influence of the common style of the training images.
This paper – https://omriavrahami.com/the-chosen-one/ – features training two textual inversion embeddings for SDXL along with a LoRA simultaneously:
We base our solution on a pre-trained Stable Diffusion XL (SDXL) [57] model, which utilizes two text encoders: CLIP [61] and OpenCLIP [34]. We perform textual inversion [20] to add a new pair of textual tokens τ, one for each of the two text encoders. However, we found that this parameter space is not expressive enough, as demonstrated in Section 4.3, hence we also update the model weights θ via a low-rank adaptation (LoRA) [33, 71] of the self- and cross-attention layers of the model.
I'm reading up on how these models work and still have only a very superficial understanding, but I noticed this section in the original LoRA paper:
E COMBINING LORA WITH PREFIX TUNING

LoRA can be naturally combined with existing prefix-based approaches. In this section, we evaluate two combinations of LoRA and variants of prefix-tuning on WikiSQL and MNLI. LoRA+PrefixEmbed (LoRA+PE) combines LoRA with prefix-embedding tuning, where we insert l_p + l_i special tokens whose embeddings are treated as trainable parameters. For more on prefix-embedding tuning, see Section 5.1. LoRA+PrefixLayer (LoRA+PL) combines LoRA with prefix-layer tuning. ... In Table 15, we show the evaluation results of LoRA+PE and LoRA+PL on WikiSQL and MultiNLI. First of all, LoRA+PE significantly outperforms both LoRA and prefix-embedding tuning on WikiSQL, which indicates that LoRA is somewhat orthogonal to prefix-embedding tuning. On MultiNLI, the combination of LoRA+PE doesn't perform better than LoRA, possibly because LoRA on its own already achieves performance comparable to the human baseline. ...
https://arxiv.org/abs/2106.09685
Isn't this "prefix-embedding tuning" the same as textual inversion?
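Structurally they do reduce to the same thing: a few newly inserted token rows of the embedding matrix are trainable while the rest of the model is frozen. A minimal sketch of that shared optimization target (toy numpy values, not either paper's actual code; `masked_embedding_step` is an illustrative name):

```python
# Sketch: both textual inversion and prefix-embedding tuning update only
# the embedding rows of newly inserted tokens, leaving the pretrained
# vocabulary frozen.
import numpy as np

def masked_embedding_step(emb: np.ndarray, grad: np.ndarray,
                          new_token_ids: list, lr: float) -> np.ndarray:
    """Apply a gradient step only to the newly added token rows."""
    mask = np.zeros((emb.shape[0], 1))
    mask[new_token_ids] = 1.0
    return emb - lr * mask * grad

vocab_size, dim = 10, 4
emb = np.ones((vocab_size, dim))       # frozen pretrained embeddings (toy values)
grad = np.full((vocab_size, dim), 0.5) # pretend gradient from some loss
updated = masked_embedding_step(emb, grad, new_token_ids=[8, 9], lr=0.1)
# Rows 0..7 are untouched; only the two inserted tokens moved.
```

The main difference is framing: prefix-embedding tuning inserts the trainable tokens as a fixed prefix of every prompt, while TI places the pseudo-token wherever the trigger word appears.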
I'll clean up my code and PR it.
It doesn't train both at once, but it loads a TI into the LoRA trainer and works quite well.
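At its core, "loading a TI into the LoRA trainer" amounts to appending the trained embedding vectors as new rows of the text encoder's token-embedding matrix and mapping a trigger string to the new token ids. A minimal sketch of that idea (the names `inject_embedding` and `vocab` are illustrative, not the PR's API):

```python
# Sketch: extend an embedding matrix with pretrained TI vectors and
# register pseudo-tokens for the trigger word. Multi-vector embeddings
# get one pseudo-token per vector.
import numpy as np

def inject_embedding(emb_matrix: np.ndarray, vectors: np.ndarray,
                     vocab: dict, trigger: str):
    """Return the extended embedding matrix and the trigger's new token ids."""
    start = emb_matrix.shape[0]
    ids = list(range(start, start + vectors.shape[0]))
    for i, tid in enumerate(ids):
        vocab[f"{trigger}_{i}"] = tid
    return np.vstack([emb_matrix, vectors]), ids

vocab = {"photo": 0, "forest": 1}
emb = np.zeros((2, 4))                # toy stand-in for the frozen embedding table
ti = np.ones((2, 4))                  # a 2-vector trained embedding
emb, ids = inject_embedding(emb, ti, vocab, "mychar")
```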
I've been messing with Poiuytrezay1's PR, and my experience is that the TI overfits on style quite quickly, so you probably want to train them separately anyway. The quality difference between PTI and LoRA alone wasn't worth switching for, but the TI behaves as a trigger word without the need for dreambooth-style regularization images. I'm sure you'll get bleed if you train the UNet long enough, but that takes longer than most single-concept LoRAs are trained for.
I have an idea that I didn't have time to try.
Use `network_train_unet_only` to freeze CLIP. The learning rate in step 1 should be high; we don't care if the embedding breaks as-is. After step 2 the embedding will "stop working" but will still "mean something". Step 3 can be hacked by dumping the text latents to disk and patching them manually, adding the embedding vectors. Everything else in step 4 proceeds as normal. The LoRA will work only with the embedding, obviously.
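The "patch the dumped text latents" hack could look roughly like this: take a cached conditioning tensor and add the trained embedding vectors at the trigger token's positions. Purely an illustrative sketch (the function name and shapes are mine); note that adding vectors to post-encoder latents is not identical to inserting tokens before encoding, which is exactly why this is a hack:

```python
# Sketch: patch a cached text conditioning (seq_len x dim) by adding the
# TI vectors at the trigger's token positions.
import numpy as np

def patch_latents(cond: np.ndarray, ti_vectors: np.ndarray, position: int) -> np.ndarray:
    """Add the TI vectors onto the cached conditioning, starting at `position`."""
    patched = cond.copy()
    n = ti_vectors.shape[0]
    patched[position:position + n] += ti_vectors
    return patched

cond = np.zeros((77, 8))              # toy stand-in for a cached CLIP conditioning
ti = np.full((2, 8), 0.5)             # a 2-vector embedding
patched = patch_latents(cond, ti, position=1)
```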
I just used their other PR, which ports cloneofsimo's code to normalize during training: https://github.com/kohya-ss/sd-scripts/pull/993
A norm of 1 is probably already too high. IIRC the PTI authors found the embedding works best if it stays at least somewhat close to real token embeddings. In this case that means initializing from an existing token (`init_word`) and keeping the norm close to 0.4.
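The norm heuristic itself is just a rescale to the target L2 norm. A tiny sketch (the `renorm` helper and the 2-d example vector are illustrative; 0.4 is the value from the comment above, not a universal constant):

```python
# Sketch: rescale an embedding vector to a target L2 norm, e.g. to keep a
# trained TI vector in the range of real token embeddings.
import numpy as np

def renorm(vec: np.ndarray, target_norm: float = 0.4) -> np.ndarray:
    """Rescale an embedding vector to the target L2 norm."""
    return vec * (target_norm / np.linalg.norm(vec))

init_word_vec = np.array([3.0, 4.0])  # stands in for a real token's embedding
v = renorm(init_word_vec)             # norm becomes 0.4
```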
I thought normalization during training would compromise its speed. My idea is to overtrain the TI quickly and then start the LoRA!
Normalizing after training is not going to suddenly un-overfit it.
It will "disable" the embedding, as if it wasn't trained at all. I played with normalizing my trained TI with my EM, and the result looked like it hadn't been trained at all.
Which is what I want to try for LoRA instead of training CLIP or using a trigger word.
Is it possible to train a LoRA together with an embedding? Here are some thoughts that led to this, when training a LoRA for an object: could it learn `sks` but not learn `photo` and `forest` along with it? What do you think? Otherwise, I'm not quite sure how to train a LoRA on something that is neither a character nor a style. For example, to train a LoRA for the "scar" concept: what descriptions should we choose? Should we say "sks over eye, 1boy, …"? If so, isn't it more logical to say directly "scar over eye, 1boy, …"? But if so, how can we be sure that only the concept of "scar" would be changed, and not the concept of "1boy"?