kohya-ss / sd-scripts


Feature request: separate captions for SDXL text encoders #781

Closed RockTheCosmos closed 7 months ago

RockTheCosmos commented 1 year ago

From what I have read about the 2 text encoders that SDXL uses, the G CLIP encoder is better at understanding natural human language/full sentences, whereas the L CLIP encoder isn't as advanced, so it's better to give it more simplified language, like tags separated by commas. From this, I deduce that it would be better to feed a WD14 caption file to the L CLIP encoder and a BLIP caption file to the G CLIP encoder.

Looking through the code, it looks like kohya-ss is currently just taking the caption from a single file and throwing that caption to both text encoders. I think it would be more effective to make it so the program can handle 2 caption files for each image, one intended for one text encoder and one intended for the other. Or you could just use a single caption file and provide a way to demarcate which part of the caption goes to one text encoder vs. the other.
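To make the single-file idea concrete, here is a minimal, hypothetical sketch of how one caption file could be demarcated into a caption for the G encoder and one for the L encoder. The "|||" delimiter and the read_split_caption helper are purely illustrative; nothing like this exists in sd-scripts.

from pathlib import Path

def read_split_caption(caption_path, delimiter="|||"):
    """Return (caption_for_G, caption_for_L) read from a single caption file.

    If no delimiter is present, the same caption is used for both encoders,
    which matches the current behaviour of the training scripts.
    """
    text = Path(caption_path).read_text(encoding="utf-8").strip()
    if delimiter in text:
        caption_g, caption_l = (part.strip() for part in text.split(delimiter, 1))
    else:
        caption_g = caption_l = text
    return caption_g, caption_l

# Example file content: "a woman sitting on a throne in a castle ||| photographic"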

It's often recommended not to train the text encoders at all, because the interaction between the two encoders makes the results unpredictable. I think what I suggested would go a long way toward making the results more predictable.

Basically, I would just like a way to feed the program different captions for each image that are intended for one text encoder vs. the other.

mio-nyan commented 10 months ago

I have implemented a local test version to see if it makes a difference. Short Answer: YES!

After some research, providing the G CLIP encoder captions about the content of the image and the L CLIP encoder only a single word for a style (e.g. "photographic" or "painted") works "best" (it has the most predictable outcome).

I tried feeding WD14 captions to the L encoder and BLIP captions to the G encoder, but the results were way worse than a single "style" word for the L encoder and the rest for the G.

To verify it during training, I also modified the sample image generation (separate prompts for G and L).

I will try to make representative demo images...

dill-shower commented 10 months ago

All users are looking forward to examples and a pull request. Doesn't the G encoder only understand natural language (e.g. "Remilia Scarlett sits on a throne in the castle") rather than separate tags? I read in an article that it is more effective to use L for individual tags and G only for natural-language captions.

mio-nyan commented 10 months ago

Yes, I tried training with natural language for G (like your example: "Remilia Scarlett sits on a throne in the castle") and a single style tag for L ("photographic"), and it works really well when using them the same way in the prompt afterwards. The code is all messy right now, but I'll at least try to make some sample images to show the difference.

bluvoll commented 10 months ago

> I have implemented a local test version to see if it makes a difference. Short Answer: YES!
>
> After some research, providing the G CLIP encoder captions about the content of the image and the L CLIP encoder only a single word for a style (e.g. "photographic" or "painted") works "best" (it has the most predictable outcome).
>
> I tried feeding WD14 captions to the L encoder and BLIP captions to the G encoder, but the results were way worse than a single "style" word for the L encoder and the rest for the G.
>
> To verify it during training, I also modified the sample image generation (separate prompts for G and L).
>
> I will try to make representative demo images...

Honest question: how did you manage to feed different captions to each TE? And since I'm pretty dumb, which TE is which in kohya? Is TE1 G and TE2 L, or vice versa? You might just become the hero for all of us folks wanting to feed different captions to each TE!

mio-nyan commented 10 months ago

I'm just writing a small summary about the splitting, but for now:

[EDIT] From debugging kohya & comfy:
text_encoder1: CLIPTextModel, 'openai/clip-vit-large-patch14' (L; CLIP_L in Comfy)
text_encoder2: CLIPTextModelWithProjection, 'laion/CLIP-ViT-bigG-14-laion2B-39B-b160k' (G; CLIP_G in Comfy)
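For reference, a hedged sketch of that mapping using the Hugging Face classes. sd-scripts builds both encoders from a single SDXL checkpoint rather than downloading them like this, so the from_pretrained calls below are only an example of which class and model each slot corresponds to.

from transformers import CLIPTextModel, CLIPTextModelWithProjection

repo = "stabilityai/stable-diffusion-xl-base-1.0"

# text_encoder1 in sd-scripts -> OpenAI CLIP ViT-L/14 -> CLIP_L in ComfyUI
text_encoder1 = CLIPTextModel.from_pretrained(repo, subfolder="text_encoder")

# text_encoder2 in sd-scripts -> OpenCLIP ViT-bigG/14 -> CLIP_G in ComfyUI
text_encoder2 = CLIPTextModelWithProjection.from_pretrained(repo, subfolder="text_encoder_2")

print(text_encoder1.config.hidden_size)  # 768
print(text_encoder2.config.hidden_size)  # 1280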

bluvoll commented 10 months ago

> I'm just writing a small summary about the splitting, but for now:
>
> [EDIT] From debugging kohya & comfy:
> text_encoder1: CLIPTextModel, 'openai/clip-vit-large-patch14' (L; CLIP_L in Comfy)
> text_encoder2: CLIPTextModelWithProjection, 'laion/CLIP-ViT-bigG-14-laion2B-39B-b160k' (G; CLIP_G in Comfy)

Thank you very much!

suede299 commented 10 months ago

Would love to participate in this test, and it would be great to have a usable code update.

mio-nyan commented 10 months ago

I hope to have something by the end of next week. It's a hell of a debugging job and there are so many open questions involved.

mio-nyan commented 10 months ago

Ok guys, as promised!

First here are some sample images / studies in order to demonstrate that the separation of captions is useful. I think the pictures speak for themselves.

[Attached images: TrainingImages, Training Sample]

Since the code is far from ready for a pull request (breaking changes, and I don't know how many side effects my changes will have), you can check out the fork on my dev branch: https://github.com/mio-nyan/sd-scripts/tree/dev

This is the version I used to create the samples above. Hope the readme helps. I have to sleep now xD

@kohya-ss you did an amazing job! I can see how much work you have put into it.

mio-nyan commented 10 months ago

Note regarding the training samples: these were only 500 steps, for demo purposes.

suede299 commented 10 months ago

It looks like the file structure is all different; I wonder how long it will take to merge it back here.

mio-nyan commented 10 months ago

To be honest, I don't know exactly how to integrate the whole thing properly. Since the training script already processes captions and tags separately, I kept them separate until input_ids creation instead of merging them earlier. This alone wouldn't be that much of a change.

It's more about the concept. At the moment there are things like tag shuffling / caption dropping / ... which add additional randomness to the whole training process. Moreover, caption handling is mixed throughout the code for 1.5, 2.x and SDXL, so it would make sense to split the SDXL path from the rest. It also affects LoRA / DreamBooth...

For the 2 extra files: I just copied merge_captions_to_metadata.py and merge_dd_tags_to_metadata.py and changed them to create a metadata file with captionG and captionL, which is easier to understand than caption and tags.
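For illustration, the resulting metadata might look something like this (the captionG/captionL key names come from the comment above; the exact structure in the fork may differ):

import json

metadata = {
    "train/0001.png": {
        "captionG": "a woman sitting on a throne in a castle, upper body, front view",
        "captionL": "photographic",
    },
    "train/0002.png": {
        "captionG": "an old man walking through a snowy forest",
        "captionL": "painted",
    },
}

with open("meta_split_captions.json", "w", encoding="utf-8") as f:
    json.dump(metadata, f, indent=2, ensure_ascii=False)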

mio-nyan commented 9 months ago

I'm so stupid. Sorry guys!

I have separated the captions up to the input IDs, but in the end the embeddings are concatenated again. Why did I overlook this line?

text_embedding = torch.cat([encoder_hidden_states1, encoder_hidden_states2], dim=2).to(weight_dtype)

And I made a mistake at:

input_ids1 = self.get_input_ids(caption, self.tokenizers[0])
input_ids2 = self.get_input_ids(caption, self.tokenizers[1])

I assumed it was wrong to use the same text for both encoders, but in the end each encoder outputs a different embedding anyway.

Either way, they are treated together in the UNet.
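A simplified sketch of the flow described above (not the exact sd-scripts code): even when each encoder receives its own caption, the two hidden states are concatenated along the channel dimension, so the UNet only ever sees one combined conditioning tensor.

import torch

def build_text_embedding(caption_g, caption_l,
                         tokenizer1, tokenizer2,
                         text_encoder1, text_encoder2,
                         device, weight_dtype=torch.float16):
    def encode(caption, tokenizer, text_encoder):
        tokens = tokenizer(caption, padding="max_length",
                           max_length=tokenizer.model_max_length,
                           truncation=True, return_tensors="pt")
        out = text_encoder(tokens.input_ids.to(device), output_hidden_states=True)
        # SDXL conditioning uses the penultimate hidden layer of each encoder
        return out.hidden_states[-2]

    hidden1 = encode(caption_l, tokenizer1, text_encoder1)  # [B, 77, 768]  from CLIP-L
    hidden2 = encode(caption_g, tokenizer2, text_encoder2)  # [B, 77, 1280] from CLIP-G

    # The line quoted above: both "interpretations" are appended per token,
    # giving a [B, 77, 2048] tensor, which is the "one caption" the UNet works with.
    return torch.cat([hidden1, hidden2], dim=2).to(weight_dtype)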

mio-nyan commented 9 months ago

I think this issue can be closed.

RockTheCosmos commented 9 months ago

I'm afraid I don't follow. Are you saying that whatever individual captions you put for each text encoder, the result of the training is going to be the same as if you concatenated both captions together at the start and fed that to both encoders? Then what about the results you talked about earlier where the separate captions improved the quality of the output images? I'm confused.

mio-nyan commented 9 months ago

Trust me, I'm confused as well, but..

@kohya-ss pls correct me if I'm wrong:

There are 2 text encoders. At the moment, the caption for an image is fed to both text encoder 1 and text encoder 2. Although it is the same caption, different embeddings are produced because the text encoders are different. These are concatenated before they are passed to the UNet. The UNet itself only works with "one caption".

This means that 2 "interpretations" of the caption are simply appended together and passed to the UNet. Of course it makes a difference if you give the text encoders different captions, because different embeddings come out (but different embeddings come out anyway).

In my experiment it actually made a difference, maybe because you can cater more to the peculiarities of each text encoder. But it would take a lot more trials to confirm that it actually works better, and even more to find a reason why.

But in the end SDXL was trained on the combination of both captions (at least that's what I heard, and it makes sense somehow). To summarize: I think "SDXL uses 2 text encoders" is a bit of a misinterpretation. It is rather "SDXL uses the combined input of 2 text encoders".

mio-nyan commented 9 months ago

I have run a few training attempts where at least the embeddings were made from different captions. For some reason this worked well, but I have no idea why.

kohya-ss commented 9 months ago

@mio-nyan I think your understanding is correct.

zjysteven commented 8 months ago

@mio-nyan Do you mind further sharing the details of your last few attempts of feeding different captions to the 2 text encoders?

mio-nyan commented 8 months ago

Sure!

To summarize so far, I compared the following approaches:

Version 1:
(G) woman, upper body, front view
(L) photographic

Version 2 (default):
(G) woman, upper body, front view, photographic
(L) woman, upper body, front view, photographic

In other words, version 1 uses the (L) encoder only for a style.

To see if it really improves anything, I did longer training runs, with the following results:

For version 1 ((L) style tag only), the first couple of steps/epochs look promising, BUT over a longer training run the output starts lacking saturation and "volume" (I don't know how to describe it better). So it seems proper descriptions for both encoders really enhance the training. But since captioning itself is something of a science, I wouldn't focus too much on what each text encoder understands best; instead, I'd try to find captions that work for both at the same time. And don't forget SDXL seems to have been trained by feeding both encoders the same caption (which produces different embeddings anyway).

Actually, I don't have a lot to summarize, because I went back to the default variant of feeding the same captions to both text encoders. The reason: there are several other aspects that have way more impact than using different captions to create the embeddings, like the captions themselves (manual captioning instead of auto-captioning), picking the right images for a dataset, maybe custom cropping to preserve quality, and so on.

zjysteven commented 8 months ago

Thank you for sharing these; it's quite interesting. Agreed that these other aspects probably matter as much or more, but feeding different captions to better "trigger" the text encoders definitely makes sense too.

DarkAlchy commented 8 months ago

@mio-nyan A programming question. I have my G and L encoded tokens, but I am having a hard time combining them into the embedding for inference. L (TE1) is the classic [1, 77, 768], while G (TE2) comes out as [1, 1280]. Is that even right? How do I encode token1 and token2 to make the embedding? In pre-XL models it was simple.

mio-nyan commented 8 months ago

Can you explain in more detail? If you follow the existing code flow, there is already support for captions & tags. If you look at class ImageInfo, I started passing caption_l & caption_g instead of only caption, and I keep them separated through the whole process (just search for where captions are processed).

PS: I did that only for finetuning
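Illustrative only (this is not the actual code of the fork): based on the description above, the idea is roughly to carry two captions per image through the dataset code instead of one, e.g.:

from dataclasses import dataclass

@dataclass
class ImageInfoSplit:   # sd-scripts' real class is ImageInfo in train_util.py
    image_key: str
    absolute_path: str
    caption_g: str      # content caption for text_encoder2 (CLIP-G)
    caption_l: str      # short style caption for text_encoder1 (CLIP-L)
    num_repeats: int = 1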

DarkAlchy commented 8 months ago

It was a simple request, since it appears SDXL is being held hostage by gatekeepers. For 1.4 through 2.1 I could find the information I asked for very easily. You know precisely what I am asking and have the knowledge I needed, so I asked. Text prompt: the text is encoded to tokens, then to text embeddings, which is what the diffuser/pipe needs to make an image.

mio-nyan commented 8 months ago

I can't really follow. If you're talking about the diffusion pipeline, the documentation is here: https://huggingface.co/docs/diffusers/using-diffusers/sdxl

My attempts are here (I went through kohya's scripts and tried my best, to the best of my understanding, with all the stuff mentioned above): https://github.com/kohya-ss/sd-scripts/commit/517052d3ff1662d63026a0f5b416dc9e2579d9b3

I found a somewhat detailed analysis (besides the official papers) here: https://zhuanlan.zhihu.com/p/643420260

Have a look at train_util.py and sdxl_train.py; there you will find the processing of input IDs and hidden states, and how the embeddings are obtained...
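For quick experiments at inference time, the diffusers pipeline in the documentation linked above already accepts a separate prompt for each text encoder: prompt goes to CLIP-L (text_encoder) and prompt_2 to CLIP-G (text_encoder_2). A minimal sketch, assuming the stock SDXL base checkpoint:

import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    prompt="photographic",                               # goes to CLIP-L (text_encoder)
    prompt_2="a woman sitting on a throne in a castle",  # goes to CLIP-G (text_encoder_2)
    num_inference_steps=30,
).images[0]
image.save("split_prompt_test.png")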

DarkAlchy commented 8 months ago

What is different for XL compared to the rest, besides just TE2? I can use the same routines for TE1 that I did for previous versions and they all work, but when I use the same routine with TE2 I get the right tensor shape and the right tokens, yet it all blows up. I suspect it blows up because of my concat:

C = torch.cat((B1,B2),dim=-1).to(TORCH_DEVICE, weight_dtype)

B1 is the tokens from tokenizer 1 that have been made into embeddings, and B2 is the tokens from tokenizer 2 made into embeddings. I am using the same text prompt for both.

B1/B2 = text_encoder(tokens.input_ids.to(DEVICE)).last_hidden_state

I know the text_encoder works, because I can switch to 1.5 (for instance) and the same line of code works, which makes me think C is the issue, as it becomes non-iterable.

Edit: I tried C = torch.cat([B1,B2], dim=-1).to(TORCH_DEVICE, weight_dtype) and got the same error message, "TypeError: argument of type 'NoneType' is not iterable", which, when you trace it back, points to C being the issue.

DarkAlchy commented 8 months ago

I finally got something to work, but I had to hack the UNet because of SDXL and the pipe being a pain. It turned on some text_embeds stuff that forced me to do something without knowing what to do; I fixed it with a setting after loading. What I have not figured out is why my new routines are not generating anything remotely like what I say. The prompt is "An old man" (for instance) and what I get is so abstract that it is nothing like what the prompt says. I changed the prompt and got the same thing, so this isn't working out; they made SDXL way harder than it ever was before or needed to be.

mio-nyan commented 8 months ago

Do you have a repo? It's quite hard to follow you without seeing any code.

DarkAlchy commented 8 months ago

No, because I can't get this to work. The code below works 100% for everything but XL, but simply doing TE2 the same way does not work in XL.

prompt = "An old man"
def text_enc(prompt, tokenizer, maxlen=None):
    '''
    A function to take a texual prompt and convert it into tokens
    '''
    if maxlen is None: maxlen = tokenizer.model_max_length
    return tokenizer(prompt, padding="max_length",  max_length=maxlen, truncation=True, return_tensors="pt")
def text_emb(text,  text_encoder, DEVICE, dtype):
    '''
    A function to take tokens and convert it into an embedding
    '''
    return text_encoder(text.input_ids.to(DEVICE))[0].to(dtype)

targ = text_enc(prompt, tokenizer)
targ_emb = text_emb(targ, text_encoder, TORCH_DEVICE, weight_dtype)
neg_prompt = "black and white, blurry, malformed, splitscreen, text, watermark, text, signature, open mouth,teeth, malformed text"
if neg_prompt: uncon_denoised = text_enc(neg_prompt, tokenizer, maxlen=targ.input_ids.shape[1])
else: uncon_denoised = text_enc(neg_prompt * batch_size, tokenizer, maxlen=targ.input_ids.shape[1])
uncon_denoised_emb = text_emb(uncon_denoised, text_encoder, TORCH_DEVICE, weight_dtype)
FINAL = torch.cat([uncon_denoised_emb, targ_emb]).repeat_interleave(n_imgs, dim=0)

There is your code example; it works for 1.2/1.4/1.5/2.0/2.1, but XL requires something more beyond just doing the same thing with TE2 and then torch.cat(TE1, TE2). I couldn't find any real info about this for XL, and XL is six months old now.
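Not a fix for the exact code above, but a hedged sketch of what the SDXL UNet expects beyond a plain concat of the two encoders' outputs, mirroring how the diffusers SDXL pipeline builds its conditioning (variable names are illustrative). One likely cause of the "NoneType is not iterable" error mentioned earlier is calling the SDXL UNet without added_cond_kwargs.

import torch

def encode_sdxl_prompt(prompt, tokenizer1, text_encoder1,
                       tokenizer2, text_encoder2, device, dtype):
    def tokenize(tokenizer):
        return tokenizer(prompt, padding="max_length",
                         max_length=tokenizer.model_max_length,
                         truncation=True, return_tensors="pt").input_ids.to(device)

    out1 = text_encoder1(tokenize(tokenizer1), output_hidden_states=True)
    out2 = text_encoder2(tokenize(tokenizer2), output_hidden_states=True)

    # Per-token embeddings: penultimate hidden layers, concatenated on the last dim,
    # [1, 77, 768] + [1, 77, 1280] -> [1, 77, 2048]
    prompt_embeds = torch.cat(
        [out1.hidden_states[-2], out2.hidden_states[-2]], dim=-1
    ).to(dtype)

    # Pooled embedding from the second (projected) encoder, shape [1, 1280]
    pooled = out2.text_embeds.to(dtype)
    return prompt_embeds, pooled

# The UNet then needs the pooled embedding and the size/crop "time ids" as extra
# conditioning, roughly:
#   unet(latents, t, encoder_hidden_states=prompt_embeds,
#        added_cond_kwargs={"text_embeds": pooled, "time_ids": add_time_ids})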

DarkAlchy commented 7 months ago

Kohya is falling further and further behind the curve, and it will be relegated to history if it isn't careful.