kohya-ss / sd-scripts


Flux textual_inversion #1588

Open aiXander opened 1 month ago

aiXander commented 1 month ago

Is anybody already working on a script for Flux TI, or does anyone want to start working on one (I'm down to jump in!)?

Some thoughts:

Another research project I've been thinking about:

Littleor commented 1 month ago

This may work: https://github.com/Littleor/textual-inversion-script, but it requires a lot of VRAM. I'm in the process of implementing a TI training script that runs within 24 GB of VRAM.

recris commented 1 month ago

This would be a really nice feature to have. Currently, using multiple LoRAs in the same image causes some visible output degradation (screen door effect), unless we reduce the strength of the LoRAs.

However, some concepts for which we currently have to use LoRAs could easily be trained into embeddings instead, which shouldn't cause such degradation since the DiT weights wouldn't be touched. A well-trained embedding could potentially work even better than in SD models, given the powerful T5 encoder.
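A minimal sketch of that idea (hypothetical names, not the sd-scripts API): freeze everything, then mask gradients so only the placeholder-token rows of the embedding matrix are ever updated.

```python
# Minimal sketch of the textual-inversion idea: the DiT and text-encoder
# weights stay frozen, and only the embedding rows of the new placeholder
# tokens receive gradient updates. The names `text_encoder` and
# `placeholder_token_ids` are illustrative assumptions.
import torch


def train_only_placeholder_embeddings(text_encoder, placeholder_token_ids):
    # Freeze the whole encoder first.
    text_encoder.requires_grad_(False)

    # Re-enable gradients only on the input embedding matrix.
    embedding = text_encoder.get_input_embeddings()
    embedding.weight.requires_grad_(True)

    # Boolean mask selecting just the placeholder rows.
    keep = torch.zeros(embedding.weight.shape[0], dtype=torch.bool)
    keep[placeholder_token_ids] = True

    # Zero out gradients for every other vocabulary row, so the rest of
    # the vocabulary (and the frozen DiT) is never modified.
    embedding.weight.register_hook(
        lambda grad: grad * keep.to(grad.device).unsqueeze(-1)
    )
    return [embedding.weight]  # the only parameters to hand to the optimizer
```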

@kohya-ss Are there any plans to support textual inversion in the codebase? Yesterday I started an attempt to adapt the SDXL version, but it's been a bit of a struggle since Flux requires some significant changes. I would gladly provide assistance on this effort where I can.

Littleor commented 1 month ago

I have just implemented FLUX.1 dev Textual Inversion within 20 GB of VRAM. After completing training and testing, I will open-source the code, which may be helpful.

kohya-ss commented 1 month ago

> @kohya-ss Are there any plans to support textual inversion in the codebase? Yesterday I started an attempt to adapt the SDXL version, but it's been a bit of a struggle since Flux requires some significant changes. I would gladly provide assistance on this effort where I can.

I think TI training doesn't work on SDXL right now because I did a big refactoring on the sd3 branch. I will make TI training work on SDXL first, so please wait a while.

Littleor commented 1 month ago

> I have just implemented FLUX.1 dev Textual Inversion within 20 GB of VRAM. After completing training and testing, I will open-source the code, which may be helpful.

@kohya-ss I have now implemented Textual Inversion training for the FLUX.1 dev model on a 24 GB VRAM GPU, which may be of some help for implementing it in this codebase: https://github.com/Littleor/textual-inversion-script?tab=readme-ov-file#low-vram-usage.
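For reference, the usual low-VRAM knobs in this kind of TI setup look roughly like the sketch below; the variable names (`flux_transformer`, `vae`, `embedding_params`) and the learning rate are illustrative assumptions and not the API of that repository.

```python
# Rough sketch of common memory-saving choices for Flux TI: frozen DiT and
# VAE in bf16, gradient checkpointing on the transformer, and an optimizer
# that only sees the new embedding rows. All names here are assumptions.
import torch


def configure_low_vram(flux_transformer, vae, embedding_params):
    # The DiT and VAE only run forward passes; no gradients are stored for them.
    flux_transformer.requires_grad_(False)
    vae.requires_grad_(False)

    # Keep the big frozen modules in bf16 to roughly halve their footprint.
    flux_transformer.to(dtype=torch.bfloat16)
    vae.to(dtype=torch.bfloat16)

    # Trade compute for memory on the transformer's activations
    # (diffusers-style models expose this method; adjust to the actual class).
    flux_transformer.enable_gradient_checkpointing()

    # Only the placeholder embedding rows are optimized; lr is illustrative.
    return torch.optim.AdamW(embedding_params, lr=5e-4)
```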

What's more, TI training for SDXL already works in this script: https://github.com/huggingface/diffusers/blob/main/examples/textual_inversion/textual_inversion_sdxl.py. I hope this can be helpful.

aiXander commented 1 month ago

> I have just implemented FLUX.1 dev Textual Inversion within 20 GB of VRAM. After completing training and testing, I will open-source the code, which may be helpful.

Awesome, is this doing TI on both T5 and CLIP?

Littleor commented 1 month ago

> I have just implemented FLUX.1 dev Textual Inversion within 20 GB of VRAM. After completing training and testing, I will open-source the code, which may be helpful.
>
> Awesome, is this doing TI on both T5 and CLIP?

Yes, this trains on both T5 and CLIP.
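For context, registering the placeholder token in a tokenizer/encoder pair with Hugging Face tokenizers typically looks like the sketch below; the placeholder string and initializer word are illustrative, not values from the linked repo. The same helper would be called once for the CLIP pair and once for the T5 pair, since Flux conditions on both.

```python
# Sketch of adding a placeholder token to a tokenizer/encoder pair and
# seeding its embedding from an existing word. The token string and the
# initializer word below are illustrative choices.
import torch


def add_placeholder_token(tokenizer, text_encoder,
                          placeholder="<my-concept>", initializer="person"):
    added = tokenizer.add_tokens(placeholder)
    assert added == 1, "placeholder token already exists in the vocabulary"

    # Grow the embedding matrix to make room for the new token.
    text_encoder.resize_token_embeddings(len(tokenizer))
    placeholder_id = tokenizer.convert_tokens_to_ids(placeholder)

    # Initialize the new row from the first sub-token of the initializer word.
    initializer_ids = tokenizer.encode(initializer, add_special_tokens=False)
    with torch.no_grad():
        weights = text_encoder.get_input_embeddings().weight
        weights[placeholder_id] = weights[initializer_ids[0]].clone()
    return placeholder_id
```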

sipie800 commented 1 month ago

Good thought! We might also discuss zero-shot / few-shot / many-shot approaches for T2I.

IMO, these approaches can be characterized by how many shots they need and by their granularity.

IP-Adapter is the zero-shot approach with coarse granularity. In particular, FaceID IP-Adapter needs much higher granularity because faces are a highly granular task; FaceID performs worse than InstantID or PuLID, which deliver higher granularity.

LoRA is the few-shot approach with finer granularity. The higher the rank, the better the details, but worse generalization can become a problem if the shot count is too low. Following the scaling law, you need a high-rank LoRA together with many-shot images and curated captions to lift your results.

Full fine-tuning is the strongest method in theory but not in practice. It provides the highest-granularity control, but few people have the high-quality, large-scale data (and GPUs) it requires. We see too many burned/washed-out fine-tuned T2I checkpoints on Civitai.

ControlNet (unlike IP-Adapter) is somewhat of a side path. It fixes the input modality, so the scaling law is cut down and training takes much less effort than for a base model. But it actually has coarser granularity than LoRA or fine-tuning; you just don't want to build a "hand-fixer ControlNet", etc.

IMO we may just want to sort out the conditioning signals (text, vision embeddings, vision modality maps, hybrid information in unknown custom data...) to build a better and more universal paradigm for the next generation of T2I. The Flux ecosystem is a good start, but we will certainly need low-shot methods going forward. IMO LoRA is the best of these; people need to steer T2I tasks in their private domain without training a new model.

I believe Flux is more suitable for textual inversion than any other model for subject-identity tasks. The capability emerges.

aiXander commented 1 month ago

@Littleor I've tested your TI training repo but haven't had any success (it won't learn my concept at all). Is it possible there are bugs left in the implementation, or did it work on your end?

LilyDaytoy commented 2 weeks ago

@Littleor Hi, thanks for implementing this! But I cannot find the code now, could you please share it? 😭 And may I ask how long training took for textual inversion? Mine got stuck and is extremely slow.