kohya-ss / sd-scripts


Support FLUX series models #1445

Open · ddpasa opened this issue 3 months ago

ddpasa commented 3 months ago

These models have just been released and appear to be amazing. Links below:

Blog from fal.ai: https://blog.fal.ai/flux-the-largest-open-sourced-text2img-model-now-available-on-fal/

Huggingface: https://huggingface.co/black-forest-labs

There is a schnell version and a dev version.

ssube commented 3 months ago

@D3voz the high memory usage with some LoRAs has been reported in the ComfyUI repo: https://github.com/comfyanonymous/ComfyUI/issues/4343. The slow inference is specific to Windows and the way it uses shared GPU memory; on Linux it simply runs out of memory (or, since the recent updates, partially loads the LoRA with undefined results).

DarkAlchy commented 3 months ago

Long thread, and I am late to the party. I want no part of SD3 or SAI, but do we now have other options to train on that are as good as Flux and truly open? I, along with a lot of lawyers, don't like their license, but we're all waiting for clarification from them (if it comes). I have a 4090, and training at BS1 will be the one thing that causes me to toss in the towel on all this. A LoRA taking 90 minutes to train, without even the CLIP (which I drastically need for what I do), is far too long. If this is the future, then I'd rather rot in antiquity.

ddpasa commented 3 months ago

> Long thread, and I am late to the party. I want no part of SD3 or SAI, but do we now have other options to train on that are as good as Flux and truly open? […]

SD3 is a terrible model with a very complicated legal mess of a license. Given how much better Flux.1 is, I see no reason to waste any of my time on SD3. Stability AI is a huge mess right now, and it's really good that new models are coming onto the scene.

DarkAlchy commented 3 months ago

> SD3 is a terrible model with a very complicated legal mess of a license. Given how much better Flux.1 is, I see no reason to waste any of my time on SD3. […]

I agree. I thought there was something else, as I am not going back to SAI if I can help it.

bghira commented 3 months ago

flux and sd3 have the same license, but both are unenforceable anyway. just have fun and do less drama.

DarkAlchy commented 3 months ago

You know, there are adults who make a living and don't wish to spend it all on lawyers and court costs defending themselves. It isn't about just popping out some waifu/husbando.

bghira commented 3 months ago

what does it have to do with this thread? act like an adult if you are one. some of us are researchers who don't care about how open a model is and can make a living regardless.

cosmicoxytocin commented 3 months ago

> You know, there are adults who make a living and don't wish to spend it all on lawyers and court costs defending themselves. It isn't about just popping out some waifu/husbando.

I can assure you, researchers in the field of image-synthesis (outside Medical) are not moral busybodies opposed to waifus and husbandos. Regardless, your opinion is irrelevant to the discussion. Go write a tumblr blog about it, if you must.

bash-j commented 3 months ago
Traceback (most recent call last):
  File "/home/mikey/kohya_ss/sd-scripts/finetune/prepare_buckets_latents.py", line 286, in <module>
    main(args)
  File "/home/mikey/kohya_ss/sd-scripts/finetune/prepare_buckets_latents.py", line 89, in main
    vae = model_util.load_vae(args.model_name_or_path, weight_dtype)
  File "/home/mikey/kohya_ss/sd-scripts/library/model_util.py", line 1304, in load_vae
    converted_vae_checkpoint = convert_ldm_vae_checkpoint(vae_sd, vae_config)
  File "/home/mikey/kohya_ss/sd-scripts/library/model_util.py", line 429, in convert_ldm_vae_checkpoint
    new_checkpoint["quant_conv.weight"] = vae_state_dict["quant_conv.weight"]
KeyError: 'quant_conv.weight'

What is this quant_conv.weight it is trying to find in the VAE? These are keys from the SDXL VAE, no? Not FLUX. I can't see it in the file. It also looks like it's ignoring the path I provided to the VAE file and is trying to load it from the FLUX model file instead.
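For context: the LDM-to-diffusers conversion in model_util.convert_ldm_vae_checkpoint expects SD/SDXL-style keys such as quant_conv.weight, which FLUX's autoencoder does not have. A minimal way to check what a checkpoint actually contains (the file name here is a placeholder):

```python
# Inspect a VAE/AE checkpoint's keys. FLUX's autoencoder lacks the
# quant_conv.* entries the SD/SDXL conversion path looks up, hence the
# KeyError above. "ae.safetensors" is a placeholder path.
from safetensors.torch import load_file

state_dict = load_file("ae.safetensors")
print("quant_conv.weight" in state_dict)        # False for the FLUX AE
print([k for k in state_dict if "quant" in k])  # typically empty for FLUX
```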

popovidis commented 3 months ago

Flux would be awesome

kohya-ss commented 3 months ago

> What is this quant_conv.weight it is trying to find in the VAE? These are keys from the SDXL VAE, no? Not FLUX. […]

Sorry, prepare_buckets_latents.py doesn't support FLUX yet.

bghira commented 3 months ago

4 hours for 1600 steps is really really slow. you can rent a $0.20/hr 4090 and train 1600 steps in one hour, for less than $1. it probably costs you more per kWh than it would to rent a cloud GPU:

Epoch 5/8, Steps:  99%|██████████▊| 9946/10000 [11:43:08<02:46,  3.09s/it, lr=6e-5, step_loss=0.384]

training on 10x 3090s for $2.20/hr. total = $25

thanks runpod.
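As a back-of-envelope check on that total, using the elapsed time from the progress bar above:

```python
# ~11h43m elapsed at $2.20/hr for the 10x 3090 pod.
hours = 11 + 43 / 60
print(f"${hours * 2.20:.2f}")  # ≈ $25.78, i.e. the quoted ~$25
```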

hablaba commented 3 months ago

I tried to see if I can train a LoRA with Prodigy and it appears to work… weirdly, it's only using ~17GB of VRAM, so not much more than when I was using AdamW8bit. That seems… wrong. Is there anything I'm missing on why Prodigy would "work" but potentially not be doing what I expect?

hablaba commented 3 months ago

Well, I can confirm training with Prodigy works great. I'm just using the same settings I'd use in SDXL: d_coef 2, betas 0.9,0.999, weight_decay 0.01, no warmup steps. I was able to get a great LoRA of my dog at 1000 steps in about 35 mins on a 4090 (~2.2s/it). All other settings were the same as the default recommendation in the SD3 readme (except LR is 1, of course). I also used dim 16 and alpha 16, matching them to remove any scaling. It only used 17GB of VRAM.

I got better results than with my attempted AdamW8bit training at a constant LR for 3000 steps at 1e-4 and 1e-3. Those runs seemed both undertrained (sometimes producing other, unrelated things when prompted) and overfit (a realistic style when prompting for line art or comic book style).
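For anyone mapping those numbers onto the optimizer itself, this is roughly what they mean in the prodigyopt package (a sketch only; sd-scripts wires this up via --optimizer_type and --optimizer_args, and the nn.Linear here is just a stand-in for the LoRA parameters):

```python
import torch.nn as nn
from prodigyopt import Prodigy  # pip install prodigyopt

model = nn.Linear(8, 8)  # stand-in for the actual trainable LoRA weights
optimizer = Prodigy(
    model.parameters(),
    lr=1.0,               # LR stays at 1; Prodigy adapts the step size itself
    d_coef=2,             # multiplies the adapted step size, as described above
    betas=(0.9, 0.999),
    weight_decay=0.01,
)
```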

markrmiller commented 2 months ago

Just to note for anyone else who's been trying this or is just starting: at least for me, if I go with full bf16 and the fused backward pass option, I get messed-up hands and then distorted body parts or bodies very quickly, regardless of learning rate. Without full bf16 and the fused optimizer option, I don't get that. Surprisingly, the latter also appears to use less VRAM (certainly not any more) and is faster. Of course, keep in mind things are changing, but that's been my experience over the last few days.

This is for a full finetune, by the way, not a LoRA. I haven't tried a LoRA yet with this repo.
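For anyone wondering what the fused option changes mechanically: the idea behind a fused backward pass is to apply the optimizer update inside each parameter's gradient hook and free the gradient immediately, so a full set of gradients never sits in VRAM at once. A minimal sketch of that pattern (plain SGD as a stand-in for the actual Adafactor update; this is not the repo's implementation):

```python
import torch  # register_post_accumulate_grad_hook needs PyTorch >= 2.1

model = torch.nn.Linear(8, 8)  # stand-in for the real model

def fused_sgd_hook(param, lr=1e-5):
    with torch.no_grad():
        param.add_(param.grad, alpha=-lr)  # apply the update right away
    param.grad = None                      # free the gradient immediately

for p in model.parameters():
    # fires once p.grad has been fully accumulated during backward()
    p.register_post_accumulate_grad_hook(fused_sgd_hook)

loss = model(torch.randn(4, 8)).sum()
loss.backward()  # parameters are updated as their gradients arrive
```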

oovm commented 2 months ago

Can you add a script to quantize a checkpoint from BF16 to NF4? Many users do not have powerful hardware, and I hope my model can be used by more people.
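There isn't such a script in this repo as far as I know, but the underlying primitive exists in bitsandbytes. A minimal sketch of quantizing a single BF16 weight to NF4 and back (CUDA required; the tensor is a dummy):

```python
import torch
import bitsandbytes.functional as F  # pip install bitsandbytes

w = torch.randn(4096, 4096, dtype=torch.bfloat16, device="cuda")  # dummy weight
q, quant_state = F.quantize_nf4(w)            # packed 4-bit data + scaling state
w_restored = F.dequantize_nf4(q, quant_state)
print((w.float() - w_restored.float()).abs().mean())  # quantization error
```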

wogam commented 2 months ago

> I tried to see if I can train a LoRA with Prodigy and it appears to work… weirdly, it's only using ~17GB of VRAM, so not much more than when I was using AdamW8bit. […]

What settings are you using to get such low VRAM usage? AdamW8bit training with 512px images is using almost 24GB of VRAM for me.

Edit: if you have any caption dropout and it doesn't cache the captions, the text encoders are loaded into GPU memory during training, which leads to the high GPU usage.

FurkanGozukara commented 2 months ago

> > I tried to see if I can train a LoRA with Prodigy and it appears to work… weirdly, it's only using ~17GB of VRAM […]
>
> What settings are you using to get such low VRAM usage? AdamW8bit training with 512px images is using almost 24GB of VRAM for me.
>
> Edit: if you have any caption dropout and it doesn't cache the captions, the text encoders are loaded into GPU memory during training, which leads to the high GPU usage.

With Adafactor and 512px I go as low as 7.5 GB: https://youtu.be/nySGu12Y05k

iamrohitanshu commented 2 months ago

@wogam "The training can be done with 12GB VRAM GPUs with Adafactor optimizer, --split_mode and train_blocks=single options." according to the readme file.