Open · chongxian opened this issue 3 months ago
Updated the sd3 branch. Multi-GPU training should now work. Please report again if the issue remains.
The error was not caused by DDP multi-GPU training, but by DeepSpeed. The original DDP multi-GPU training was fine for Flux LoRA training, but as soon as you installed DeepSpeed and enabled it in your training script by adding --deepspeed --zero_stage=2, it would throw "RuntimeError: mat1 and mat2 must have the same dtype, but got Float and BFloat16". Even with mixed_precision set to bf16 for both accelerate and the scripts, the error persisted. Maybe @BootsofLagrangian would like to take a look. Thanks.
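For reference, the message just means one matmul operand stayed in float32 while the other was cast to bf16. A minimal, made-up sketch (shapes are hypothetical, not code from flux_train.py) that triggers the same kind of error:

```python
import torch

# A bf16 weight matrix against an input that was left in float32
# (e.g. a cached embedding that was never cast).
weight = torch.randn(16, 16, dtype=torch.bfloat16)
x = torch.randn(4, 16)  # defaults to float32

try:
    torch.nn.functional.linear(x, weight)
except RuntimeError as e:
    print(e)  # dtype-mismatch error; on CUDA it reads like the one quoted above
```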
Thank you for noticing. I'll check it out and add a comment.
Would you have any idea how to solve this problem?
I have four A100-40G GPUs. Is it feasible to train the Flux model on multiple GPUs? I've been having problems with OOM, but when I add options like --deepspeed --zero_stage=2 --offload_optimizer_device="cpu", it reports the same error: "RuntimeError: mat1 and mat2 must have the same dtype, but got Float and BFloat16".
This is my complete command:
accelerate launch --gpu_ids 0,1,2 --mixed_precision bf16 --num_cpu_threads_per_process 3 flux_train.py \
--pretrained_model_name_or_path ${flux_model_path} --clip_l ${clip_l_path} --t5xxl ${t5xxl_path} --ae ${ae_path} --save_model_as safetensors \
--sdpa --persistent_data_loader_workers --max_data_loader_n_workers 2 --seed 42 --gradient_checkpointing --mixed_precision bf16 --save_precision bf16 \
--output_dir ${output_path} --output_name flux_dev --highvram --cache_text_encoder_outputs_to_disk --cache_latents_to_disk --save_every_n_epochs 1 \
--learning_rate 5e-5 --max_train_epochs 10 \
--optimizer_type adamw8bit \
--timestep_sampling sigmoid --model_prediction_type raw --guidance_scale 1.0 \
--cpu_offload_checkpointing \
--resolution="1024,1024" --bucket_reso_steps=64 --bucket_no_upscale --min_bucket_reso=256 --max_bucket_reso=2048 --enable_bucket \
--train_data_dir=${input_path} --caption_extension=".txt" \
--deepspeed --zero_stage=2 --full_bf16 --gradient_accumulation_steps=1 --cache_latents --offload_optimizer_device="cpu"
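One way to narrow this down (a hedged debugging sketch, not part of sd-scripts; the function name and usage are hypothetical) is to log which Linear layer receives a float32 input once DeepSpeed is enabled:

```python
import torch

def log_dtype_mismatches(model: torch.nn.Module) -> None:
    """Print every Linear layer whose incoming activation dtype differs from its weight dtype."""
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            def hook(mod, inputs, layer_name=name):
                if inputs and inputs[0].dtype != mod.weight.dtype:
                    print(f"{layer_name}: input {inputs[0].dtype} vs weight {mod.weight.dtype}")
            module.register_forward_pre_hook(hook)

# Usage idea (hypothetical): call log_dtype_mismatches(flux) after accelerator.prepare(...)
# in flux_train.py and run a single step to see where Float meets BFloat16.
```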
In sd-scripts, autocast is enabled, but for some reason I don't know yet, autocast does not work for the DeepSpeed-wrapped model. I think it will take some time. Other implementations of ZeRO, such as FSDP (not implemented in sd-scripts), do work.
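For context, this is what autocast is expected to do here. A stand-alone sketch without DeepSpeed (requires a CUDA GPU; shapes are made up): with autocast active, a float32 input to a bf16 Linear is downcast automatically, so the matmul that otherwise fails goes through.

```python
import torch

linear = torch.nn.Linear(16, 16).cuda().to(torch.bfloat16)  # bf16 weights, as with --full_bf16
x = torch.randn(4, 16, device="cuda")                       # input left in float32

# Calling linear(x) outside autocast raises the Float vs BFloat16 RuntimeError.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    y = linear(x)  # autocast casts the input to bf16, so the matmul succeeds
print(y.dtype)     # torch.bfloat16
```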
How should I modify my command to run flux_train.py?
Regarding flux_train.py: even if you remove --deepspeed --zero_stage=2 and just use the script's original DDP multi-GPU training, you will still see the OOM error, no matter how you optimize your configuration as kohya mentioned in the notes for flux_train.py, because some of those configurations only work in the single-GPU case.
I also hit this problem, but when I used the same command with sdxl_train.py it was fine, so I think Flux has some problem with DeepSpeed.
DDP or DeepSpeed? If you want to try running flux_train.py with DDP, you have to fix some code like this. As for DeepSpeed, it is still a work in progress.
DDP seems to consume a lot of memory. I guess it's because the number of parameters is so large that the synchronization overhead is large, but I don't know why, so if anyone knows, please let me know.
Could you try reducing the resolution to about 512x512?
I think this issue is solved.
I tried resolution 512 on an A100 80G, but it still OOMs.
The following options might work: --sdpa --optimizer_type adafactor --optimizer_args relative_step=False scale_parameter=False warmup_init=False --full_bf16 --ddp_gradient_as_bucket_view --ddp_static_graph --cpu_offload_checkpointing --fused_backward_pass
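For what it's worth, the adafactor arguments above roughly correspond to the following (a sketch using transformers' Adafactor, which sd-scripts appears to use for --optimizer_type adafactor; not the exact internal code):

```python
import torch
from transformers.optimization import Adafactor

model = torch.nn.Linear(8, 8)  # stand-in for the Flux model
optimizer = Adafactor(
    model.parameters(),
    lr=5e-5,                # an explicit lr is required once relative_step=False
    relative_step=False,    # turn off Adafactor's internal step-size schedule
    scale_parameter=False,  # do not scale updates by the parameter RMS
    warmup_init=False,
)
```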
Thank you very much, but I'd like to know when DeepSpeed can be used to fine-tune Flux, since with DDP it only works at a small batch size and resolution.
I'm not familiar with DeepSpeed so it will probably take a while.
Thank you. I found that when training with one GPU, only 40 GB of VRAM is used, but with the same config on two GPUs, 80 GB is used. Why is that?
I trained SDXL with DeepSpeed via sd-scripts and it was fine, but Flux is not, so I think the Flux code has some bugs?
@BootsofLagrangian Hi there, I hope I can reach out to you. I also get this dtype error when training a Flux LoRA with DeepSpeed multi-GPU. Do you maybe have any updates on what it might be? Thank you for your time!
RuntimeError: mat1 and mat2 must have the same dtype, but got Float and BFloat16
This is quite a tricky problem. It might be caused by the model's inputs (probably the cached tokens, which are Float), but autocast should handle this in its context manager. Sorry about that.
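If that guess is right, a hedged workaround sketch (tensor names and shapes are hypothetical, not from sd-scripts) would be to cast the cached conditioning to the transformer's dtype before the forward pass, so the bf16 weights never see float32 inputs:

```python
import torch

weight_dtype = torch.bfloat16           # what mixed_precision bf16 / --full_bf16 implies
t5_out = torch.randn(1, 512, 4096)      # hypothetical cached T5-XXL output, loaded as float32
clip_pooled = torch.randn(1, 768)       # hypothetical cached CLIP-L pooled output

t5_out = t5_out.to(dtype=weight_dtype)
clip_pooled = clip_pooled.to(dtype=weight_dtype)
print(t5_out.dtype, clip_pooled.dtype)  # torch.bfloat16 torch.bfloat16
```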
@terrificdm With an RTX 3090 (24 GB) and image resolution 1024, multi-GPU Flux fine-tuning hits OOM. Have you ever run into the same problem? Thanks a lot! My config script is:
I have the same issue, I think this is related to GPU size?
I use the settings below to train a Flux LoRA:
It reports an error like this: