chongxian opened this issue 2 months ago
Updated the sd3 branch. Multi-GPU training should now work. Please report again if the issue remains.
The error was not caused by DDP multi-GPU training, but by DeepSpeed... The original DDP multi-GPU training worked fine for Flux LoRA training, but as soon as you installed DeepSpeed and enabled it in the training script by adding --deepspeed --zero_stage=2, it would throw "RuntimeError: mat1 and mat2 must have the same dtype, but got Float and BFloat16". Even with mixed_precision set to bf16 for both accelerate and the script, the error persisted. Maybe @BootsofLagrangian would like to take a look. Thanks.
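For context, here is a minimal sketch in plain PyTorch (not sd-scripts code) of how this kind of mismatch arises: bf16 weights multiplied by float32 activations fail outside an autocast context and succeed inside one.

```python
import torch

# bf16 weights (as with --full_bf16) times float32 activations: the matmul fails
# with a dtype-mismatch RuntimeError outside of autocast.
linear = torch.nn.Linear(8, 8).to(torch.bfloat16)
x = torch.randn(2, 8)  # float32

try:
    linear(x)
except RuntimeError as e:
    print(e)  # e.g. "mat1 and mat2 must have the same dtype ..."

# Inside autocast the matmul inputs are cast to a common dtype, so the same call succeeds.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = linear(x)
print(out.dtype)  # torch.bfloat16
```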
Thank you for noticing. I'll check it out and add a comment.
Do you have any idea how to solve this problem?
I have four A100-40G GPUs. Is it feasible to train the Flux model on multiple GPUs? I keep running into OOM, but when I add --deepspeed --zero_stage=2 --offload_optimizer_device="cpu", it reports the same error: "RuntimeError: mat1 and mat2 must have the same dtype, but got Float and BFloat16".
This is my complete command:
accelerate launch --gpu_ids 0,1,2 --mixed_precision bf16 --num_cpu_threads_per_process 3 flux_train.py \
--pretrained_model_name_or_path ${flux_model_path} --clip_l ${clip_l_path} --t5xxl ${t5xxl_path} --ae ${ae_path} --save_model_as safetensors \
--sdpa --persistent_data_loader_workers --max_data_loader_n_workers 2 --seed 42 --gradient_checkpointing --mixed_precision bf16 --save_precision bf16 \
--output_dir ${output_path} --output_name flux_dev --highvram --cache_text_encoder_outputs_to_disk --cache_latents_to_disk --save_every_n_epochs 1 \
--learning_rate 5e-5 --max_train_epochs 10 \
--optimizer_type adamw8bit \
--timestep_sampling sigmoid --model_prediction_type raw --guidance_scale 1.0 \
--cpu_offload_checkpointing \
--resolution="1024,1024" --bucket_reso_steps=64 --bucket_no_upscale --min_bucket_reso=256 --max_bucket_reso=2048 --enable_bucket \
--train_data_dir=${input_path} --caption_extension=".txt" \
--deepspeed --zero_stage=2 --full_bf16 --gradient_accumulation_steps=1 --cache_latents --offload_optimizer_device="cpu"
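For what it's worth, a small hypothetical check like the one below (not part of sd-scripts; param_dtype_histogram and flux_model are made-up names) could show whether any parameters are still float32 when this error fires:

```python
import torch
from collections import Counter

def param_dtype_histogram(model: torch.nn.Module) -> Counter:
    # Count parameters per dtype to see whether anything is still float32
    # after the bf16 cast.
    return Counter(str(p.dtype) for p in model.parameters())

# Usage sketch, e.g. right before the training loop:
# print(param_dtype_histogram(flux_model))
```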
In sd-scripts, autocast is enabled, but for a reason I don't yet understand, autocast does not take effect for the DeepSpeed-wrapped model. I think fixing this will take some time. However, FSDP, another implementation of ZeRO that is not implemented in sd-scripts, does work.
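If anyone wants to experiment in the meantime, a minimal workaround sketch (my assumption, not the upstream fix) is to wrap the forward pass of the wrapped model in an explicit autocast context:

```python
import torch

def forward_with_autocast(model, *args, **kwargs):
    # Hedged workaround sketch: if the autocast context is not reaching the
    # DeepSpeed-wrapped module, an explicit autocast around the forward pass
    # keeps the Float/BFloat16 matmuls in a consistent dtype.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        return model(*args, **kwargs)
```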
How should I modify my command to run flux_train.py?
Regarding flux_train.py: even if you remove --deepspeed --zero_stage=2 and just use the script's original DDP multi-GPU training, you will still hit the OOM error, no matter how you tune the options kohya mentioned in the notes for flux_train.py, because some of those options only work in the single-GPU case.
I also hit this problem, but the same command works fine with sdxl_train.py, so I think Flux has some issue with DeepSpeed.
DDP or DeepSpeed? If you want to run flux_train.py with DDP, you have to fix some code like this. DeepSpeed support is still a work in progress.
DDP seems to consume a lot of memory. I guess it's because the number of parameters is so large that the synchronization overhead is large, but I don't know why, so if anyone knows, please let me know.
Could you try reducing the resolution to about 512x512?
Regarding the DDP code fix mentioned above, I think this issue is solved.
I tried resolution 512 on an A100 80G, but it still OOMs.
The following options might work: --sdpa --optimizer_type adafactor --optimizer_args relative_step=False scale_parameter=False warmup_init=False --full_bf16 --ddp_gradient_as_bucket --ddp_static_graph --cpu_offload_checkpointing --fused_backward_pass
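(A hedged sketch of what the two DDP-related flags presumably map to in plain PyTorch; the exact wiring inside sd-scripts may differ, and the function below is only an illustration that assumes torch.distributed has already been initialized by the launcher:)

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_ddp(model: torch.nn.Module, local_rank: int) -> DDP:
    # Assumption: --ddp_gradient_as_bucket and --ddp_static_graph correspond to the
    # DDP constructor arguments below. gradient_as_bucket_view lets gradients alias
    # the communication buckets (saving memory); static_graph enables extra
    # optimizations when the graph does not change between iterations.
    return DDP(
        model.cuda(local_rank),
        device_ids=[local_rank],
        gradient_as_bucket_view=True,
        static_graph=True,
    )
```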
Thank you very much, but I would like to know when DeepSpeed will be usable to fine-tune Flux; with DDP I can only use a small batch size and resolution.
I'm not familiar with DeepSpeed so it will probably take a while.
Thank you. I also found that training with 1 GPU uses only about 40 GB of VRAM, but with the same config and two-GPU training, 80 GB of VRAM is used. Why is that?
I trained SDXL with DeepSpeed successfully using sd-scripts, but Flux does not work, so I think the Flux code may have a bug?
I use the settings below to train a Flux LoRA:
It reports an error like this: