kohya-ss / sd-scripts

Multi-GPU training of FLUX reports an error #1475

Open: chongxian opened this issue 2 months ago

chongxian commented 2 months ago

I use the settings below to train a FLUX LoRA:

accelerate launch  --gpu_ids 0,1 --main_process_port 29502 --mixed_precision bf16 --num_cpu_threads_per_process=2 \
    flux_train_network.py --pretrained_model_name_or_path ${flux_model_path} \
    --clip_l ${clip_l_path} --t5xxl ${t5xxl_path}  --ae ${ae_path} \
    --cache_latents_to_disk --save_model_as safetensors --sdpa --persistent_data_loader_workers \
    --max_data_loader_n_workers 2 --seed 42 \
    --gradient_checkpointing \
    --save_precision bf16 --mixed_precision bf16 \
    --network_module networks.lora_flux \
    --network_dim 16 \
    --optimizer_type prodigy \
    --learning_rate 1 --network_train_unet_only \
    --cache_text_encoder_outputs --cache_text_encoder_outputs_to_disk \
    --highvram \
    --max_train_epochs 10   \
    --save_every_n_epochs 1 \
    --train_data_dir=${input_path} \
    --output_dir ${output_path}  \
    --output_name flux_shot \
    --timestep_sampling sigmoid --model_prediction_type raw --guidance_scale 1 --loss_type l2 \
    --resolution="1024,1024" --bucket_reso_steps=64 --bucket_no_upscale --min_bucket_reso=256 --max_bucket_reso=2048 --enable_bucket \
    --caption_extension=".txt" \
    --lr_scheduler="cosine" --lr_warmup_steps=396 --train_batch_size=4 --deepspeed --zero_stage=2 \
    --log_with="wandb" --wandb_run_name="shot2" --wandb_api_key="" --logging_dir=${output_path}"/logs" --log_tracker_name="flux_lora1" 

It reports an error like the one in the attached screenshot.

kohya-ss commented 2 months ago

Updated the sd3 branch. Multi-GPU training should now work. Please report again if the issue remains.

terrificdm commented 2 months ago

The error is not caused by DDP multi-GPU training but by DeepSpeed... The original DDP multi-GPU training works fine for FLUX LoRA training, but as soon as DeepSpeed is installed and enabled in the training script by adding --deepspeed --zero_stage=2, it throws "RuntimeError: mat1 and mat2 must have the same dtype, but got Float and BFloat16". Even with mixed_precision configured as bf16 for both accelerate and the script, the error persists. Maybe @BootsofLagrangian would like to take a look. Thanks.
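
For context, a minimal standalone sketch (not taken from sd-scripts, and assuming a CUDA device) of the dtype mismatch and of why autocast normally hides it:

import torch

# bf16 weights, as in bf16 / full_bf16 training
linear = torch.nn.Linear(64, 64, device="cuda", dtype=torch.bfloat16)
# a float32 activation reaching the layer from somewhere that was not cast
x = torch.randn(4, 64, device="cuda")

try:
    linear(x)  # float32 input @ bfloat16 weight
except RuntimeError as e:
    print(e)  # mat1 and mat2 must have the same dtype, but got Float and BFloat16

# Inside an autocast region the matmul inputs are cast to a common dtype, so the
# same call succeeds; the symptom reported here suggests that autocast context is
# lost once DeepSpeed wraps the model.
with torch.autocast("cuda", dtype=torch.bfloat16):
    y = linear(x)
print(y.dtype)  # torch.bfloat16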

BootsofLagrangian commented 2 months ago

The error is not caused by DDP multi-GPU training but by DeepSpeed... The original DDP multi-GPU training works fine for FLUX LoRA training, but as soon as DeepSpeed is installed and enabled in the training script by adding --deepspeed --zero_stage=2, it throws "RuntimeError: mat1 and mat2 must have the same dtype, but got Float and BFloat16". Even with mixed_precision configured as bf16 for both accelerate and the script, the error persists. Maybe @BootsofLagrangian would like to take a look. Thanks.

Thank you for noticing. I'll check it out and add a comment.

chongxian commented 2 months ago

The error is not caused by DDP multi-GPU training but by DeepSpeed... The original DDP multi-GPU training works fine for FLUX LoRA training, but as soon as DeepSpeed is installed and enabled in the training script by adding --deepspeed --zero_stage=2, it throws "RuntimeError: mat1 and mat2 must have the same dtype, but got Float and BFloat16". Even with mixed_precision configured as bf16 for both accelerate and the script, the error persists. Maybe @BootsofLagrangian would like to take a look. Thanks.

Thank you for noticing. I'll check it out and add a comment.

Would you have any idea how to solve this problem?

chongxian commented 2 months ago

Updated the sd3 branch. Multi-GPU training should now work. Please report again if the issue remains.

I have four A100-40G GPUs. Is it feasible to train the FLUX model with multiple GPUs? I've been running into OOM, but when I add options like --deepspeed --zero_stage=2 --offload_optimizer_device="cpu", it reports the same error: "RuntimeError: mat1 and mat2 must have the same dtype, but got Float and BFloat16"

chongxian commented 2 months ago

Updated the sd3 branch. Multi-GPU training should now work. Please report again if the issue remains.

I have four A100-40G GPUs. Is it feasible to train the FLUX model with multiple GPUs? I've been running into OOM, but when I add options like --deepspeed --zero_stage=2 --offload_optimizer_device="cpu", it reports the same error: "RuntimeError: mat1 and mat2 must have the same dtype, but got Float and BFloat16"

This is my complete command:

accelerate launch  --gpu_ids 0,1,2 --mixed_precision bf16 --num_cpu_threads_per_process 3 flux_train.py \
    --pretrained_model_name_or_path ${flux_model_path} --clip_l ${clip_l_path} --t5xxl ${t5xxl_path} --ae ${ae_path} --save_model_as safetensors \
    --sdpa --persistent_data_loader_workers --max_data_loader_n_workers 2 --seed 42 --gradient_checkpointing --mixed_precision bf16 --save_precision bf16 \
    --output_dir ${output_path} --output_name flux_dev  --highvram --cache_text_encoder_outputs_to_disk --cache_latents_to_disk --save_every_n_epochs 1  \
    --learning_rate 5e-5 --max_train_epochs 10 \
    --optimizer_type adamw8bit  \
    --timestep_sampling sigmoid --model_prediction_type raw --guidance_scale 1.0 \
    --cpu_offload_checkpointing \
    --resolution="1024,1024" --bucket_reso_steps=64 --bucket_no_upscale --min_bucket_reso=256 --max_bucket_reso=2048 --enable_bucket \
    --train_data_dir=${input_path} --caption_extension=".txt"  \
    --deepspeed --zero_stage=2 --full_bf16  --gradient_accumulation_steps=1 --cache_latents --offload_optimizer_device="cpu"

BootsofLagrangian commented 2 months ago

Updated the sd3 branch. Multi-GPU training should now work. Please report again if the issue remains.

I have four A100-40G GPUs. Is it feasible to train the FLUX model with multiple GPUs? I've been running into OOM, but when I add options like --deepspeed --zero_stage=2 --offload_optimizer_device="cpu", it reports the same error: "RuntimeError: mat1 and mat2 must have the same dtype, but got Float and BFloat16"

In sd-scripts, autocast is enabled, but for a reason I don't yet understand, autocasting does not take effect for the DeepSpeed-wrapped model. I think fixing it will take some time. However, another implementation of ZeRO, FSDP, does work, although it is not implemented in sd-scripts.
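
A possible stopgap while this is being investigated is to make the dtypes explicit around the call into the wrapped model instead of relying on the surrounding autocast context. A rough sketch, with placeholder names (model, img, txt, timesteps are illustrative, not the actual sd-scripts call signature):

import torch

weight_dtype = torch.bfloat16  # matches --mixed_precision bf16

def call_flux(model, img, txt, timesteps):
    # Cast inputs to the parameter dtype so the first matmul cannot mix
    # Float and BFloat16, and re-open an autocast region explicitly in case
    # the one set up by the training loop does not survive the DeepSpeed wrap.
    img = img.to(weight_dtype)
    txt = txt.to(weight_dtype)
    with torch.autocast("cuda", dtype=weight_dtype):
        return model(img, txt, timesteps)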

chongxian commented 2 months ago

Updated the sd3 branch. Multi-GPU training should now work. Please report again if the issue remains.

I have four A100-40G GPUs. Is it feasible to train the FLUX model with multiple GPUs? I've been running into OOM, but when I add options like --deepspeed --zero_stage=2 --offload_optimizer_device="cpu", it reports the same error: "RuntimeError: mat1 and mat2 must have the same dtype, but got Float and BFloat16"

In sd-scripts, autocast is enabled, but for a reason I don't yet understand, autocasting does not take effect for the DeepSpeed-wrapped model. I think fixing it will take some time. However, another implementation of ZeRO, FSDP, does work, although it is not implemented in sd-scripts.

How should I modify my command to run flux_train.py?

terrificdm commented 2 months ago

Updated the sd3 branch. Multi-GPU training should now work. Please report again if the issue remains.

I have four A100-40G GPUs. Is it feasible to train the FLUX model with multiple GPUs? I've been running into OOM, but when I add options like --deepspeed --zero_stage=2 --offload_optimizer_device="cpu", it reports the same error: "RuntimeError: mat1 and mat2 must have the same dtype, but got Float and BFloat16"

This is my complete command: (command quoted in full above)

Regarding flux_train.py, even if you remove --deepspeed --zero_stage=2 and just use the script's original DDP multi-GPU training, you will still see the OOM error no matter how you tune the configurations kohya mentioned in the notes of flux_train.py, because some of those configurations only work in the single-GPU case.

Ethan-niu commented 2 months ago

Updated the sd3 branch. Multi-GPU training should now work. Please report again if the issue remains.

I have four A100-40G GPUs. Is it feasible to train the FLUX model with multiple GPUs? I've been running into OOM, but when I add options like --deepspeed --zero_stage=2 --offload_optimizer_device="cpu", it reports the same error: "RuntimeError: mat1 and mat2 must have the same dtype, but got Float and BFloat16"

This is my complete command: (command quoted in full above)

Regarding flux_train.py, even if you remove --deepspeed --zero_stage=2 and just use the script's original DDP multi-GPU training, you will still see the OOM error no matter how you tune the configurations kohya mentioned in the notes of flux_train.py, because some of those configurations only work in the single-GPU case.

I also hit this problem, but when I use the same command with sdxl_train.py it works fine, so I think FLUX has some problem with DeepSpeed.

Ethan-niu commented 2 months ago

The error is not caused by DDP multi-GPU training but by DeepSpeed... The original DDP multi-GPU training works fine for FLUX LoRA training, but as soon as DeepSpeed is installed and enabled in the training script by adding --deepspeed --zero_stage=2, it throws "RuntimeError: mat1 and mat2 must have the same dtype, but got Float and BFloat16". Even with mixed_precision configured as bf16 for both accelerate and the script, the error persists. Maybe @BootsofLagrangian would like to take a look. Thanks.

I also hit this problem, but when I use the same command with sdxl_train.py it works fine, so I think FLUX has some problem with DeepSpeed.

BootsofLagrangian commented 2 months ago

Updated the sd3 branch. Multi-GPU training should now work. Please report again if the issue remains.

I have four A100-40G GPUs. Is it feasible to train the FLUX model with multiple GPUs? I've been running into OOM, but when I add options like --deepspeed --zero_stage=2 --offload_optimizer_device="cpu", it reports the same error: "RuntimeError: mat1 and mat2 must have the same dtype, but got Float and BFloat16"

In sd-scripts, autocast is enabled, but for a reason I don't yet understand, autocasting does not take effect for the DeepSpeed-wrapped model. I think fixing it will take some time. However, another implementation of ZeRO, FSDP, does work, although it is not implemented in sd-scripts.

How should I modify my command to run flux_train.py?

DDP or DeepSpeed? If you want to try running flux_train.py with DDP, you have to fix some code like this. As for DeepSpeed, that is still a work in progress.

kohya-ss commented 2 months ago

DDP seems to consume a lot of memory. I guess it's because the number of parameters is so large that the synchronization overhead is large, but I don't know why, so if anyone knows, please let me know.

Regarding flux_train.py, even if you remove --deepspeed --zero_stage=2 and just use the script's original DDP multi-GPU training, you will still see the OOM error no matter how you tune the configurations kohya mentioned in the notes of flux_train.py, because some of those configurations only work in the single-GPU case.

Could you try reducing the resolution to about 512x512?

kohya-ss commented 2 months ago

DDP or DeepSpeed? If you want to try running flux_train.py with DDP, you have to fix some code like this. As for DeepSpeed, that is still a work in progress.

I think this issue is solved.

Ethan-niu commented 2 months ago

DDP seems to consume a lot of memory. I guess it's because the number of parameters is so large that the synchronization overhead is large, but I don't know why, so if anyone knows, please let me know.

Regarding flux_train.py, even if you remove --deepspeed --zero_stage=2 and just use the script's original DDP multi-GPU training, you will still see the OOM error no matter how you tune the configurations kohya mentioned in the notes of flux_train.py, because some of those configurations only work in the single-GPU case.

Could you try reducing the resolution to about 512x512?

I tried resolution 512 on an A100 80G, but still get OOM.

kohya-ss commented 2 months ago

The following options might work: --sdpa --optimizer_type adafactor --optimizer_args relative_step=False scale_parameter=False warmup_init=False --full_bf16 --ddp_gradient_as_bucket_view --ddp_static_graph --cpu_offload_checkpointing --fused_backward_pass
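
For what it's worth, the two DDP flags presumably map to PyTorch DDP options through Accelerate; a rough sketch of that assumed mapping (illustrative, not code from the repo):

from accelerate import Accelerator
from accelerate.utils import DistributedDataParallelKwargs

ddp_kwargs = DistributedDataParallelKwargs(
    gradient_as_bucket_view=True,  # gradients share storage with the reduction buckets
    static_graph=True,             # DDP may assume the graph is fixed across iterations
)
accelerator = Accelerator(mixed_precision="bf16", kwargs_handlers=[ddp_kwargs])

gradient_as_bucket_view in particular avoids keeping gradients and communication buckets as separate copies, which matters for a model of this size.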

Ethan-niu commented 2 months ago

The following options might work: --sdpa --optimizer_type adafactor --optimizer_args relative_step=False scale_parameter=False warmup_init=False --full_bf16 --ddp_gradient_as_bucket_view --ddp_static_graph --cpu_offload_checkpointing --fused_backward_pass

Thank you very much, but I want to know when we will be able to use DeepSpeed to fine-tune FLUX; with DDP we can only train with a small batch size and resolution.

kohya-ss commented 2 months ago

I'm not familiar with DeepSpeed, so it will probably take a while.

Ethan-niu commented 2 months ago

I'm not familiar with DeepSpeed, so it will probably take a while.

Thank you. I found that when training with 1 GPU, only 40 GB of VRAM is used, but with the same config and two-GPU training, 80 GB of VRAM is used. Why?
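
One way to narrow this down (a generic PyTorch sketch, not sd-scripts code; accelerator stands for the script's Accelerator instance) is to log each rank's peak allocation after the model is prepared and again after the first optimizer step, then compare the single-GPU and multi-GPU numbers:

import torch

def report_peak_memory(accelerator, tag):
    # Peak memory actually allocated by tensors on this rank's GPU, in GiB.
    allocated = torch.cuda.max_memory_allocated() / 1024**3
    # Memory reserved by the CUDA caching allocator (>= allocated).
    reserved = torch.cuda.memory_reserved() / 1024**3
    print(f"[rank {accelerator.process_index}] {tag}: "
          f"peak allocated {allocated:.1f} GiB, reserved {reserved:.1f} GiB")

If the jump appears right after the first backward pass, the extra usage is likely the DDP gradient buckets rather than the model weights themselves.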

Ethan-niu commented 1 month ago

Updated the sd3 branch. Multi-GPU training should now work. Please report again if the issue remains.

I have four A100-40G GPUs. Is it feasible to train the FLUX model with multiple GPUs? I've been running into OOM, but when I add options like --deepspeed --zero_stage=2 --offload_optimizer_device="cpu", it reports the same error: "RuntimeError: mat1 and mat2 must have the same dtype, but got Float and BFloat16"

In sd-scripts, autocast is enabled, but for a reason I don't yet understand, autocasting does not take effect for the DeepSpeed-wrapped model. I think fixing it will take some time. However, another implementation of ZeRO, FSDP, does work, although it is not implemented in sd-scripts.

I trained SDXL with DeepSpeed using sd-scripts and it works, but FLUX does not, so I think the FLUX code may have some bugs?