kohya-ss / sd-scripts

Multi-GPU training of FLUX reports an error #1475

Open: chongxian opened this issue 2 months ago

chongxian commented 2 months ago

I use the settings below to train a FLUX LoRA:

accelerate launch  --gpu_ids 0,1 --main_process_port 29502 --mixed_precision bf16 --num_cpu_threads_per_process=2 \
    flux_train_network.py --pretrained_model_name_or_path ${flux_model_path} \
    --clip_l ${clip_l_path} --t5xxl ${t5xxl_path}  --ae ${ae_path} \
    --cache_latents_to_disk --save_model_as safetensors --sdpa --persistent_data_loader_workers \
    --max_data_loader_n_workers 2 --seed 42 \
    --gradient_checkpointing \
    --save_precision bf16 --mixed_precision bf16 \
    --network_module networks.lora_flux \
    --network_dim 16 \
    --optimizer_type prodigy \
    --learning_rate 1 --network_train_unet_only \
    --cache_text_encoder_outputs --cache_text_encoder_outputs_to_disk \
    --highvram \
    --max_train_epochs 10   \
    --save_every_n_epochs 1 \
    --train_data_dir=${input_path} \
    --output_dir ${output_path}  \
    --output_name flux_shot \
    --timestep_sampling sigmoid --model_prediction_type raw --guidance_scale 1 --loss_type l2 \
    --resolution="1024,1024" --bucket_reso_steps=64 --bucket_no_upscale --min_bucket_reso=256 --max_bucket_reso=2048 --enable_bucket \
    --caption_extension=".txt" \
    --lr_scheduler="cosine" --lr_warmup_steps=396 --train_batch_size=4 --deepspeed --zero_stage=2 \
    --log_with="wandb" --wandb_run_name="shot2" --wandb_api_key="" --logging_dir=${output_path}"/logs" --log_tracker_name="flux_lora1" 

It reports an error like the one in the attached screenshot.

kohya-ss commented 2 months ago

Updated the sd3 branch. Multi-GPU training should now work. Please report again if the issue remains.

terrificdm commented 2 months ago

The error is not caused by DDP multi-GPU training but by DeepSpeed... The original DDP multi-GPU training works fine for FLUX LoRA training, but as soon as DeepSpeed is installed and enabled in the training script by adding --deepspeed --zero_stage=2, it throws "RuntimeError: mat1 and mat2 must have the same dtype, but got Float and BFloat16". Even with mixed_precision configured as bf16 for both accelerate and the script, the error persists. Maybe @BootsofLagrangian would like to take a look. Thanks.
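
For context, a minimal standalone sketch (not taken from sd-scripts, and assuming a CUDA device) of the dtype mismatch and of why autocast normally hides it:

import torch

# bf16 weights, as in bf16 / full_bf16 training
linear = torch.nn.Linear(64, 64, device="cuda", dtype=torch.bfloat16)
# a float32 activation reaching the layer from somewhere that was not cast
x = torch.randn(4, 64, device="cuda")

try:
    linear(x)  # float32 input @ bfloat16 weight
except RuntimeError as e:
    print(e)  # mat1 and mat2 must have the same dtype, but got Float and BFloat16

# Inside an autocast region the matmul inputs are cast to a common dtype, so the
# same call succeeds; the symptom reported here suggests that autocast context is
# lost once DeepSpeed wraps the model.
with torch.autocast("cuda", dtype=torch.bfloat16):
    y = linear(x)
print(y.dtype)  # torch.bfloat16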

BootsofLagrangian commented 2 months ago

The error is not caused by DDP multi-GPU training but by DeepSpeed... The original DDP multi-GPU training works fine for FLUX LoRA training, but as soon as DeepSpeed is installed and enabled in the training script by adding --deepspeed --zero_stage=2, it throws "RuntimeError: mat1 and mat2 must have the same dtype, but got Float and BFloat16". Even with mixed_precision configured as bf16 for both accelerate and the script, the error persists. Maybe @BootsofLagrangian would like to take a look. Thanks.

Thank you for noticing. I'll check it out and add a comment.

chongxian commented 2 months ago

The error is not caused by DDP multi-GPU training but by DeepSpeed... The original DDP multi-GPU training works fine for FLUX LoRA training, but as soon as DeepSpeed is installed and enabled in the training script by adding --deepspeed --zero_stage=2, it throws "RuntimeError: mat1 and mat2 must have the same dtype, but got Float and BFloat16". Even with mixed_precision configured as bf16 for both accelerate and the script, the error persists. Maybe @BootsofLagrangian would like to take a look. Thanks.

Thank you for noticing. I'll check it out and add a comment.

Would you have any idea how to solve this problem?

chongxian commented 2 months ago

Updated the sd3 branch. Multi-GPU training should now work. Please report again if the issue remains.

I have four A100-40G GPUs. Is it feasible to train the FLUX model with multiple GPUs? I've been running into OOM, but when I add options like --deepspeed --zero_stage=2 --offload_optimizer_device="cpu", it reports the same error: "RuntimeError: mat1 and mat2 must have the same dtype, but got Float and BFloat16"

chongxian commented 2 months ago

Updated the sd3 branch. Multi-GPU training should now work. Please report again if the issue remains.

I have four A100-40G GPUs. Is it feasible to train the FLUX model with multiple GPUs? I've been running into OOM, but when I add options like --deepspeed --zero_stage=2 --offload_optimizer_device="cpu", it reports the same error: "RuntimeError: mat1 and mat2 must have the same dtype, but got Float and BFloat16"

This is my complete command:

accelerate launch  --gpu_ids 0,1,2 --mixed_precision bf16 --num_cpu_threads_per_process 3 flux_train.py \
    --pretrained_model_name_or_path ${flux_model_path} --clip_l ${clip_l_path} --t5xxl ${t5xxl_path} --ae ${ae_path} --save_model_as safetensors \
    --sdpa --persistent_data_loader_workers --max_data_loader_n_workers 2 --seed 42 --gradient_checkpointing --mixed_precision bf16 --save_precision bf16 \
    --output_dir ${output_path} --output_name flux_dev  --highvram --cache_text_encoder_outputs_to_disk --cache_latents_to_disk --save_every_n_epochs 1  \
    --learning_rate 5e-5 --max_train_epochs 10 \
    --optimizer_type adamw8bit  \
    --timestep_sampling sigmoid --model_prediction_type raw --guidance_scale 1.0 \
    --cpu_offload_checkpointing \
    --resolution="1024,1024" --bucket_reso_steps=64 --bucket_no_upscale --min_bucket_reso=256 --max_bucket_reso=2048 --enable_bucket \
    --train_data_dir=${input_path} --caption_extension=".txt"  \
    --deepspeed --zero_stage=2 --full_bf16  --gradient_accumulation_steps=1 --cache_latents --offload_optimizer_device="cpu"

BootsofLagrangian commented 2 months ago

Updated the sd3 branch. Multi-GPU training should now work. Please report again if the issue remains.

I have four A100-40G GPUs. Is it feasible to train the FLUX model with multiple GPUs? I've been running into OOM, but when I add options like --deepspeed --zero_stage=2 --offload_optimizer_device="cpu", it reports the same error: "RuntimeError: mat1 and mat2 must have the same dtype, but got Float and BFloat16"

In sd-scripts, autocast is enabled, but for a reason I don't yet understand, autocasting does not take effect for the DeepSpeed-wrapped model. I think fixing it will take some time. However, another implementation of ZeRO, FSDP, does work, although it is not implemented in sd-scripts.
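
A possible stopgap while this is being investigated is to make the dtypes explicit around the call into the wrapped model instead of relying on the surrounding autocast context. A rough sketch, with placeholder names (model, img, txt, timesteps are illustrative, not the actual sd-scripts call signature):

import torch

weight_dtype = torch.bfloat16  # matches --mixed_precision bf16

def call_flux(model, img, txt, timesteps):
    # Cast inputs to the parameter dtype so the first matmul cannot mix
    # Float and BFloat16, and re-open an autocast region explicitly in case
    # the one set up by the training loop does not survive the DeepSpeed wrap.
    img = img.to(weight_dtype)
    txt = txt.to(weight_dtype)
    with torch.autocast("cuda", dtype=weight_dtype):
        return model(img, txt, timesteps)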

chongxian commented 2 months ago

Updated the sd3 branch. Multi-GPU training should now work. Please report again if the issue remains.

I have four A100-40G GPUs. Is it feasible to train the FLUX model with multiple GPUs? I've been running into OOM, but when I add options like --deepspeed --zero_stage=2 --offload_optimizer_device="cpu", it reports the same error: "RuntimeError: mat1 and mat2 must have the same dtype, but got Float and BFloat16"

In sd-scripts, autocast is enabled, but for a reason I don't yet understand, autocasting does not take effect for the DeepSpeed-wrapped model. I think fixing it will take some time. However, another implementation of ZeRO, FSDP, does work, although it is not implemented in sd-scripts.

How should I modify my command to run flux_train.py?

terrificdm commented 2 months ago

Updated the sd3 branch. Multi-GPU training should now work. Please report again if the issue remains.

I have four A100-40G GPUs. Is it feasible to train the FLUX model with multiple GPUs? I've been running into OOM, but when I add options like --deepspeed --zero_stage=2 --offload_optimizer_device="cpu", it reports the same error: "RuntimeError: mat1 and mat2 must have the same dtype, but got Float and BFloat16"

This is my complete command: (command quoted in full above)

Regarding flux_train.py, even if you remove --deepspeed --zero_stage=2 and just use the script's original DDP multi-GPU training, you will still see the OOM error no matter how you tune the configurations kohya mentioned in the notes of flux_train.py, because some of those configurations only work in the single-GPU case.

Ethan-niu commented 2 months ago

Updated the sd3 branch. Multi-GPU training should now work. Please report again if the issue remains.

I have four A100-40G GPUs. Is it feasible to train the FLUX model with multiple GPUs? I've been running into OOM, but when I add options like --deepspeed --zero_stage=2 --offload_optimizer_device="cpu", it reports the same error: "RuntimeError: mat1 and mat2 must have the same dtype, but got Float and BFloat16"

This is my complete command: (command quoted in full above)

Regarding flux_train.py, even if you remove --deepspeed --zero_stage=2 and just use the script's original DDP multi-GPU training, you will still see the OOM error no matter how you tune the configurations kohya mentioned in the notes of flux_train.py, because some of those configurations only work in the single-GPU case.

I also hit this problem, but when I use the same command with sdxl_train.py it works fine, so I think FLUX has some problem with DeepSpeed.

Ethan-niu commented 2 months ago

The error is not caused by DDP multi-GPU training but by DeepSpeed... The original DDP multi-GPU training works fine for FLUX LoRA training, but as soon as DeepSpeed is installed and enabled in the training script by adding --deepspeed --zero_stage=2, it throws "RuntimeError: mat1 and mat2 must have the same dtype, but got Float and BFloat16". Even with mixed_precision configured as bf16 for both accelerate and the script, the error persists. Maybe @BootsofLagrangian would like to take a look. Thanks.

I also hit this problem, but when I use the same command with sdxl_train.py it works fine, so I think FLUX has some problem with DeepSpeed.

BootsofLagrangian commented 2 months ago

Updated the sd3 branch. Multi-GPU training should now work. Please report again if the issue remains.

I have four A100-40G GPUs. Is it feasible to train the FLUX model with multiple GPUs? I've been running into OOM, but when I add options like --deepspeed --zero_stage=2 --offload_optimizer_device="cpu", it reports the same error: "RuntimeError: mat1 and mat2 must have the same dtype, but got Float and BFloat16"

In sd-scripts, autocast is enabled, but for a reason I don't yet understand, autocasting does not take effect for the DeepSpeed-wrapped model. I think fixing it will take some time. However, another implementation of ZeRO, FSDP, does work, although it is not implemented in sd-scripts.

How should I modify my command to run flux_train.py?

DDP or DeepSpeed? If you want to try running flux_train.py with DDP, you have to fix some code like this. As for DeepSpeed, that is still a work in progress.

kohya-ss commented 2 months ago

DDP seems to consume a lot of memory. I guess it's because the number of parameters is so large that the synchronization overhead is large, but I don't know why, so if anyone knows, please let me know.

Regarding flux_train.py, even if you remove --deepspeed --zero_stage=2 and just use the script's original DDP multi-GPU training, you will still see the OOM error no matter how you tune the configurations kohya mentioned in the notes of flux_train.py, because some of those configurations only work in the single-GPU case.

Could you try reducing the resolution to about 512x512?

kohya-ss commented 2 months ago

DDP or DeepSpeed? If you want to try running flux_train.py with DDP, you have to fix some code like this. As for DeepSpeed, that is still a work in progress.

I think this issue is solved.

Ethan-niu commented 2 months ago

DDP seems to consume a lot of memory. I guess it's because the number of parameters is so large that the synchronization overhead is large, but I don't know why, so if anyone knows, please let me know.

Regarding flux_train.py, even if you remove --deepspeed --zero_stage=2 and just use the script's original DDP multi-GPU training, you will still see the OOM error no matter how you tune the configurations kohya mentioned in the notes of flux_train.py, because some of those configurations only work in the single-GPU case.

Could you try reducing the resolution to about 512x512?

I tried resolution 512 on an A100 80G, but still get OOM.

kohya-ss commented 2 months ago

The following options might work: --sdpa --optimizer_type adafactor --optimizer_args relative_step=False scale_parameter=False warmup_init=False --full_bf16 --ddp_gradient_as_bucket_view --ddp_static_graph --cpu_offload_checkpointing --fused_backward_pass
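
For what it's worth, the two DDP flags presumably map to PyTorch DDP options through Accelerate; a rough sketch of that assumed mapping (illustrative, not code from the repo):

from accelerate import Accelerator
from accelerate.utils import DistributedDataParallelKwargs

ddp_kwargs = DistributedDataParallelKwargs(
    gradient_as_bucket_view=True,  # gradients share storage with the reduction buckets
    static_graph=True,             # DDP may assume the graph is fixed across iterations
)
accelerator = Accelerator(mixed_precision="bf16", kwargs_handlers=[ddp_kwargs])

gradient_as_bucket_view in particular avoids keeping gradients and communication buckets as separate copies, which matters for a model of this size.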

Ethan-niu commented 2 months ago

The following options might work: --sdpa --optimizer_type adafactor --optimizer_args relative_step=False scale_parameter=False warmup_init=False --full_bf16 --ddp_gradient_as_bucket_view --ddp_static_graph --cpu_offload_checkpointing --fused_backward_pass

Thank you very much, but I want to know when we will be able to use DeepSpeed to fine-tune FLUX; with DDP we can only train with a small batch size and resolution.

kohya-ss commented 2 months ago

I'm not familiar with DeepSpeed, so it will probably take a while.

Ethan-niu commented 2 months ago

I'm not familiar with DeepSpeed, so it will probably take a while.

Thank you. I found that when training with 1 GPU, only 40 GB of VRAM is used, but with the same config and two-GPU training, 80 GB of VRAM is used. Why?
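
One way to narrow this down (a generic PyTorch sketch, not sd-scripts code; accelerator stands for the script's Accelerator instance) is to log each rank's peak allocation after the model is prepared and again after the first optimizer step, then compare the single-GPU and multi-GPU numbers:

import torch

def report_peak_memory(accelerator, tag):
    # Peak memory actually allocated by tensors on this rank's GPU, in GiB.
    allocated = torch.cuda.max_memory_allocated() / 1024**3
    # Memory reserved by the CUDA caching allocator (>= allocated).
    reserved = torch.cuda.memory_reserved() / 1024**3
    print(f"[rank {accelerator.process_index}] {tag}: "
          f"peak allocated {allocated:.1f} GiB, reserved {reserved:.1f} GiB")

If the jump appears right after the first backward pass, the extra usage is likely the DDP gradient buckets rather than the model weights themselves.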

Ethan-niu commented 1 month ago

Updated the sd3 branch. Multi-GPU training should now work. Please report again if the issue remains.

I have four A100-40G GPUs. Is it feasible to train the FLUX model with multiple GPUs? I've been running into OOM, but when I add options like --deepspeed --zero_stage=2 --offload_optimizer_device="cpu", it reports the same error: "RuntimeError: mat1 and mat2 must have the same dtype, but got Float and BFloat16"

In sd-scripts, autocast is enabled, but for a reason I don't yet understand, autocasting does not take effect for the DeepSpeed-wrapped model. I think fixing it will take some time. However, another implementation of ZeRO, FSDP, does work, although it is not implemented in sd-scripts.

I trained SDXL with DeepSpeed using sd-scripts and it works, but FLUX does not, so I think the FLUX code may have some bugs?