kohya-ss / sd-scripts

Apache License 2.0

Error when using DeepSpeed for FLUX.1 fine-tuning #1591

Open huxian0402 opened 2 months ago

huxian0402 commented 2 months ago

@kohya-ss @lansing @rockerBOO @akx @tsukimiya With the following configuration, multi-GPU training works properly, and the results are normal. Does sd-scripts not support DeepSpeed acceleration? Could you help me check it?

CUDA_VISIBLE_DEVICES=1,2,3,4,5,6 accelerate launch --config_file base_configs/accelerate_config.yaml --mixed_precision bf16 --num_cpu_threads_per_process 1 flux_train.py \
  --pretrained_model_name_or_path /home/weights/FLUX.1-dev/flux1-dev.sft \
  --clip_l /home/weights/FLUX.1-dev/clip_l.safetensors \
  --t5xxl /home/weights/FLUX.1-dev/t5xxl_fp16.safetensors \
  --ae /home/weights/FLUX.1-dev/ae.sft \
  --resolution "1280,768" \
  --enable_bucket \
  --min_bucket_reso 256 \
  --max_bucket_reso 1280 \
  --bucket_reso_steps 64 \
  --save_model_as safetensors --sdpa --persistent_data_loader_workers --max_data_loader_n_workers 2 \
  --seed 42 --gradient_checkpointing --mixed_precision bf16 --save_precision bf16 \
  --train_data_dir ${TRAIN_DATA_DIR} --output_dir out_weights/${EXP_NAME} --output_name flux-sft \
  --learning_rate 1e-5 --max_train_epochs 100 --save_every_n_epochs 5 --sdpa --highvram --cache_text_encoder_outputs_to_disk --cache_latents_to_disk \
  --optimizer_type adafactor --optimizer_args "relative_step=False" "scale_parameter=False" "warmup_init=False" \
  --lr_scheduler constant_with_warmup --max_grad_norm 0.0 \
  --timestep_sampling shift --discrete_flow_shift 3.1582 --model_prediction_type raw --guidance_scale 1.0 \
  --fused_backward_pass --cpu_offload_checkpointing --full_bf16

However, the downside is that the maximum batch size per GPU can only be set to 1. I tried using DeepSpeed to reduce memory usage and increase the batch size, but the following error occurred:

enable full bf16 training.
rank4: Traceback (most recent call last):
rank4:   File "/home/project/sd-scripts_sd3/flux_train.py", line 905, in <module>
rank4:   File "/home/project/sd-scripts_sd3/flux_train.py", line 427, in train
rank4:     flux = accelerator.prepare(flux, device_placement=[not is_swapping_blocks])
rank4:   File "/home/work/miniforge3/envs/flux/lib/python3.10/site-packages/accelerate/accelerator.py", line 1248, in prepare
rank4:     raise ValueError("You can't customize device placements with DeepSpeed or Megatron-LM.")
rank4: ValueError: You can't customize device placements with DeepSpeed or Megatron-LM.
rank1: Traceback (most recent call last):
rank1:   File "/home/project/sd-scripts_sd3/flux_train.py", line 905, in <module>
rank1:   File "/home/project/sd-scripts_sd3/flux_train.py", line 427, in train
rank1:     flux = accelerator.prepare(flux, device_placement=[not is_swapping_blocks])
rank1:   File "/home/work/miniforge3/envs/flux/lib/python3.10/site-packages/accelerate/accelerator.py", line 1248, in prepare
rank1:     raise ValueError("You can't customize device placements with DeepSpeed or Megatron-LM.")
rank1: ValueError: You can't customize device placements with DeepSpeed or Megatron-LM.
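
For readers hitting the same traceback: the ValueError is raised because flux_train.py passes an explicit device_placement to accelerator.prepare() while the accelerate config selects DeepSpeed. Below is a minimal sketch of one possible guard, not the maintainers' fix. The helper name prepare_kwargs and the string-based backend check are hypothetical, introduced only for illustration, and whether block swapping then works correctly under DeepSpeed is not established anywhere in this thread.

```python
from typing import Any, Dict

# Backends named in the error message as rejecting a custom device_placement.
NO_CUSTOM_PLACEMENT = {"DEEPSPEED", "MEGATRON_LM"}


def prepare_kwargs(distributed_type: str, is_swapping_blocks: bool) -> Dict[str, Any]:
    """Hypothetical helper: build keyword arguments for accelerator.prepare()."""
    if distributed_type.upper() in NO_CUSTOM_PLACEMENT:
        # Leave placement to the backend; passing device_placement would raise.
        return {}
    # Same value the script passes at flux_train.py line 427.
    return {"device_placement": [not is_swapping_blocks]}


if __name__ == "__main__":
    print(prepare_kwargs("DEEPSPEED", False))
    print(prepare_kwargs("MULTI_GPU", False))
```

In the script, this could presumably be wired in as accelerator.prepare(flux, **prepare_kwargs(accelerator.distributed_type.name, is_swapping_blocks)); that call is an untested assumption about how the helper would be used.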

Here is the DeepSpeed configuration file deepspeed_config.yaml:

compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 2
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 6
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
main_process_port: 29501
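
Since the goal is a larger batch size, it may help to see what this configuration already implies. The sketch below computes the number of samples per optimizer step for plain data-parallel training, using num_processes and gradient_accumulation_steps from the config above and --train_batch_size from the DeepSpeed command. The function name is hypothetical, and it assumes the accumulation value in the DeepSpeed config is the one actually in effect.

```python
def effective_batch_size(num_processes: int, per_gpu_batch: int, grad_accum_steps: int) -> int:
    """Samples contributing to one optimizer step in data-parallel training."""
    return num_processes * per_gpu_batch * grad_accum_steps


if __name__ == "__main__":
    # num_processes: 6, --train_batch_size 1, gradient_accumulation_steps: 2
    print(effective_batch_size(6, 1, 2))  # 12
```

Gradient accumulation raises the effective batch size without raising per-GPU memory for activations, which is a separate lever from the memory savings DeepSpeed ZeRO is meant to provide.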

Here is the training command used with DeepSpeed:

CUDA_VISIBLE_DEVICES=1,2,3,4,5,6 accelerate launch --config_file base_configs/deepspeed_config.yaml --mixed_precision bf16 --num_cpu_threads_per_process 1 flux_train.py \
  --pretrained_model_name_or_path /home/weights/FLUX.1-dev/flux1-dev.sft \
  --clip_l /home/weights/FLUX.1-dev/clip_l.safetensors \
  --t5xxl /home/weights/FLUX.1-dev/t5xxl_fp16.safetensors \
  --ae /home/weights/FLUX.1-dev/ae.sft \
  --train_batch_size 1 \
  --resolution "1280,768" \
  --enable_bucket \
  --min_bucket_reso 256 \
  --max_bucket_reso 1280 \
  --bucket_reso_steps 64 \
  --save_model_as safetensors --sdpa --persistent_data_loader_workers --max_data_loader_n_workers 2 \
  --seed 42 --gradient_checkpointing --mixed_precision bf16 --save_precision bf16 \
  --train_data_dir ${TRAIN_DATA_DIR} --output_dir out_weights/${EXP_NAME} --output_name flux-sft \
  --learning_rate 1e-5 --max_train_epochs 100 --save_every_n_epochs 5 --sdpa --highvram --cache_text_encoder_outputs_to_disk --cache_latents_to_disk \
  --optimizer_type adafactor --optimizer_args "relative_step=False" "scale_parameter=False" "warmup_init=False" \
  --lr_scheduler constant_with_warmup --max_grad_norm 0.0 \
  --timestep_sampling shift --discrete_flow_shift 3.1582 --model_prediction_type raw --guidance_scale 1.0 \
  --fused_backward_pass --cpu_offload_checkpointing --full_bf16

huxian0402 commented 1 month ago

@kohya-ss Could you please help check this issue? FLUX is unable to use multi-node training because DeepSpeed is not working.

bongmo commented 1 month ago

I think you are missing the "--deepspeed" parameter.

wanglaofei commented 1 week ago

Have you solved the problem?