@kohya-ss @lansing @rockerBOO @akx @tsukimiya With the following configuration, multi-GPU training works properly, and the results are normal. Does sd-scripts not support DeepSpeed acceleration? Could you help me check it?
CUDA_VISIBLE_DEVICES=1,2,3,4,5,6 accelerate launch --config_file base_configs/accelerate_config.yaml --mixed_precision bf16 --num_cpu_threads_per_process 1 flux_train.py \
  --pretrained_model_name_or_path /home/weights/FLUX.1-dev/flux1-dev.sft \
  --clip_l /home/weights/FLUX.1-dev/clip_l.safetensors \
  --t5xxl /home/weights/FLUX.1-dev/t5xxl_fp16.safetensors \
  --ae /home/weights/FLUX.1-dev/ae.sft \
  --resolution "1280,768" \
  --enable_bucket \
  --min_bucket_reso 256 \
  --max_bucket_reso 1280 \
  --bucket_reso_steps 64 \
  --save_model_as safetensors --sdpa --persistent_data_loader_workers --max_data_loader_n_workers 2 \
  --seed 42 --gradient_checkpointing --mixed_precision bf16 --save_precision bf16 \
  --train_data_dir ${TRAIN_DATA_DIR} --output_dir out_weights/${EXP_NAME} --output_name flux-sft \
  --learning_rate 1e-5 --max_train_epochs 100 --save_every_n_epochs 5 --sdpa --highvram --cache_text_encoder_outputs_to_disk --cache_latents_to_disk \
  --optimizer_type adafactor --optimizer_args "relative_step=False" "scale_parameter=False" "warmup_init=False" \
  --lr_scheduler constant_with_warmup --max_grad_norm 0.0 \
  --timestep_sampling shift --discrete_flow_shift 3.1582 --model_prediction_type raw --guidance_scale 1.0 \
  --fused_backward_pass --cpu_offload_checkpointing --full_bf16
However, the downside is that the maximum batch size per GPU can only be set to 1. I tried using DeepSpeed to reduce memory usage and increase the batch size, but the following error occurred:

enable full bf16 training.
rank4: Traceback (most recent call last):
rank4: File "/home/project/sd-scripts_sd3/flux_train.py", line 905, in
rank4: File "/home/project/sd-scripts_sd3/flux_train.py", line 427, in train
rank4: flux = accelerator.prepare(flux, device_placement=[not is_swapping_blocks])
rank4: File "/home/work/miniforge3/envs/flux/lib/python3.10/site-packages/accelerate/accelerator.py", line 1248, in prepare
rank4: raise ValueError("You can't customize device placements with DeepSpeed or Megatron-LM.")
rank4: ValueError: You can't customize device placements with DeepSpeed or Megatron-LM.
rank1: Traceback (most recent call last):
rank1: File "/home/project/sd-scripts_sd3/flux_train.py", line 905, in
rank1: File "/home/project/sd-scripts_sd3/flux_train.py", line 427, in train
rank1: flux = accelerator.prepare(flux, device_placement=[not is_swapping_blocks])
rank1: File "/home/work/miniforge3/envs/flux/lib/python3.10/site-packages/accelerate/accelerator.py", line 1248, in prepare
rank1: raise ValueError("You can't customize device placements with DeepSpeed or Megatron-LM.")
rank1: ValueError: You can't customize device placements with DeepSpeed or Megatron-LM.
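For context, a minimal sketch of where the error comes from and one possible workaround, assuming the accelerator, flux, and is_swapping_blocks variables from the traceback above. This only illustrates the Accelerate behavior; it is not a verified patch to flux_train.py:

# A minimal sketch, assuming accelerator, flux and is_swapping_blocks exist as
# in the traceback. When Accelerate runs under DeepSpeed, prepare() refuses a
# custom device_placement list, which is exactly the ValueError shown above.
from accelerate.utils import DistributedType

if accelerator.distributed_type == DistributedType.DEEPSPEED:
    # DeepSpeed manages device placement itself, so prepare() must be called
    # without a device_placement argument.
    flux = accelerator.prepare(flux)
else:
    flux = accelerator.prepare(flux, device_placement=[not is_swapping_blocks])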
Here is the DeepSpeed configuration file, deepspeed_config.yaml:
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 2
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 6
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
main_process_port: 29501
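For reference, the same settings can also be expressed through Accelerate's DeepSpeedPlugin. This is only a sketch of what the YAML maps to, not how flux_train.py actually constructs its Accelerator:

# A minimal sketch mirroring the YAML above via accelerate's DeepSpeedPlugin.
# Parameter names follow the accelerate API; illustrative only.
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

ds_plugin = DeepSpeedPlugin(
    zero_stage=2,                    # zero_stage: 2
    gradient_accumulation_steps=2,   # gradient_accumulation_steps: 2
    offload_optimizer_device="none",
    offload_param_device="none",
    zero3_init_flag=True,            # zero3_init_flag: true (only takes effect with ZeRO stage 3)
)
accelerator = Accelerator(mixed_precision="bf16", deepspeed_plugin=ds_plugin)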
Here is the training command used with DeepSpeed:

CUDA_VISIBLE_DEVICES=1,2,3,4,5,6 accelerate launch --config_file base_configs/deepspeed_config.yaml --mixed_precision bf16 --num_cpu_threads_per_process 1 flux_train.py \
  --pretrained_model_name_or_path /home/weights/FLUX.1-dev/flux1-dev.sft \
  --clip_l /home/weights/FLUX.1-dev/clip_l.safetensors \
  --t5xxl /home/weights/FLUX.1-dev/t5xxl_fp16.safetensors \
  --ae /home/weights/FLUX.1-dev/ae.sft \
  --train_batch_size 1 \
  --resolution "1280,768" \
  --enable_bucket \
  --min_bucket_reso 256 \
  --max_bucket_reso 1280 \
  --bucket_reso_steps 64 \
  --save_model_as safetensors --sdpa --persistent_data_loader_workers --max_data_loader_n_workers 2 \
  --seed 42 --gradient_checkpointing --mixed_precision bf16 --save_precision bf16 \
  --train_data_dir ${TRAIN_DATA_DIR} --output_dir out_weights/${EXP_NAME} --output_name flux-sft \
  --learning_rate 1e-5 --max_train_epochs 100 --save_every_n_epochs 5 --sdpa --highvram --cache_text_encoder_outputs_to_disk --cache_latents_to_disk \
  --optimizer_type adafactor --optimizer_args "relative_step=False" "scale_parameter=False" "warmup_init=False" \
  --lr_scheduler constant_with_warmup --max_grad_norm 0.0 \
  --timestep_sampling shift --discrete_flow_shift 3.1582 --model_prediction_type raw --guidance_scale 1.0 \
  --fused_backward_pass --cpu_offload_checkpointing --full_bf16
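For completeness, a small arithmetic sketch of the effective batch size this DeepSpeed run is aiming for, assuming all six visible GPUs act as data-parallel ranks:

# Effective batch size implied by the DeepSpeed command and config above.
num_gpus = 6                     # CUDA_VISIBLE_DEVICES=1,2,3,4,5,6
per_gpu_batch_size = 1           # --train_batch_size 1
gradient_accumulation_steps = 2  # from deepspeed_config.yaml
effective_batch_size = num_gpus * per_gpu_batch_size * gradient_accumulation_steps
print(effective_batch_size)      # 12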