neuron-party opened this issue 4 days ago
Can you try to increase the NCCL timeout value and see if that helps?
@sayakpaul I did, by passing the timeout arg when initializing the Accelerator object. Increasing it to a reasonable number delays the error to a later iteration; increasing it to too large a number causes a timeout of its own.
Okay. Then maybe precomputing the outputs of the dataset processing step would be more useful in this setup?
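One hedged way to act on that suggestion is a separate single-process preprocessing script that runs the expensive map() once and caches the result to disk, so the multi-GPU job never holds the other ranks at a barrier. Everything below (the precompute_embeddings.py name, the train.jsonl path, the text column, the dummy prompt_embeds output) is hypothetical and not part of the diffusers script:

```python
# precompute_embeddings.py -- hypothetical offline preprocessing step.
# Run it once (single process) before launching the distributed training job.
from datasets import load_dataset

def compute_embeddings(batch):
    # Placeholder: a real run would call the FLUX text encoders here and return
    # the prompt/pooled embeddings as lists so `datasets` can serialize them.
    return {"prompt_embeds": [[0.0]] * len(batch["text"])}

dataset = load_dataset("json", data_files="train.jsonl", split="train")
dataset = dataset.map(compute_embeddings, batched=True, batch_size=50)
dataset.save_to_disk("precomputed_train_dataset")

# The training job can then call datasets.load_from_disk("precomputed_train_dataset")
# on every rank instead of recomputing the embeddings inside the distributed run.
```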
I trained the SD3 ControlNet and hit the same issue. I also found that during multi-GPU training the text embeddings are only computed on one GPU, and I don't understand why.
Yeah that is how it's coded. For full-blown distributed support, I welcome you to check out https://github.com/huggingface/diffusers/blob/main/examples/research_projects/controlnet/train_controlnet_webdataset.py as a reference.
The training script is meant to serve as an educational reference.
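For context on why only one GPU does this work: the single-node example scripts typically wrap the embedding computation in a main-process-first block, roughly as sketched below (a paraphrase, not an exact copy of train_controlnet_flux.py). The other ranks wait at a collective barrier while rank 0 runs the map, and if that wait exceeds the process-group timeout the NCCL watchdog raises the error reported in this issue.

```python
# Rough paraphrase of the single-node pattern (accelerator and
# compute_embeddings_fn are set up earlier in the training script):
# rank 0 computes and caches the embeddings, the other ranks block on a
# barrier and then reuse the Arrow cache written by rank 0.
with accelerator.main_process_first():
    train_dataset = train_dataset.map(
        compute_embeddings_fn,
        batched=True,
        batch_size=50,
    )
# The wait inside main_process_first() is a collective call, so if the map on
# rank 0 outlives the NCCL timeout, the other ranks abort with the watchdog error.
```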
Describe the bug
Running train_controlnet_flux.py with multiple GPUs results in an NCCL timeout error after N iterations of train_dataset.map(). The error can be partially mitigated by initializing the Accelerator with a larger timeout argument in the following way:
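(A minimal sketch of that kind of override, assuming the standard accelerate InitProcessGroupKwargs handler; the exact snippet and values from the report may differ.)

```python
from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# Raise the process-group (NCCL) timeout above the default so that ranks
# waiting on the main process during train_dataset.map() do not trip the
# watchdog. The 2-hour value is illustrative, not taken from the report.
process_group_kwargs = InitProcessGroupKwargs(timeout=timedelta(hours=2))
accelerator = Accelerator(kwargs_handlers=[process_group_kwargs])
```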
However, the NCCL timeout error recurs at a later iteration of train_dataset.map().
Reproduction
```bash
accelerate launch --config_file configs/distributed train_controlnet_flux.py \
  --pretrained_model_name_or_path="black-forest-labs/FLUX.1-dev" \
  --conditioning_image_column=conditioning_image \
  --image_column=image \
  --caption_column=text \
  --output_dir="path" \
  --mixed_precision="bf16" \
  --resolution=1024 \
  --learning_rate=5e-6 \
  --max_train_steps=100000 \
  --validation_steps=1000 \
  --checkpointing_steps=25000 \
  --validation_image "placeholder" \
  --validation_prompt "placeholder" \
  --train_batch_size=4 \
  --gradient_accumulation_steps=1 \
  --report_to="tensorboard" \
  --seed=42 \
  --jsonl_for_train="path" \
  --cache_dir="path"
```
```yaml
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 4
use_cpu: false
```
Logs
System Info
diffusers: installed from source
accelerate == 1.1.1
datasets == 3.1.0
transformers == 4.46.2
Who can help?
No response