huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

nccl timeout on train_controlnet_flux.py when doing multigpu training #9936

Status: Open. Opened by neuron-party 4 days ago

neuron-party commented 4 days ago

Describe the bug

Running train_controlnet_flux.py with multiple GPUs results in an NCCL timeout error after N iterations of train_dataset.map(). This error can be partially worked around by initializing Accelerator with a larger timeout argument in the following way:

from accelerate import Accelerator, InitProcessGroupKwargs
from datetime import timedelta

# N is a placeholder for the timeout in seconds
x = InitProcessGroupKwargs(timeout=timedelta(seconds=N))

accelerator = Accelerator(
    ...,
    kwargs_handlers=[x]
)

However, the NCCL timeout error recurs at a later iteration of train_dataset.map().

Reproduction

accelerate launch --config_file configs/distributed train_controlnet_flux.py \
  --pretrained_model_name_or_path="black-forest-labs/FLUX.1-dev" \
  --conditioning_image_column=conditioning_image \
  --image_column=image \
  --caption_column=text \
  --output_dir="path" \
  --mixed_precision="bf16" \
  --resolution=1024 \
  --learning_rate=5e-6 \
  --max_train_steps=100000 \
  --validation_steps=1000 \
  --checkpointing_steps=25000 \
  --validation_image "placeholder" \
  --validation_prompt "placeholder" \
  --train_batch_size=4 \
  --gradient_accumulation_steps=1 \
  --report_to="tensorboard" \
  --seed=42 \
  --jsonl_for_train="path" \
  --cache_dir="path"

compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 4
use_cpu: false

Logs

Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.

System Info

diffusers: installed from source
accelerate: 1.1.1
datasets: 3.1.0
transformers: 4.46.2

Who can help?

No response

sayakpaul commented 3 days ago

Can you try to increase the NCCL timeout value and see if that helps?

neuron-party commented 3 days ago

@sayakpaul I did, by passing the timeout arg when initializing the Accelerator object. Increasing it to a reasonable number only delays the error to a later iteration; increasing it to too large a number causes a timeout of its own.

sayakpaul commented 3 days ago

Okay. Then maybe precomputing the outputs of the dataset processing step would be more useful in this setup?
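A minimal sketch of that precomputation approach; the preprocess function, column name, and paths below are placeholders, not the actual code from train_controlnet_flux.py:

# Hedged sketch: run the expensive preprocessing once, outside the distributed run,
# so no rank sits blocked in an NCCL collective while train_dataset.map() executes.
from datasets import load_dataset, load_from_disk

def preprocess(batch):
    # Placeholder for the script's real per-batch work (e.g. text-embedding computation).
    batch["caption_length"] = [len(t) for t in batch["text"]]
    return batch

if __name__ == "__main__":
    ds = load_dataset("json", data_files="train.jsonl", split="train")  # assumed layout
    ds = ds.map(preprocess, batched=True, batch_size=64)
    ds.save_to_disk("precomputed_train")
    # In the training run, swap the .map() call for:
    # train_dataset = load_from_disk("precomputed_train")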

xduzhangjiayu commented 3 days ago

I trained an SD3 ControlNet and hit the same issue. I also found that during multi-GPU training the text embeddings are only computed on one GPU, and I really don't know why.

sayakpaul commented 1 day ago

Yeah that is how it's coded. For full-blown distributed support, I welcome you to check out https://github.com/huggingface/diffusers/blob/main/examples/research_projects/controlnet/train_controlnet_webdataset.py as a reference.

The training script is meant to serve as an educational reference.
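For readers hitting the same thing: the single-GPU text-embedding behaviour described above matches the common accelerate pattern of running the dataset map() under main_process_first(). A simplified sketch of that pattern (an illustration of the general pattern, not the script's exact code) also shows why a long map() can trip the NCCL timeout on the waiting ranks:

# Simplified illustration, not the script's exact code: the main process runs the
# costly map() while the other ranks wait at the context manager's barrier; in a
# real run they then reuse the cached result instead of recomputing it. If the
# map() outlasts the NCCL timeout, the waiting ranks raise the watchdog error above.
from accelerate import Accelerator
from datasets import Dataset

accelerator = Accelerator()
train_dataset = Dataset.from_dict({"text": ["a photo", "another photo"]})  # stand-in data

def compute_embeddings(batch):
    # placeholder for the real text-encoder forward pass
    return batch

with accelerator.main_process_first():
    train_dataset = train_dataset.map(compute_embeddings, batched=True)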