huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

SDXL Training Fails for Multi GPU Machine #7960

Closed: humanely closed this issue 5 months ago

humanely commented 5 months ago

Describe the bug

While the training script **train_text_to_image_lora_sdxl.py** runs perfectly fine on a single-GPU A100 machine, it fails to complete the dataset mapping on machines with multiple GPUs: the run dies with an NCCL all-reduce watchdog timeout partway through the `Map` step (full log in the Logs section below). I have already tried the following:

export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1


Reproduction

Execute the example training script on A100 Machine with 4 GPU
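
For reference, a command of roughly this shape reproduces the setup; the dataset, paths, and hyperparameters below are illustrative placeholders, not the exact ones from the original run:

export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1
accelerate launch --multi_gpu --num_processes=4 --mixed_precision=fp16 \
  train_text_to_image_lora_sdxl.py \
  --pretrained_model_name_or_path=stabilityai/stable-diffusion-xl-base-1.0 \
  --dataset_name=<your MS-COCO dataset> \
  --resolution=1024 \
  --train_batch_size=1 \
  --output_dir=sdxl-lora-out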

Logs

05/16/2024 13:59:48 - INFO - __main__ - Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 4
Process index: 0
Local process index: 0
Device: cuda:0

Mixed precision type: fp16

You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'variance_type', 'dynamic_thresholding_ratio', 'clip_sample_range', 'thresholding'} was not found in config. Values will be initialized to default values.
05/16/2024 13:59:48 - INFO - __main__ - Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 4
Process index: 3
Local process index: 3
Device: cuda:3

Mixed precision type: fp16

05/16/2024 13:59:48 - INFO - __main__ - Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 4
Process index: 2
Local process index: 2
Device: cuda:2

Mixed precision type: fp16

05/16/2024 13:59:48 - INFO - __main__ - Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 4
Process index: 1
Local process index: 1
Device: cuda:1

Mixed precision type: fp16 

Map:  12%|██████████████▉                                                                                                               | 28000/236574 [09:59<1:10:56, 49.00 examples/s][rank1]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600015 milliseconds before timing out.
[rank2]:[E ProcessGroupNCCL.cpp:563] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 6000

System Info

Who can help?

@sayakpaul

bghira commented 5 months ago

Try adding this to the training script:

import os
import torch

if torch.cuda.is_available():
    os.environ["NCCL_SOCKET_NTIMEO"] = "2000000"  # NCCL socket timeout, in ms
bghira commented 5 months ago

There's also the `InitProcessGroupKwargs` way of setting it:

from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# Create the custom configuration: raise the NCCL collective timeout
# from the default 10 minutes to 1.5 hours.
process_group_kwargs = InitProcessGroupKwargs(timeout=timedelta(seconds=5400))

accelerator = Accelerator(
    gradient_accumulation_steps=args.gradient_accumulation_steps,
    mixed_precision=args.mixed_precision,
    log_with=args.report_to,
    project_config=accelerator_project_config,
    kwargs_handlers=[process_group_kwargs],
)
humanely commented 5 months ago

Thanks @bghira, I think that solves the issue; otherwise I may increase the timeout to 6 hours. Here is a screenshot of the mapping step: it is taking more than 5 hours in total. Is that normal? I am using the MS-COCO image dataset.

[Screenshot: dataset mapping progress bar]

bghira commented 5 months ago

It is quite typical, and raising the timeout is a common workaround, but I don't have full-time access to multi-GPU systems, so I would love to experiment with different workarounds.

They should be trivial to test (see the sketch below): reduce the Accelerate NCCL timeout to a very low value, then try a candidate workaround while running a blocking task that lasts longer than the timeout.

I'm assuming there's some kind of way to signal to NCCL that you're still "busy" and "alive".
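
A minimal sketch of that kind of test, assuming an Accelerate-based script launched on multiple GPUs; the 10-second timeout and the `time.sleep` stand-in for the blocking preprocessing step are illustrative, not taken from the training script:

import time
from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# Deliberately tiny NCCL timeout so the watchdog fires quickly while testing.
kwargs = InitProcessGroupKwargs(timeout=timedelta(seconds=10))
accelerator = Accelerator(kwargs_handlers=[kwargs])

# Simulate a long CPU-bound step (e.g. dataset mapping) on the main process only.
# The other ranks wait at the barrier inside main_process_first() and should hit
# the NCCL timeout unless the workaround under test keeps the process group alive.
with accelerator.main_process_first():
    if accelerator.is_main_process:
        time.sleep(60)

accelerator.wait_for_everyone()
print(f"rank {accelerator.process_index} survived the blocking step")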

bghira commented 5 months ago

For what it's worth, this is why I cache things to disk in SimpleTuner: you can point it at, e.g., a Cloudflare R2 object storage bucket holding the training data, so that you can do the preprocessing on one system and resume the actual training job on a bigger/more expensive system.
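
The same idea can be approximated with stock 🤗 Datasets by running the slow map once and caching the result to disk; a rough sketch, not SimpleTuner's implementation, where `preprocess_fn`, the dataset source, and the cache directory are placeholders:

from datasets import load_dataset, load_from_disk

CACHE_DIR = "/data/mscoco_sdxl_preprocessed"  # placeholder cache location

# One-time preprocessing pass (can run on a cheaper, CPU-heavy machine).
dataset = load_dataset("imagefolder", data_dir="/data/mscoco")  # placeholder source
dataset = dataset["train"].map(preprocess_fn, batched=True, num_proc=8)  # preprocess_fn: your mapping function
dataset.save_to_disk(CACHE_DIR)

# Later, on the multi-GPU training machine, load the cached dataset and skip the slow map.
train_dataset = load_from_disk(CACHE_DIR)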

sayakpaul commented 5 months ago

Thanks for helping, @bghira!

@humanely the example training scripts are not optimized to give you the best preprocessing or training throughput, so we cannot guarantee that. Please refer to @bghira's SimpleTuner repo, which has better offerings in this regard.

Closing, since the issue seems to be resolved now.