
DDPO job with Accelerator fails in a multi-gpu node #2090

Open shashankg7 opened 3 days ago

shashankg7 commented 3 days ago

System Info

Information

Tasks

Reproduction

I am trying to run the DDPO script: https://github.com/huggingface/trl/blob/main/examples/scripts/ddpo.py, on a slurm single node with 4 GPUs using the following job script:

#!/bin/bash

#SBATCH --job-name=multigpu
#SBATCH --output=O-%x.%j
#SBATCH --error=E-%x.%j
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1         # number of MP tasks
#SBATCH --gres=gpu:4                # number of GPUs per node
#SBATCH --cpus-per-task=64        # number of cores per tasks
#SBATCH --time=01:59:00             # maximum execution time (HH:MM:SS)

######################
### Set environment ###
######################
source /data/home/shashankgupta/miniconda3/etc/profile.d/conda.sh 
conda activate vqdiffuser
export GPUS_PER_NODE=4
######################

export ACCELERATE_DIR="${ACCELERATE_DIR:-/accelerate}"

accelerate launch --num_processes $GPUS_PER_NODE --multi_gpu ddpo.py --num_epochs=200 --train_gradient_accumulation_steps=1 --sample_num_steps=50 --sample_batch_size=6  --train_batch_size=3  --sample_num_batches_per_epoch=4 --train_learning_rate=3e-4 --per_prompt_stat_tracking=True  --mixed_precision=no --per_prompt_stat_tracking_buffer_size=64 --tracker_project_name="stable_diffusion_training" --log_with="wandb"

I am getting the following error message across the different ranks (the same traceback repeats on each rank):

[rank3]:   File "/opt/hpcaas/.mounts/fs-074514506a8464fcb/home/shashankgupta/RLHF_Compositionality/ddpo.py", line 208, in <module>
[rank3]:     trainer.train()
[rank3]:   File "/data/home/shashankgupta/miniconda3/envs/vqdiffuser/lib/python3.9/site-packages/trl/trainer/ddpo_trainer.py", line 605, in train
[rank3]:     global_step = self.step(epoch, global_step)
[rank3]:   File "/data/home/shashankgupta/miniconda3/envs/vqdiffuser/lib/python3.9/site-packages/trl/trainer/ddpo_trainer.py", line 267, in step
[rank3]:     self.image_samples_callback(prompt_image_data, global_step, self.accelerator.trackers[0])
[rank3]: IndexError: list index out of range
[rank1]: Traceback (most recent call last):
[rank1]:   File "/opt/hpcaas/.mounts/fs-074514506a8464fcb/home/shashankgupta/RLHF_Compositionality/ddpo.py", line 208, in <module>
[rank1]:     trainer.train()
[rank1]:   File "/data/home/shashankgupta/miniconda3/envs/vqdiffuser/lib/python3.9/site-packages/trl/trainer/ddpo_trainer.py", line 605, in train
[rank1]:     global_step = self.step(epoch, global_step)
[rank1]:   File "/data/home/shashankgupta/miniconda3/envs/vqdiffuser/lib/python3.9/site-packages/trl/trainer/ddpo_trainer.py", line 267, in step
[rank1]:     self.image_samples_callback(prompt_image_data, global_step, self.accelerator.trackers[0])
[rank1]: IndexError: list index out of range
[rank2]: Traceback (most recent call last):
[rank2]:   File "/opt/hpcaas/.mounts/fs-074514506a8464fcb/home/shashankgupta/RLHF_Compositionality/ddpo.py", line 208, in <module>
[rank2]:     trainer.train()
[rank2]:   File "/data/home/shashankgupta/miniconda3/envs/vqdiffuser/lib/python3.9/site-packages/trl/trainer/ddpo_trainer.py", line 605, in train
[rank2]:     global_step = self.step(epoch, global_step)
[rank2]:   File "/data/home/shashankgupta/miniconda3/envs/vqdiffuser/lib/python3.9/site-packages/trl/trainer/ddpo_trainer.py", line 267, in step
[rank2]:     self.image_samples_callback(prompt_image_data, global_step, self.accelerator.trackers[0])
[rank2]: IndexError: list index out of range

Expected behavior

The script should train on multiple GPUs just as it does on a single GPU, where it runs without error.
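A likely explanation (my assumption, not confirmed against the trl source): Accelerate only creates trackers on the main process, so accelerator.trackers is an empty list on ranks > 0, while ddpo_trainer.py indexes trackers[0] on every rank when calling image_samples_callback. A minimal standalone sketch that illustrates this, using a hypothetical file name check_trackers.py and the same launch command as the training run:

# check_trackers.py -- hypothetical repro sketch, run with:
#   accelerate launch --multi_gpu --num_processes 4 check_trackers.py
from accelerate import Accelerator

accelerator = Accelerator(log_with="wandb")
# init_trackers only executes on the main process, so ranks > 0 keep an empty list.
accelerator.init_trackers("stable_diffusion_training")

print(f"rank={accelerator.process_index} num_trackers={len(accelerator.trackers)}")

if accelerator.is_main_process:
    # Only safe on the main process; on the other ranks trackers[0] raises
    # IndexError, which matches the traceback above.
    print(type(accelerator.trackers[0]))

accelerator.end_training()

If that holds, guarding the image_samples_callback call with accelerator.is_main_process (either in ddpo_trainer.py's step() or inside the callback itself) would be the obvious workaround, but I have not tested this.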

TongLiu-github commented 1 day ago

@shashankg7 I am running into a similar error: everything works with a single GPU but fails with more than one. Did you solve this problem? The only difference is that I am using ORPOTrainer with the example code.

I get the following error message:

[rank1]:[E923 02:13:27.841163561 ProcessGroupNCCL.cpp:607] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=38597376, NumelOut=38597376, Timeout(ms)=1800000) ran for 1800047 milliseconds before timing out.
[rank0]:[E923 02:13:27.841320769 ProcessGroupNCCL.cpp:607] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=38597376, NumelOut=38597376, Timeout(ms)=1800000) ran for 1800047 milliseconds before timing out.
[rank1]:[E923 02:13:27.937374548 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 2, last completed NCCL work: -1.
[rank0]:[E923 02:13:27.943049056 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 2, last completed NCCL work: -1.
[rank0]:[E923 02:13:30.295290567 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 0] Timeout at NCCL work: 1, last enqueued NCCL work: 2, last completed NCCL work: -1.
[rank0]:[E923 02:13:30.295337737 ProcessGroupNCCL.cpp:621] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E923 02:13:30.295347767 ProcessGroupNCCL.cpp:627] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E923 02:13:30.411567476 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=38597376, NumelOut=38597376, Timeout(ms)=1800000) ran for 1800047 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x739ef8ecbf86 in /home/wiss/liu/anaconda3/envs/simpo/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x739eaadca8f2 in /home/wiss/liu/anaconda3/envs/simpo/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x739eaadd1333 in /home/wiss/liu/anaconda3/envs/simpo/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x739eaadd371c in /home/wiss/liu/anaconda3/envs/simpo/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdbbf4 (0x739ef9416bf4 in /home/wiss/liu/anaconda3/envs/simpo/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x9ca94 (0x739ef9e9ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x129c3c (0x739ef9f29c3c in /lib/x86_64-linux-gnu/libc.so.6)

[rank1]:[E923 02:13:30.480022057 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 1] Timeout at NCCL work: 1, last enqueued NCCL work: 2, last completed NCCL work: -1.
[rank1]:[E923 02:13:30.480077366 ProcessGroupNCCL.cpp:621] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E923 02:13:30.480084906 ProcessGroupNCCL.cpp:627] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E923 02:13:30.481265143 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=38597376, NumelOut=38597376, Timeout(ms)=1800000) ran for 1800047 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x72aa146cbf86 in /home/wiss/liu/anaconda3/envs/simpo/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x72a9c65ca8f2 in /home/wiss/liu/anaconda3/envs/simpo/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x72a9c65d1333 in /home/wiss/liu/anaconda3/envs/simpo/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x72a9c65d371c in /home/wiss/liu/anaconda3/envs/simpo/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdbbf4 (0x72aa14c84bf4 in /home/wiss/liu/anaconda3/envs/simpo/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x9ca94 (0x72aa1589ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x129c3c (0x72aa15929c3c in /lib/x86_64-linux-gnu/libc.so.6)

W0923 02:13:30.841000 123608268355072 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 3293540 closing signal SIGTERM
E0923 02:13:30.906000 123608268355072 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -6) local_rank: 0 (pid: 3293539) of binary: /home/wiss/liu/anaconda3/envs/simpo/bin/python
Warning: The cache directory for DeepSpeed Triton autotune, /home/wiss/liu/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path.
Traceback (most recent call last):
  File "/home/wiss/liu/anaconda3/envs/simpo/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/wiss/liu/anaconda3/envs/simpo/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/home/wiss/liu/anaconda3/envs/simpo/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1159, in launch_command
    deepspeed_launcher(args)
  File "/home/wiss/liu/anaconda3/envs/simpo/lib/python3.10/site-packages/accelerate/commands/launch.py", line 852, in deepspeed_launcher
    distrib_run.run(args)
  File "/home/wiss/liu/anaconda3/envs/simpo/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/wiss/liu/anaconda3/envs/simpo/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/wiss/liu/anaconda3/envs/simpo/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
========================================================
orpo.py FAILED
--------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-09-23_02:13:30
  host      : worker-6
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 3293539)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 3293539
========================================================

I went through some related issues raised by others, and I am not setting device_map='auto'. I think something must be wrong in the code.

Update: problem solved following https://github.com/huggingface/accelerate/issues/314#issue-1201142707.
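For anyone else who hits this NCCL broadcast timeout: one common mitigation (not necessarily what the linked issue describes, and independent of the actual root cause here) is to raise the default 30-minute collective timeout. When driving Accelerate directly, that can be done through InitProcessGroupKwargs; a minimal sketch, assuming a plain Accelerator setup rather than the ORPOTrainer path:

from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# Raise the process-group timeout from the default 30 minutes to 2 hours so a
# slow start-up collective (e.g. the initial weight BROADCAST in the log above)
# does not trip the NCCL watchdog. The timeout is forwarded to
# torch.distributed.init_process_group.
pg_kwargs = InitProcessGroupKwargs(timeout=timedelta(hours=2))
accelerator = Accelerator(kwargs_handlers=[pg_kwargs])

For Trainer-based setups such as ORPOTrainer, the corresponding knob should be the ddp_timeout field on the training arguments (ORPOConfig inherits from transformers.TrainingArguments), though I have not verified that it resolves this particular hang.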