shashankg7 opened 3 days ago
@shashankg7 I hit a similar error: training works with a single GPU but fails with more than one. Did you solve this problem? The only difference is that I use ORPOTrainer with the example code.
I got this error message:
[rank1]:[E923 02:13:27.841163561 ProcessGroupNCCL.cpp:607] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=38597376, NumelOut=38597376, Timeout(ms)=1800000) ran for 1800047 milliseconds before timing out.
[rank0]:[E923 02:13:27.841320769 ProcessGroupNCCL.cpp:607] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=38597376, NumelOut=38597376, Timeout(ms)=1800000) ran for 1800047 milliseconds before timing out.
[rank1]:[E923 02:13:27.937374548 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 2, last completed NCCL work: -1.
[rank0]:[E923 02:13:27.943049056 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 2, last completed NCCL work: -1.
[rank0]:[E923 02:13:30.295290567 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 0] Timeout at NCCL work: 1, last enqueued NCCL work: 2, last completed NCCL work: -1.
[rank0]:[E923 02:13:30.295337737 ProcessGroupNCCL.cpp:621] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E923 02:13:30.295347767 ProcessGroupNCCL.cpp:627] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E923 02:13:30.411567476 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=38597376, NumelOut=38597376, Timeout(ms)=1800000) ran for 1800047 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x739ef8ecbf86 in /home/wiss/liu/anaconda3/envs/simpo/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x739eaadca8f2 in /home/wiss/liu/anaconda3/envs/simpo/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x739eaadd1333 in /home/wiss/liu/anaconda3/envs/simpo/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x739eaadd371c in /home/wiss/liu/anaconda3/envs/simpo/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdbbf4 (0x739ef9416bf4 in /home/wiss/liu/anaconda3/envs/simpo/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x9ca94 (0x739ef9e9ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x129c3c (0x739ef9f29c3c in /lib/x86_64-linux-gnu/libc.so.6)
[rank1]:[E923 02:13:30.480022057 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 1] Timeout at NCCL work: 1, last enqueued NCCL work: 2, last completed NCCL work: -1.
[rank1]:[E923 02:13:30.480077366 ProcessGroupNCCL.cpp:621] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E923 02:13:30.480084906 ProcessGroupNCCL.cpp:627] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E923 02:13:30.481265143 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=38597376, NumelOut=38597376, Timeout(ms)=1800000) ran for 1800047 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x72aa146cbf86 in /home/wiss/liu/anaconda3/envs/simpo/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x72a9c65ca8f2 in /home/wiss/liu/anaconda3/envs/simpo/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x72a9c65d1333 in /home/wiss/liu/anaconda3/envs/simpo/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x72a9c65d371c in /home/wiss/liu/anaconda3/envs/simpo/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdbbf4 (0x72aa14c84bf4 in /home/wiss/liu/anaconda3/envs/simpo/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x9ca94 (0x72aa1589ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x129c3c (0x72aa15929c3c in /lib/x86_64-linux-gnu/libc.so.6)
W0923 02:13:30.841000 123608268355072 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 3293540 closing signal SIGTERM
E0923 02:13:30.906000 123608268355072 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -6) local_rank: 0 (pid: 3293539) of binary: /home/wiss/liu/anaconda3/envs/simpo/bin/python
Warning: The cache directory for DeepSpeed Triton autotune, /home/wiss/liu/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path.
Traceback (most recent call last):
File "/home/wiss/liu/anaconda3/envs/simpo/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/wiss/liu/anaconda3/envs/simpo/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/home/wiss/liu/anaconda3/envs/simpo/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1159, in launch_command
deepspeed_launcher(args)
File "/home/wiss/liu/anaconda3/envs/simpo/lib/python3.10/site-packages/accelerate/commands/launch.py", line 852, in deepspeed_launcher
distrib_run.run(args)
File "/home/wiss/liu/anaconda3/envs/simpo/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/home/wiss/liu/anaconda3/envs/simpo/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/wiss/liu/anaconda3/envs/simpo/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
========================================================
orpo.py FAILED
--------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-09-23_02:13:30
host : worker-6
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 3293539)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 3293539
========================================================
I went through some related issues raised by others, and I did not set device_map='auto'. I think something must be wrong in the code.
Update:
Problem solved following https://github.com/huggingface/accelerate/issues/314#issue-1201142707
System Info
transformers version: 4.44.2

Information
Tasks
examples folder

Reproduction
I am trying to run the DDPO script (https://github.com/huggingface/trl/blob/main/examples/scripts/ddpo.py) on a single Slurm node with 4 GPUs, using the following job script:
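(The job script is not reproduced here. For context, a minimal single-node, 4-GPU Slurm launch of this example might look like the following sketch; the job name, environment name, and resource limits are hypothetical, not taken from the report.)

```shell
#!/bin/bash
# Hypothetical sketch of a single-node, 4-GPU Slurm job for the DDPO example.
#SBATCH --job-name=ddpo
#SBATCH --nodes=1
#SBATCH --gres=gpu:4
#SBATCH --ntasks-per-node=1
#SBATCH --time=24:00:00

# Activate the training environment (name is an assumption).
source activate trl-env

# accelerate spawns one process per GPU on this node.
accelerate launch --multi_gpu --num_processes 4 examples/scripts/ddpo.py
```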
I am getting the following error message across different ranks (pasting from a single rank):
Expected behavior
The example script works with a single GPU.
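When only the multi-GPU run hangs, it can help to turn on NCCL's own logging before relaunching. These are standard NCCL/PyTorch environment variables, not settings taken from this report:

```shell
# Diagnostic settings for a hanging multi-GPU run (sketch).
export NCCL_DEBUG=INFO             # per-rank NCCL init and collective logs
export TORCH_NCCL_BLOCKING_WAIT=1  # surface collective errors promptly
                                   # (older torch versions use NCCL_BLOCKING_WAIT)
# then relaunch, e.g.: accelerate launch examples/scripts/ddpo.py ...
```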