pritamdamania87 opened 1 year ago
cc: @kwen2501
@pritamdamania87 What are your torch and torchvision versions?
@monajalal Using the nightly PyTorch builds via `pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu120`
cc: @wconstab
We hit the same problem here, with the NGC pytorch:23.07-py3 container environment.
Can anyone confirm whether the stuck threads that finally throw are throwing the 'cuda driver shutdown' error? If they are, this may already be fixed by https://github.com/pytorch/pytorch/pull/106503, or could be fixed by the same exception handling trick in another place. Or is this something else?
While the stacktrace is shown here, the exception type is not.
> Can anyone confirm whether the stuck threads that finally throw are throwing the 'cuda driver shutdown' error? If they are, this may already be fixed by https://github.com/pytorch/pytorch/pull/106503, or could be fixed by the same exception handling trick in another place. Or is this something else?
The stuck threads never throw an exception, the training process just gets stuck forever.
@kwen2501 IIUC you commented elsewhere that this issue was known to be caused by NCCL API changes: an API that previously would not block started to block, but our design was not updated to reflect this change, which leads to hangs.
Can you confirm/elaborate any more details?
And can we brainstorm some solutions?
@pritamdamania87 does this issue go away after your PR to release the GIL in cuda ops?
I'm also wondering about setting up a CI test that kills one rank and ensures the rest exit cleanly.
> @pritamdamania87 does this issue go away after your PR to release the GIL in cuda ops?
No, it doesn't; that is a separate issue. This issue exists even without any thread holding the GIL.
> I'm also wondering about setting up a CI test that kills one rank and ensures the rest exit cleanly.
This might be tricky since the behavior is a bit non-deterministic especially with a smaller set of ranks. So it could end up becoming a flaky test.
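For context, such a test might look roughly like the sketch below. This is only an illustration of the idea, not an actual PyTorch CI test; the script name, the rank that gets killed, the timeout value, and the `find_worker_pid` helper are all assumptions.

```python
# Hypothetical sketch of a CI test: launch N ranks, kill one mid-training,
# and assert the surviving ranks exit (cleanly or with an error) instead of hanging.
import os
import signal
import subprocess
import sys
import time

NUM_RANKS = 4          # assumption: small world size for CI
KILL_RANK = 1          # assumption: which rank to kill
EXIT_DEADLINE_S = 120  # assumption: how long the remaining ranks may take to exit


def find_worker_pid(rank):
    # Hypothetical helper: in a real test the training script would write
    # its PID per rank to a file that we read back here.
    raise NotImplementedError


def main():
    # "train_allreduce_loop.py" is a hypothetical script running an allreduce loop.
    proc = subprocess.Popen(
        [sys.executable, "-m", "torch.distributed.run",
         f"--nproc-per-node={NUM_RANKS}", "train_allreduce_loop.py"],
        start_new_session=True,
    )
    time.sleep(30)  # let training reach a steady state of collectives

    # Kill one worker process abruptly, simulating a dying rank.
    os.kill(find_worker_pid(KILL_RANK), signal.SIGKILL)

    # The launcher should tear down the remaining ranks within the deadline.
    try:
        proc.wait(timeout=EXIT_DEADLINE_S)
    except subprocess.TimeoutExpired:
        os.killpg(proc.pid, signal.SIGKILL)
        raise AssertionError("Surviving ranks hung after one rank was killed")


if __name__ == "__main__":
    main()
```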
@pritamdamania87 @zdevito I think this issue would be fixed by landing https://github.com/pytorch/pytorch/pull/122732. If you have a way to repro, let us know or please help us confirm.
🐛 Describe the bug
There is a regression in NCCL error handling after https://github.com/pytorch/pytorch/pull/97066 on PyTorch master. If one rank of a training job is killed or dies, some of the remaining ranks get stuck with this traceback in the workCleanupLoop:

[traceback omitted]

Basically the `workCleanupLoop` is stuck on `cudaGetLastError`, which is blocked since NCCL kernels are locking up the GPU. As a result, `workCleanupLoop` never gets a chance to `abort` the communicators and recover. This is essentially the potential issue reported in https://github.com/pytorch/pytorch/pull/97066#discussion_r1164692531. I was able to reliably reproduce the issue on a job with 24 ranks by manually killing one rank; after this, 9 ranks did not exit and were stuck with the traceback above. As mentioned in https://github.com/pytorch/pytorch/pull/97066#discussion_r1164692531, we should probably add a separate lightweight thread whose sole responsibility is to abort NCCL kernels that run into errors.
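For anyone trying to reproduce, a minimal sketch of this kind of job is below. The script contents, world size, and the choice of which rank dies are illustrative assumptions, not the exact commands from my run (I killed a process externally with SIGKILL; `os._exit` is the in-script analogue).

```python
# repro_sketch.py -- illustrative only: an allreduce loop where one rank
# exits abruptly mid-training, which makes the NCCL kernels on the other
# ranks block and (with this regression) leaves workCleanupLoop stuck.
import os

import torch
import torch.distributed as dist


def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    tensor = torch.ones(1024, 1024, device="cuda")

    for step in range(10_000):
        dist.all_reduce(tensor)
        # Simulate one rank dying mid-job.
        if rank == 0 and step == 100:
            os._exit(1)

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launched with something like `torchrun --nproc-per-node=8 repro_sketch.py` (or across multiple nodes to reach 24 ranks), the surviving ranks should abort their communicators and exit; with the regression, they hang instead.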
Versions
PyTorch: master
CUDA: 12.0
NCCL: 2.17.1
cc @mrshenli @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu