pritamdamania87 opened 1 year ago
cc: @kwen2501
@pritamdamania87 What are your torch and torchvision versions?
@monajalal Using the nightly PyTorch builds via `pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu120`
cc: @wconstab
We hit the same problem here, with the NGC pytorch:23.07-py3 container environment.
Can anyone confirm whether the stuck threads that finally throw are throwing the 'cuda driver shutdown' error? If they are, this may already be fixed by https://github.com/pytorch/pytorch/pull/106503, or could be fixed by the same exception handling trick in another place. Or is this something else?
While the stacktrace is shown here, the exception type is not.
> Can anyone confirm whether the stuck threads that finally throw are throwing the 'cuda driver shutdown' error? If they are, this may already be fixed by https://github.com/pytorch/pytorch/pull/106503, or could be fixed by the same exception handling trick in another place. Or is this something else?
The stuck threads never throw an exception, the training process just gets stuck forever.
@kwen2501 IIUC you commented elsewhere that this issue was known to be caused by NCCL API changes: an API that previously would not block started to block, but our design was not updated to reflect this change, which leads to hangs.
Can you confirm/elaborate any more details?
And can we brainstorm some solutions?
@pritamdamania87 does this issue go away after your PR to release the GIL in cuda ops?
I'm also wondering about setting up a CI test that kills one rank and ensures the rest exit cleanly.
> @pritamdamania87 does this issue go away after your PR to release the GIL in cuda ops?
No, it doesn't; that is a separate issue. This issue exists even without any thread holding the GIL.
> I'm also wondering about setting up a CI test that kills one rank and ensures the rest exit cleanly.
This might be tricky since the behavior is a bit non-deterministic especially with a smaller set of ranks. So it could end up becoming a flaky test.
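For context, such a test might look roughly like the sketch below. This is only an illustration of the idea, not an actual PyTorch CI test; the script name, the rank that gets killed, the timeout value, and the `find_worker_pid` helper are all assumptions.

```python
# Hypothetical sketch of a CI test: launch N ranks, kill one mid-training,
# and assert the surviving ranks exit (cleanly or with an error) instead of hanging.
import os
import signal
import subprocess
import sys
import time

NUM_RANKS = 4          # assumption: small world size for CI
KILL_RANK = 1          # assumption: which rank to kill
EXIT_DEADLINE_S = 120  # assumption: how long the remaining ranks may take to exit


def find_worker_pid(rank):
    # Hypothetical helper: in a real test the training script would write
    # its PID per rank to a file that we read back here.
    raise NotImplementedError


def main():
    # "train_allreduce_loop.py" is a hypothetical script running an allreduce loop.
    proc = subprocess.Popen(
        [sys.executable, "-m", "torch.distributed.run",
         f"--nproc-per-node={NUM_RANKS}", "train_allreduce_loop.py"],
        start_new_session=True,
    )
    time.sleep(30)  # let training reach a steady state of collectives

    # Kill one worker process abruptly, simulating a dying rank.
    os.kill(find_worker_pid(KILL_RANK), signal.SIGKILL)

    # The launcher should tear down the remaining ranks within the deadline.
    try:
        proc.wait(timeout=EXIT_DEADLINE_S)
    except subprocess.TimeoutExpired:
        os.killpg(proc.pid, signal.SIGKILL)
        raise AssertionError("Surviving ranks hung after one rank was killed")


if __name__ == "__main__":
    main()
```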
@pritamdamania87 @zdevito I think this issue would be fixed by landing https://github.com/pytorch/pytorch/pull/122732. If you have a way to repro, let us know or please help us confirm.
🐛 Describe the bug
There is a regression in NCCL error handling after https://github.com/pytorch/pytorch/pull/97066 on PyTorch master. If one rank of a training job is killed or dies, some of the remaining ranks get stuck with this traceback in the workCleanupLoop:

[traceback omitted]

Basically the `workCleanupLoop` is stuck on `cudaGetLastError`, which is blocked since NCCL kernels are locking up the GPU. As a result, `workCleanupLoop` never gets a chance to `abort` the communicators and recover. This is essentially the potential issue reported in https://github.com/pytorch/pytorch/pull/97066#discussion_r1164692531. I was able to reliably reproduce the issue on a job with 24 ranks by manually killing one rank; after this, 9 ranks did not exit and were stuck with the traceback above. As mentioned in https://github.com/pytorch/pytorch/pull/97066#discussion_r1164692531, we should probably add a separate lightweight thread whose sole responsibility is to abort NCCL kernels that run into errors.
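For anyone trying to reproduce, a minimal sketch of this kind of job is below. The script contents, world size, and the choice of which rank dies are illustrative assumptions, not the exact commands from my run (I killed a process externally with SIGKILL; `os._exit` is the in-script analogue).

```python
# repro_sketch.py -- illustrative only: an allreduce loop where one rank
# exits abruptly mid-training, which makes the NCCL kernels on the other
# ranks block and (with this regression) leaves workCleanupLoop stuck.
import os

import torch
import torch.distributed as dist


def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    tensor = torch.ones(1024, 1024, device="cuda")

    for step in range(10_000):
        dist.all_reduce(tensor)
        # Simulate one rank dying mid-job.
        if rank == 0 and step == 100:
            os._exit(1)

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launched with something like `torchrun --nproc-per-node=8 repro_sketch.py` (or across multiple nodes to reach 24 ranks), the surviving ranks should abort their communicators and exit; with the regression, they hang instead.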
Versions
PyTorch: master
CUDA: 12.0
NCCL: 2.17.1
cc @mrshenli @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu