Open kevinningthu opened 3 months ago
Solved by setting the timeout to 6000_000 (ms) in distributed.py.
Hi, I have the same bug. Can you explain how you fixed it? I don't understand your solution of setting Timeout to 6000_000 in distributed.py, because I can't find a Timeout variable in distributed.py. Hoping for your reply.
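For what it's worth, a common way to apply this kind of fix is through the `timeout` argument of `torch.distributed.init_process_group`, which ColBERT's `distributed.py` calls when it sets up the process group. The sketch below is an assumption about how that workaround looks, not the exact patch the original poster made; the `init_with_long_timeout` wrapper is a hypothetical name.

```python
from datetime import timedelta

# The log shows the default NCCL watchdog limit: Timeout(ms)=600000 (10 min).
# Raising it to 6_000_000 ms (100 min) gives the long faiss clustering
# step room to finish before the watchdog kills the process group.
LONG_TIMEOUT = timedelta(milliseconds=6_000_000)

def init_with_long_timeout() -> None:
    # Sketch only: in a real multi-GPU launch (e.g. via torchrun), pass the
    # longer timeout where the process group is created. The actual call
    # site inside ColBERT's distributed.py may look different.
    import torch.distributed as dist  # needs torch + NCCL at runtime

    dist.init_process_group(backend="nccl", timeout=LONG_TIMEOUT)

if __name__ == "__main__":
    print(int(LONG_TIMEOUT.total_seconds()))  # timeout in seconds
```

Note that the watchdog timeout only masks the symptom; on very large collections the clustering step genuinely takes longer than 10 minutes, so a bigger timeout is often the right call.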
I used ColBERT to index about 20 GB of data on 4×A800 GPUs, and the following errors were raised:
```
Clustering 111099511 points in 128D to 524288 clusters, redo 1 times, 4 iterations
Preprocessing in 8.98 s
[rank1]:[E ProcessGroupNCCL.cpp:523] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600659 milliseconds before timing out.
[rank1]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E ProcessGroupNCCL.cpp:1182] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600659 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f99d5781d87 in
```