pytorch / torchtune

A Native-PyTorch Library for LLM Fine-tuning
https://pytorch.org/torchtune/main/
BSD 3-Clause "New" or "Revised" License

ALLGATHER_BASE timeout error #1165

Closed aknvictor closed 3 weeks ago

aknvictor commented 1 month ago

I keep getting this error when I run with

CUDA_LAUNCH_BLOCKING=1; tune run --nproc_per_node 4 lora_finetune_distributed --config scripts/2B_lora.yaml

Any thoughts on what I might be doing wrong? I'm running the latest version (from GitHub).

1|16|Loss: 2.572175979614258: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16/16 [00:25<00:00,  1.48s/it]
[rank3]:[E ProcessGroupNCCL.cpp:563] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15037, OpType=_ALLGATHER_BASE, NumelIn=131072512, NumelOut=524290048, Timeout(ms)=600000) ran for 600055 milliseconds before timing out.
[rank1]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15037, OpType=_ALLGATHER_BASE, NumelIn=131072512, NumelOut=524290048, Timeout(ms)=600000) ran for 600059 milliseconds before timing out.
[rank2]:[E ProcessGroupNCCL.cpp:563] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15037, OpType=_ALLGATHER_BASE, NumelIn=131072512, NumelOut=524290048, Timeout(ms)=600000) ran for 600086 milliseconds before timing out.
[rank2]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 2] Timeout at NCCL work: 15037, last enqueued NCCL work: 15042, last completed NCCL work: 15036.
[rank2]:[E ProcessGroupNCCL.cpp:577] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E ProcessGroupNCCL.cpp:583] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15037, OpType=_ALLGATHER_BASE, NumelIn=131072512, NumelOut=524290048, Timeout(ms)=600000) ran for 600086 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f71e1c81897 in /miniconda/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f71e2f5ac62 in /miniconda/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f71e2f5fa80 in /miniconda/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f71e2f60dcc in /miniconda/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdbbf4 (0x7f722ea18bf4 in /miniconda/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x81cf (0x7f72304551cf in /lib64/libpthread.so.0)
frame #6: clone + 0x43 (0x7f722f937dd3 in /lib64/libc.so.6)

[rank1]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 1] Timeout at NCCL work: 15037, last enqueued NCCL work: 15042, last completed NCCL work: 15036.
[rank1]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15037, OpType=_ALLGATHER_BASE, NumelIn=131072512, NumelOut=524290048, Timeout(ms)=600000) ran for 600059 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb024756897 in /miniconda/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fb025a2fc62 in /miniconda/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fb025a34a80 in /miniconda/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fb025a35dcc in /miniconda/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdbbf4 (0x7fb0714edbf4 in /miniconda/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x81cf (0x7fb072f2a1cf in /lib64/libpthread.so.0)
frame #6: clone + 0x43 (0x7fb07240cdd3 in /lib64/libc.so.6)

[rank3]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 3] Timeout at NCCL work: 15037, last enqueued NCCL work: 15042, last completed NCCL work: 15036.
[rank3]:[E ProcessGroupNCCL.cpp:577] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank3]:[E ProcessGroupNCCL.cpp:583] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
[rank3]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15037, OpType=_ALLGATHER_BASE, NumelIn=131072512, NumelOut=524290048, Timeout(ms)=600000) ran for 600055 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f654644a897 in /miniconda/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f6547723c62 in /miniconda/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f6547728a80 in /miniconda/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f6547729dcc in /miniconda/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdbbf4 (0x7f65931e1bf4 in /miniconda/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x81cf (0x7f6594c1e1cf in /lib64/libpthread.so.0)
frame #6: clone + 0x43 (0x7f6594100dd3 in /lib64/libc.so.6)
ebsmothers commented 1 month ago

Hi @aknvictor, thanks for creating the issue. With NCCL timeouts it can be hard to pinpoint the cause. It looks like this is occurring at the end of an epoch? Also, I'm curious what's in the scripts/2B_lora.yaml config -- is it just a copied version of gemma/2B_lora.yaml, or have you made any other customizations?

aknvictor commented 1 month ago

Yes, the error occurs at the end of an epoch. And yes, 2B_lora.yaml is a copy of the original gemma/2B_lora.yaml config, with modifications only to the file paths and batch_size.

ebsmothers commented 1 month ago

@aknvictor I'm not sure what type of GPU you're on, but if possible could you try running on a single device instead? Something like tune run lora_finetune_single_device --config scripts/2B_lora_single_device.yaml, where scripts/2B_lora_single_device.yaml is an analogous copy of torchtune's corresponding single-device config gemma/2B_lora_single_device.yaml.
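If it helps, one way to set that up (assuming the stock tune cp utility for copying built-in configs; adjust the destination path as needed):

tune cp gemma/2B_lora_single_device scripts/2B_lora_single_device.yaml
# edit the file paths / batch_size to match your distributed config, then:
tune run lora_finetune_single_device --config scripts/2B_lora_single_device.yaml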

I tried to repro on my end and I see the same error as in #1122, so I'm wondering if that's the underlying cause here and the distributed run is masking the real source of the error.

aknvictor commented 1 month ago

Yes, I did run it on a single device (A100). It works fine after I skipped the erroneous key in the checkpoint save (as a temporary hack):

if key == 'lm_head.weight':
    continue

Admittedly, the bug/issue may be broader than that (especially when the run is distributed).
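For context, the skip sits inside the loop over the model state dict during checkpoint save; a rough sketch of what I mean (the surrounding names are illustrative, not torchtune's actual checkpointer code):

filtered_state_dict = {}
for key, value in model_state_dict.items():
    if key == 'lm_head.weight':  # the key that was triggering the save error
        continue
    filtered_state_dict[key] = value
# filtered_state_dict is then what gets written to disk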

ebsmothers commented 1 month ago

It works fine after I skipped the erroneous key in the checkpoint save (as a temporary hack).

@aknvictor just to clarify, does skipping the key resolve the original timeout error in the distributed case too? Or do you still see the timeout even with that key skipped?

aknvictor commented 1 month ago

The timeout error is still there when the run is distributed.

pbontrager commented 1 month ago

We fixed the Gemma checkpoint issue (Issue #1190). Could you try running your script again without the code below?

if key == 'lm_head.weight':
    continue

aknvictor commented 1 month ago

Yes, I'm still getting the error.

ebsmothers commented 1 month ago

Hi @aknvictor, sorry for the delay here. If you're still seeing the timeout error on distributed runs after pulling the latest main, would you be able to (a) provide more details about your environment (pip list, what hardware you're running on), and (b) help pinpoint where exactly the hang is occurring? I'm assuming it's somewhere in checkpoint save, maybe when gathering parameters from the different GPUs, but I'm not sure. (One hacky way to narrow down (b) is to add a call to torch.distributed.barrier() and then raise an error immediately afterwards; you can then bisect where in the code the hang is occurring based on whether or not you hit that error.)
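Concretely, the bisection hack could look something like this, dropped after whichever line in the recipe you suspect (a minimal sketch, not prescriptive about where it goes):

import torch.distributed as dist

dist.barrier()  # if some rank never reaches this point, the other ranks hang here instead
raise RuntimeError("all ranks reached this point")  # hitting this error means the hang happens later in the code

If you hit the RuntimeError, move those two lines later in the recipe; if the job hangs at the barrier instead, the problem is earlier. Repeating this a few times should narrow down which call never completes on some rank.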

aknvictor commented 3 weeks ago

The issue has been resolved in the latest main. Thanks!