meta-llama / llama-recipes

Scripts for fine-tuning Meta Llama3 with composable FSDP & PEFT methods to cover single/multi-node GPUs. Supports default & custom datasets for applications such as summarization and Q&A. Supporting a number of candid inference solutions such as HF TGI, VLLM for local or cloud deployment. Demo apps to showcase Meta Llama3 for WhatsApp & Messenger.

Multi-GPU training fails under collective operation timeout #567

Closed BaiqingL closed 1 month ago

BaiqingL commented 3 months ago

System Info

ml.g5.12xlarge instance from AWS with PyTorch 2.3.1, 4x A10G GPUs, CUDA 12.1

The dataset is modified: I pre-tokenized everything ahead of time so that no paid GPU-instance time is spent on tokenization. It is available at https://huggingface.co/datasets/BaiqingL/pokemon-rag-llama-3-tokenized

The tokenizer has been modified in the following way:

    from transformers import AutoTokenizer

    # Load the tokenizer and add special tokens
    LLM_ACTION = "LLM_ACTION"
    MOVE_CHOSEN = "MOVE_CHOSEN"
    SWITCH_PKMN = "SWITCH_PKMN"
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B", token="oh no")
    tokenizer.add_special_tokens(
        {"additional_special_tokens": [SWITCH_PKMN, MOVE_CHOSEN, LLM_ACTION]}
    )
    tokenizer.pad_token = tokenizer.eos_token

The rest of the training script resizes the model's token embeddings to account for the added special tokens; a sketch of that step is shown below.
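
For reference, here is a minimal sketch of that resize step (an illustration only, assuming the model is loaded with `AutoModelForCausalLM`; the exact code in the training script may differ):

    from transformers import AutoModelForCausalLM

    # Sketch: after adding special tokens, grow the embedding matrix so its
    # row count matches the enlarged tokenizer vocabulary.
    model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
    if len(tokenizer) > model.get_input_embeddings().weight.shape[0]:
        model.resize_token_embeddings(len(tokenizer))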

The dataset loading has been modified as follows (training and validation splits):

    from datasets import load_dataset

    ds = load_dataset(
        "BaiqingL/pokemon-rag-llama-3-tokenized",
        cache_dir="/home/ec2-user/SageMaker/cache",
        split="train[:1%]",
    ).train_test_split(test_size=500)
    # Load and preprocess the dataset for training and validation
    dataset_train = ds["train"]
    dataset_val = ds["test"]

Information

🐛 Describe the bug

After the final step of training, presumably during model saving, the process crashes, wasting all of the time spent training. Command executed:

    torchrun --nnodes 1 --nproc_per_node 4 finetuning.py --num_workers_dataloader 12 --enable_fsdp --model_name meta-llama/Meta-Llama-3-8B --use_peft --batch_size_training 2 --context_length 2048 --num-epochs 1 --peft_method lora --save_metrics --output_dir /home/ec2-user/SageMaker/output

Error logs

[rank2]:[E ProcessGroupNCCL.cpp:563] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1474236, OpType=_ALLGATHER_BASE, NumelIn=262675456, NumelOut=1050701824, Timeout(ms)=600000) ran for 600068 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:563] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1474236, OpType=_ALLGATHER_BASE, NumelIn=262675456, NumelOut=1050701824, Timeout(ms)=600000) ran for 600078 milliseconds before timing out.
[rank1]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1474236, OpType=_ALLGATHER_BASE, NumelIn=262675456, NumelOut=1050701824, Timeout(ms)=600000) ran for 600096 milliseconds before timing out.
[rank3]:[E ProcessGroupNCCL.cpp:563] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1474236, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600038 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 0] Timeout at NCCL work: 1474236, last enqueued NCCL work: 1474236, last completed NCCL work: 1474235.
[rank0]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1474236, OpType=_ALLGATHER_BASE, NumelIn=262675456, NumelOut=1050701824, Timeout(ms)=600000) ran for 600078 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f367500b897 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f36762e4c62 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f36762e9a80 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f36762eadcc in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd3e95 (0x7f36c1d71e95 in /home/ec2-user/anaconda3/envs/pytorch_p310/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x744b (0x7f36cadd944b in /lib64/libpthread.so.0)
frame #6: clone + 0x3f (0x7f36ca3cd52f in /lib64/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG 0 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1474236, OpType=_ALLGATHER_BASE, NumelIn=262675456, NumelOut=1050701824, Timeout(ms)=600000) ran for 600078 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f367500b897 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f36762e4c62 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f36762e9a80 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f36762eadcc in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd3e95 (0x7f36c1d71e95 in /home/ec2-user/anaconda3/envs/pytorch_p310/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x744b (0x7f36cadd944b in /lib64/libpthread.so.0)
frame #6: clone + 0x3f (0x7f36ca3cd52f in /lib64/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f367500b897 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe32119 (0x7f3675f6e119 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd3e95 (0x7f36c1d71e95 in /home/ec2-user/anaconda3/envs/pytorch_p310/bin/../lib/libstdc++.so.6)
frame #3: <unknown function> + 0x744b (0x7f36cadd944b in /lib64/libpthread.so.0)
frame #4: clone + 0x3f (0x7f36ca3cd52f in /lib64/libc.so.6)

[rank2]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 2] Timeout at NCCL work: 1474236, last enqueued NCCL work: 1474236, last completed NCCL work: 1474235.
[rank2]:[E ProcessGroupNCCL.cpp:577] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E ProcessGroupNCCL.cpp:583] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1474236, OpType=_ALLGATHER_BASE, NumelIn=262675456, NumelOut=1050701824, Timeout(ms)=600000) ran for 600068 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5402892897 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f5403b6bc62 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f5403b70a80 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f5403b71dcc in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd3e95 (0x7f544f5f8e95 in /home/ec2-user/anaconda3/envs/pytorch_p310/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x744b (0x7f545866044b in /lib64/libpthread.so.0)
frame #6: clone + 0x3f (0x7f5457c5452f in /lib64/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG 0 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1474236, OpType=_ALLGATHER_BASE, NumelIn=262675456, NumelOut=1050701824, Timeout(ms)=600000) ran for 600068 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5402892897 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f5403b6bc62 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f5403b70a80 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f5403b71dcc in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd3e95 (0x7f544f5f8e95 in /home/ec2-user/anaconda3/envs/pytorch_p310/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x744b (0x7f545866044b in /lib64/libpthread.so.0)
frame #6: clone + 0x3f (0x7f5457c5452f in /lib64/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5402892897 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe32119 (0x7f54037f5119 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd3e95 (0x7f544f5f8e95 in /home/ec2-user/anaconda3/envs/pytorch_p310/bin/../lib/libstdc++.so.6)
frame #3: <unknown function> + 0x744b (0x7f545866044b in /lib64/libpthread.so.0)
frame #4: clone + 0x3f (0x7f5457c5452f in /lib64/libc.so.6)

[rank1]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 1] Timeout at NCCL work: 1474236, last enqueued NCCL work: 1474236, last completed NCCL work: 1474235.
[rank1]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1474236, OpType=_ALLGATHER_BASE, NumelIn=262675456, NumelOut=1050701824, Timeout(ms)=600000) ran for 600096 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fbacdf3c897 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fbacf215c62 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fbacf21aa80 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fbacf21bdcc in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd3e95 (0x7fbb1aca2e95 in /home/ec2-user/anaconda3/envs/pytorch_p310/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x744b (0x7fbb23d0a44b in /lib64/libpthread.so.0)
frame #6: clone + 0x3f (0x7fbb232fe52f in /lib64/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG 0 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1474236, OpType=_ALLGATHER_BASE, NumelIn=262675456, NumelOut=1050701824, Timeout(ms)=600000) ran for 600096 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fbacdf3c897 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fbacf215c62 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fbacf21aa80 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fbacf21bdcc in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd3e95 (0x7fbb1aca2e95 in /home/ec2-user/anaconda3/envs/pytorch_p310/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x744b (0x7fbb23d0a44b in /lib64/libpthread.so.0)
frame #6: clone + 0x3f (0x7fbb232fe52f in /lib64/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fbacdf3c897 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe32119 (0x7fbacee9f119 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd3e95 (0x7fbb1aca2e95 in /home/ec2-user/anaconda3/envs/pytorch_p310/bin/../lib/libstdc++.so.6)
frame #3: <unknown function> + 0x744b (0x7fbb23d0a44b in /lib64/libpthread.so.0)
frame #4: clone + 0x3f (0x7fbb232fe52f in /lib64/libc.so.6)

[rank3]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 3] Timeout at NCCL work: 1474236, last enqueued NCCL work: 1474236, last completed NCCL work: 1474235.
[rank3]:[E ProcessGroupNCCL.cpp:577] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank3]:[E ProcessGroupNCCL.cpp:583] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
[rank3]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1474236, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600038 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7ffa89345897 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7ffa8a61ec62 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7ffa8a623a80 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7ffa8a624dcc in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd3e95 (0x7ffad60abe95 in /home/ec2-user/anaconda3/envs/pytorch_p310/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x744b (0x7ffadf11944b in /lib64/libpthread.so.0)
frame #6: clone + 0x3f (0x7ffade70d52f in /lib64/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG 0 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1474236, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600038 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7ffa89345897 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7ffa8a61ec62 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7ffa8a623a80 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7ffa8a624dcc in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd3e95 (0x7ffad60abe95 in /home/ec2-user/anaconda3/envs/pytorch_p310/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x744b (0x7ffadf11944b in /lib64/libpthread.so.0)
frame #6: clone + 0x3f (0x7ffade70d52f in /lib64/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7ffa89345897 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe32119 (0x7ffa8a2a8119 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd3e95 (0x7ffad60abe95 in /home/ec2-user/anaconda3/envs/pytorch_p310/bin/../lib/libstdc++.so.6)
frame #3: <unknown function> + 0x744b (0x7ffadf11944b in /lib64/libpthread.so.0)
frame #4: clone + 0x3f (0x7ffade70d52f in /lib64/libc.so.6)

E0617 01:16:30.106000 139731967973184 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 0 (pid: 40911) of binary: /home/ec2-user/anaconda3/envs/pytorch_p310/bin/python3.10
Traceback (most recent call last):
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
======================================================
finetuning_2.py FAILED
------------------------------------------------------
Failures:
[1]:
  time      : 2024-06-17_01:16:30
  host      : ip-172-16-17-88.ec2.internal
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 40912)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 40912
[2]:
  time      : 2024-06-17_01:16:30
  host      : ip-172-16-17-88.ec2.internal
  rank      : 2 (local_rank: 2)
  exitcode  : -6 (pid: 40913)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 40913
[3]:
  time      : 2024-06-17_01:16:30
  host      : ip-172-16-17-88.ec2.internal
  rank      : 3 (local_rank: 3)
  exitcode  : -6 (pid: 40914)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 40914
------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-06-17_01:16:30
  host      : ip-172-16-17-88.ec2.internal
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 40911)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 40911
======================================================

Expected behavior

The model should be saved after training completes.

wukaixingxp commented 3 months ago

Hi! From your log I cannot see the root cause of the NCCL timeout. Is this error reproducible? A worker may have been killed, or the NCCL connection may have been disrupted somehow. We can first check your NCCL config; here are some ways to verify that it is correct:

1. Run the official NCCL all_reduce_perf test.
2. Try the Hugging Face multi-GPU debug script (a minimal sanity-check sketch in the same spirit is shown below).
3. If both tests pass, export NCCL_DEBUG=INFO and rerun the distributed training using our official example to see whether the NCCL communication logs show any errors or warnings; you can paste the NCCL info back here for me to double check.
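
For illustration, here is a minimal torch.distributed sanity check in the spirit of the Hugging Face debug script (a rough sketch, not the official all_reduce_perf benchmark; the file name is arbitrary):

    # minimal_nccl_check.py -- launch with:
    #   torchrun --nnodes 1 --nproc_per_node 4 minimal_nccl_check.py
    import os
    import torch
    import torch.distributed as dist

    def main():
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)
        rank, world_size = dist.get_rank(), dist.get_world_size()

        # Each rank contributes its rank id; a healthy NCCL setup returns the
        # sum 0 + 1 + ... + (world_size - 1) on every GPU without hanging.
        x = torch.full((1,), float(rank), device="cuda")
        dist.all_reduce(x, op=dist.ReduceOp.SUM)
        print(f"rank {rank}/{world_size}: all_reduce sum = {x.item()}")

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

If this hangs or times out as well, the problem is likely in the NCCL or driver setup rather than in the fine-tuning script.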

wukaixingxp commented 3 months ago

If you believe your NCCL config is correct, then I suggest you use a small dataset and use py-spy record or py-spy dump on the training process to capture the PyTorch main thread's call stack and see which function was the last to run before the crash.