meta-llama / llama-recipes

Scripts for fine-tuning Meta Llama3 with composable FSDP & PEFT methods to cover single/multi-node GPUs. Supports default & custom datasets for applications such as summarization and Q&A. Supporting a number of candid inference solutions such as HF TGI, VLLM for local or cloud deployment. Demo apps to showcase Meta Llama3 for WhatsApp & Messenger.

Multi-GPU training fails under collective operation timeout #567

Closed BaiqingL closed 1 month ago

BaiqingL commented 3 months ago

System Info

ml.g5.12xlarge instance from AWS with PyTorch 2.3.1, 4x A10G GPUs, CUDA 12.1

The dataset is modified: I pre-tokenized everything ahead of time so that no paid GPU-instance time is spent on tokenization. It is available at https://huggingface.co/datasets/BaiqingL/pokemon-rag-llama-3-tokenized

The tokenizer has been modified in the following way:

    from transformers import AutoTokenizer

    # Load the tokenizer and add special tokens
    LLM_ACTION = "LLM_ACTION"
    MOVE_CHOSEN = "MOVE_CHOSEN"
    SWITCH_PKMN = "SWITCH_PKMN"
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B", token="oh no")
    tokenizer.add_special_tokens(
        {"additional_special_tokens": [SWITCH_PKMN, MOVE_CHOSEN, LLM_ACTION]}
    )
    tokenizer.pad_token = tokenizer.eos_token

The rest of the training script resizes the model's token embeddings to account for the added special tokens; a sketch of that step is shown below.
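
For reference, here is a minimal sketch of that resize step (an illustration only, assuming the model is loaded with `AutoModelForCausalLM`; the exact code in the training script may differ):

    from transformers import AutoModelForCausalLM

    # Sketch: after adding special tokens, grow the embedding matrix so its
    # row count matches the enlarged tokenizer vocabulary.
    model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
    if len(tokenizer) > model.get_input_embeddings().weight.shape[0]:
        model.resize_token_embeddings(len(tokenizer))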

The dataset loading has been modified as follows (training and validation splits):

    from datasets import load_dataset

    ds = load_dataset(
        "BaiqingL/pokemon-rag-llama-3-tokenized",
        cache_dir="/home/ec2-user/SageMaker/cache",
        split="train[:1%]",
    ).train_test_split(test_size=500)
    # Load and preprocess the dataset for training and validation
    dataset_train = ds["train"]
    dataset_val = ds["test"]

Information

🐛 Describe the bug

After the final step of training, presumably during model saving, the process crashes, wasting all of the time spent training. Command executed:

    torchrun --nnodes 1 --nproc_per_node 4 finetuning.py --num_workers_dataloader 12 --enable_fsdp --model_name meta-llama/Meta-Llama-3-8B --use_peft --batch_size_training 2 --context_length 2048 --num-epochs 1 --peft_method lora --save_metrics --output_dir /home/ec2-user/SageMaker/output

Error logs

[rank2]:[E ProcessGroupNCCL.cpp:563] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1474236, OpType=_ALLGATHER_BASE, NumelIn=262675456, NumelOut=1050701824, Timeout(ms)=600000) ran for 600068 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:563] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1474236, OpType=_ALLGATHER_BASE, NumelIn=262675456, NumelOut=1050701824, Timeout(ms)=600000) ran for 600078 milliseconds before timing out.
[rank1]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1474236, OpType=_ALLGATHER_BASE, NumelIn=262675456, NumelOut=1050701824, Timeout(ms)=600000) ran for 600096 milliseconds before timing out.
[rank3]:[E ProcessGroupNCCL.cpp:563] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1474236, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600038 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 0] Timeout at NCCL work: 1474236, last enqueued NCCL work: 1474236, last completed NCCL work: 1474235.
[rank0]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1474236, OpType=_ALLGATHER_BASE, NumelIn=262675456, NumelOut=1050701824, Timeout(ms)=600000) ran for 600078 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f367500b897 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f36762e4c62 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f36762e9a80 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f36762eadcc in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd3e95 (0x7f36c1d71e95 in /home/ec2-user/anaconda3/envs/pytorch_p310/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x744b (0x7f36cadd944b in /lib64/libpthread.so.0)
frame #6: clone + 0x3f (0x7f36ca3cd52f in /lib64/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG 0 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1474236, OpType=_ALLGATHER_BASE, NumelIn=262675456, NumelOut=1050701824, Timeout(ms)=600000) ran for 600078 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f367500b897 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f36762e4c62 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f36762e9a80 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f36762eadcc in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd3e95 (0x7f36c1d71e95 in /home/ec2-user/anaconda3/envs/pytorch_p310/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x744b (0x7f36cadd944b in /lib64/libpthread.so.0)
frame #6: clone + 0x3f (0x7f36ca3cd52f in /lib64/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f367500b897 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe32119 (0x7f3675f6e119 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd3e95 (0x7f36c1d71e95 in /home/ec2-user/anaconda3/envs/pytorch_p310/bin/../lib/libstdc++.so.6)
frame #3: <unknown function> + 0x744b (0x7f36cadd944b in /lib64/libpthread.so.0)
frame #4: clone + 0x3f (0x7f36ca3cd52f in /lib64/libc.so.6)

[rank2]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 2] Timeout at NCCL work: 1474236, last enqueued NCCL work: 1474236, last completed NCCL work: 1474235.
[rank2]:[E ProcessGroupNCCL.cpp:577] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E ProcessGroupNCCL.cpp:583] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1474236, OpType=_ALLGATHER_BASE, NumelIn=262675456, NumelOut=1050701824, Timeout(ms)=600000) ran for 600068 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5402892897 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f5403b6bc62 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f5403b70a80 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f5403b71dcc in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd3e95 (0x7f544f5f8e95 in /home/ec2-user/anaconda3/envs/pytorch_p310/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x744b (0x7f545866044b in /lib64/libpthread.so.0)
frame #6: clone + 0x3f (0x7f5457c5452f in /lib64/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG 0 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1474236, OpType=_ALLGATHER_BASE, NumelIn=262675456, NumelOut=1050701824, Timeout(ms)=600000) ran for 600068 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5402892897 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f5403b6bc62 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f5403b70a80 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f5403b71dcc in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd3e95 (0x7f544f5f8e95 in /home/ec2-user/anaconda3/envs/pytorch_p310/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x744b (0x7f545866044b in /lib64/libpthread.so.0)
frame #6: clone + 0x3f (0x7f5457c5452f in /lib64/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5402892897 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe32119 (0x7f54037f5119 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd3e95 (0x7f544f5f8e95 in /home/ec2-user/anaconda3/envs/pytorch_p310/bin/../lib/libstdc++.so.6)
frame #3: <unknown function> + 0x744b (0x7f545866044b in /lib64/libpthread.so.0)
frame #4: clone + 0x3f (0x7f5457c5452f in /lib64/libc.so.6)

[rank1]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 1] Timeout at NCCL work: 1474236, last enqueued NCCL work: 1474236, last completed NCCL work: 1474235.
[rank1]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1474236, OpType=_ALLGATHER_BASE, NumelIn=262675456, NumelOut=1050701824, Timeout(ms)=600000) ran for 600096 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fbacdf3c897 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fbacf215c62 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fbacf21aa80 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fbacf21bdcc in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd3e95 (0x7fbb1aca2e95 in /home/ec2-user/anaconda3/envs/pytorch_p310/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x744b (0x7fbb23d0a44b in /lib64/libpthread.so.0)
frame #6: clone + 0x3f (0x7fbb232fe52f in /lib64/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG 0 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1474236, OpType=_ALLGATHER_BASE, NumelIn=262675456, NumelOut=1050701824, Timeout(ms)=600000) ran for 600096 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fbacdf3c897 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fbacf215c62 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fbacf21aa80 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fbacf21bdcc in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd3e95 (0x7fbb1aca2e95 in /home/ec2-user/anaconda3/envs/pytorch_p310/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x744b (0x7fbb23d0a44b in /lib64/libpthread.so.0)
frame #6: clone + 0x3f (0x7fbb232fe52f in /lib64/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fbacdf3c897 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe32119 (0x7fbacee9f119 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd3e95 (0x7fbb1aca2e95 in /home/ec2-user/anaconda3/envs/pytorch_p310/bin/../lib/libstdc++.so.6)
frame #3: <unknown function> + 0x744b (0x7fbb23d0a44b in /lib64/libpthread.so.0)
frame #4: clone + 0x3f (0x7fbb232fe52f in /lib64/libc.so.6)

[rank3]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 3] Timeout at NCCL work: 1474236, last enqueued NCCL work: 1474236, last completed NCCL work: 1474235.
[rank3]:[E ProcessGroupNCCL.cpp:577] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank3]:[E ProcessGroupNCCL.cpp:583] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
[rank3]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1474236, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600038 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7ffa89345897 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7ffa8a61ec62 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7ffa8a623a80 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7ffa8a624dcc in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd3e95 (0x7ffad60abe95 in /home/ec2-user/anaconda3/envs/pytorch_p310/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x744b (0x7ffadf11944b in /lib64/libpthread.so.0)
frame #6: clone + 0x3f (0x7ffade70d52f in /lib64/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG 0 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1474236, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600038 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7ffa89345897 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7ffa8a61ec62 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7ffa8a623a80 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7ffa8a624dcc in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd3e95 (0x7ffad60abe95 in /home/ec2-user/anaconda3/envs/pytorch_p310/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x744b (0x7ffadf11944b in /lib64/libpthread.so.0)
frame #6: clone + 0x3f (0x7ffade70d52f in /lib64/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7ffa89345897 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe32119 (0x7ffa8a2a8119 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd3e95 (0x7ffad60abe95 in /home/ec2-user/anaconda3/envs/pytorch_p310/bin/../lib/libstdc++.so.6)
frame #3: <unknown function> + 0x744b (0x7ffadf11944b in /lib64/libpthread.so.0)
frame #4: clone + 0x3f (0x7ffade70d52f in /lib64/libc.so.6)

E0617 01:16:30.106000 139731967973184 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 0 (pid: 40911) of binary: /home/ec2-user/anaconda3/envs/pytorch_p310/bin/python3.10
Traceback (most recent call last):
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
======================================================
finetuning_2.py FAILED
------------------------------------------------------
Failures:
[1]:
  time      : 2024-06-17_01:16:30
  host      : ip-172-16-17-88.ec2.internal
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 40912)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 40912
[2]:
  time      : 2024-06-17_01:16:30
  host      : ip-172-16-17-88.ec2.internal
  rank      : 2 (local_rank: 2)
  exitcode  : -6 (pid: 40913)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 40913
[3]:
  time      : 2024-06-17_01:16:30
  host      : ip-172-16-17-88.ec2.internal
  rank      : 3 (local_rank: 3)
  exitcode  : -6 (pid: 40914)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 40914
------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-06-17_01:16:30
  host      : ip-172-16-17-88.ec2.internal
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 40911)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 40911
======================================================

Expected behavior

The model should be saved after training completes.

wukaixingxp commented 3 months ago

Hi! From your log I cannot see the root cause of the NCCL timeout. Is this error reproducible? A worker may have been killed, or the NCCL connection may have been disrupted somehow. We can first check your NCCL config; here are some ways to verify that it is correct:

1. Run the official NCCL all_reduce_perf test.
2. Try the Hugging Face multi-GPU debug script (a minimal sanity-check sketch in the same spirit is shown below).
3. If both tests pass, export NCCL_DEBUG=INFO and rerun the distributed training using our official example to see whether the NCCL communication logs show any errors or warnings; you can paste the NCCL info back here for me to double check.
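
For illustration, here is a minimal torch.distributed sanity check in the spirit of the Hugging Face debug script (a rough sketch, not the official all_reduce_perf benchmark; the file name is arbitrary):

    # minimal_nccl_check.py -- launch with:
    #   torchrun --nnodes 1 --nproc_per_node 4 minimal_nccl_check.py
    import os
    import torch
    import torch.distributed as dist

    def main():
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)
        rank, world_size = dist.get_rank(), dist.get_world_size()

        # Each rank contributes its rank id; a healthy NCCL setup returns the
        # sum 0 + 1 + ... + (world_size - 1) on every GPU without hanging.
        x = torch.full((1,), float(rank), device="cuda")
        dist.all_reduce(x, op=dist.ReduceOp.SUM)
        print(f"rank {rank}/{world_size}: all_reduce sum = {x.item()}")

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

If this hangs or times out as well, the problem is likely in the NCCL or driver setup rather than in the fine-tuning script.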

wukaixingxp commented 3 months ago

If you believe your NCCL config is correct, then I suggest you use a small dataset and use py-spy record or py-spy dump on the training process to capture the PyTorch main thread's call stack and see which function was the last to run before the crash.