tatsu-lab / stanford_alpaca

Code and documentation to train Stanford's Alpaca models, and generate the data.
https://crfm.stanford.edu/2023/03/13/alpaca.html
Apache License 2.0

Signal 7 error while finetuning with deepspeed #228

Open · udhavsethi opened this issue 1 year ago

udhavsethi commented 1 year ago

I am trying to run the finetuning script on 8x 32GB V100 GPUs. I launch with torchrun and use DeepSpeed with both parameter and optimizer offload, with a few minor modifications to the command:

torchrun --nproc_per_node=8 --master_port=3030 train.py \
    --model_name_or_path <your_path_to_hf_converted_llama_ckpt_and_tokenizer> \
    --data_path ./alpaca_data.json \
    --fp16 True \
    --output_dir output \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --deepspeed "./configs/default_opt_param.json"

I am running into the following errors:

Traceback (most recent call last):
  File "/root/chat-llm/stanford_alpaca/train.py", line 222, in <module>
    train()
  File "/root/chat-llm/stanford_alpaca/train.py", line 186, in train
    model = transformers.LlamaForCausalLM.from_pretrained(
  File "/root/chat-llm/stanford_alpaca/venv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2498, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
  File "/root/chat-llm/stanford_alpaca/venv/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 382, in wrapper
    f(module, *args, **kwargs)
  File "/root/chat-llm/stanford_alpaca/venv/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 659, in __init__
    self.model = LlamaModel(config)
  File "/root/chat-llm/stanford_alpaca/venv/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 382, in wrapper
    f(module, *args, **kwargs)
  File "/root/chat-llm/stanford_alpaca/venv/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 463, in __init__
    self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
  File "/root/chat-llm/stanford_alpaca/venv/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 389, in wrapper
    self._post_init_method(module)
  File "/root/chat-llm/stanford_alpaca/venv/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 782, in _post_init_method
    dist.broadcast(param, 0, self.ds_process_group)
  File "/root/chat-llm/stanford_alpaca/venv/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 120, in log_wrapper
    return func(*args, **kwargs)
  File "/root/chat-llm/stanford_alpaca/venv/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 217, in broadcast
    return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
  File "/root/chat-llm/stanford_alpaca/venv/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 81, in broadcast
    return torch.distributed.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
  File "/root/chat-llm/stanford_alpaca/venv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1436, in wrapper
    return func(*args, **kwargs)
  File "/root/chat-llm/stanford_alpaca/venv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1551, in broadcast
    work = default_pg.broadcast([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1275, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Net : Call to recv from 10.233.121.250<45143> failed : Connection reset by peer
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 36604 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 36605 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 0 (pid: 36601) of binary: /root/chat-llm/stanford_alpaca/venv/bin/python3.10
Traceback (most recent call last):
  File "/root/chat-llm/stanford_alpaca/venv/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/chat-llm/stanford_alpaca/venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/root/chat-llm/stanford_alpaca/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/root/chat-llm/stanford_alpaca/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/root/chat-llm/stanford_alpaca/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/chat-llm/stanford_alpaca/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=====================================================
train.py FAILED
-----------------------------------------------------
Failures:
[1]:
  time      : 2023-04-18_15:47:13
  host      : usethi-fullnode-alpaca-finetune-fml5b
  rank      : 1 (local_rank: 1)
  exitcode  : -7 (pid: 36602)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 36602
[2]:
  time      : 2023-04-18_15:47:13
  host      : usethi-fullnode-alpaca-finetune-fml5b
  rank      : 2 (local_rank: 2)
  exitcode  : -7 (pid: 36603)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 36603
[3]:
  time      : 2023-04-18_15:47:13
  host      : usethi-fullnode-alpaca-finetune-fml5b
  rank      : 5 (local_rank: 5)
  exitcode  : -7 (pid: 36606)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 36606
[4]:
  time      : 2023-04-18_15:47:13
  host      : usethi-fullnode-alpaca-finetune-fml5b
  rank      : 6 (local_rank: 6)
  exitcode  : -7 (pid: 36607)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 36607
[5]:
  time      : 2023-04-18_15:47:13
  host      : usethi-fullnode-alpaca-finetune-fml5b
  rank      : 7 (local_rank: 7)
  exitcode  : -7 (pid: 36608)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 36608
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-04-18_15:47:13
  host      : usethi-fullnode-alpaca-finetune-fml5b
  rank      : 0 (local_rank: 0)
  exitcode  : -7 (pid: 36601)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 36601
=====================================================

Here is my nvcc version:

$nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_May__3_19:15:13_PDT_2021
Cuda compilation tools, release 11.3, V11.3.109
Build cuda_11.3.r11.3/compiler.29920130_0

and nccl version:

$python -c "import torch;print(torch.cuda.nccl.version())"
(2, 14, 3)

Please let me know if I can provide any other information to help identify the source of this issue. I would greatly appreciate any help or guidance on how to make this work.

zhihui-shao commented 1 year ago

I am hitting the same problem, except mine fails with ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9)

Yuzz1020 commented 1 year ago

I also encountered this issue with exitcode: -9. Are there any updates on this?

codemaster17611 commented 1 year ago

Same here. My setup: 4x 16GB V100, 128GB CPU RAM. How can this be solved?

qwjaskzxl commented 1 year ago

Same issue here.

jorenwu84 commented 1 year ago

Try a single GPU by setting --nproc_per_node=1, for example:
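This is the same command as in the original post, with only the process count changed:

# identical to the original command except --nproc_per_node=1
torchrun --nproc_per_node=1 --master_port=3030 train.py \
    --model_name_or_path <your_path_to_hf_converted_llama_ckpt_and_tokenizer> \
    --data_path ./alpaca_data.json \
    --fp16 True \
    --output_dir output \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --deepspeed "./configs/default_opt_param.json"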

thusinh1969 commented 1 year ago

Any solution? 2x RTX 3090, same error here.