hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

Multi-GPU LoRA training times out #74

Closed Louis-y-nlp closed 1 year ago

Louis-y-nlp commented 1 year ago

Hello, when training on multiple V100 GPUs I always hit a timeout error; both 4-GPU and 2-GPU runs fail. Single-GPU training does not seem to have this problem, but it is slow: fine-tuning on 50k samples takes about 12 hours.

[E ProcessGroupNCCL.cpp:828] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=80, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1805926 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=80, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805991 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.

Launch script

accelerate launch src/train_sft.py \
    --model_name_or_path ${model} \
    --do_train \
    --dataset my_dataset \
    --prompt_template alpaca \
    --finetuning_type lora --lora_target W_pack \
    --output_dir ${out_model} \
    --overwrite_cache \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --plot_loss \
    --fp16 \
    --auto_find_batch_size true --per_device_train_batch_size 16

default_config.yaml

compute_environment: LOCAL_MACHINE
deepspeed_config:
  deepspeed_config_file: /home/work/data/codes/LLaMA-Efficient-Tuning/deepspeed_config_stage2.yaml
  zero3_init_flag: false
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
hiyouga commented 1 year ago

Are you using nohup?

Louis-y-nlp commented 1 year ago

No, it is not running in the background; I run it directly inside Docker.

hiyouga commented 1 year ago

Try https://github.com/huggingface/accelerate/issues/223

Louis-y-nlp commented 1 year ago

Simply increasing the timeout does not seem to solve the problem. After some testing, it looks like training hangs at the logging step; the other ranks appear to deadlock while waiting for rank 0 to compute the loss, so for now I have set logging_steps to 1e9. The run log also looks strange, with multiple progress bars. With logging_steps set to 20, the progress bars look like this:

  0%|▏                                                                                                                                                                        | 1/1170 [00:18<6:02:19, 18.60s/it]06/25/2023 07:35:18 - INFO - torch.nn.parallel.distributed - Reducer buckets have been rebuilt in this iteration.
06/25/2023 07:35:18 - INFO - torch.nn.parallel.distributed - Reducer buckets have been rebuilt in this iteration.
  0%|▎                                                                                                                                                                        | 2/1170 [00:34<5:36:33, 17.29s/it][INFO|trainer.py:1779] 2023-06-25 07:35:55,332 >> ***** Running training *****
[INFO|trainer.py:1780] 2023-06-25 07:35:55,333 >>   Num examples = 50,000
[INFO|trainer.py:1781] 2023-06-25 07:35:55,333 >>   Num Epochs = 3
[INFO|trainer.py:1782] 2023-06-25 07:35:55,333 >>   Instantaneous batch size per device = 16
[INFO|trainer.py:1783] 2023-06-25 07:35:55,333 >>   Total train batch size (w. parallel, distributed & accumulation) = 128
[INFO|trainer.py:1784] 2023-06-25 07:35:55,334 >>   Gradient Accumulation steps = 4
[INFO|trainer.py:1785] 2023-06-25 07:35:55,334 >>   Total optimization steps = 2,343
[INFO|trainer.py:1786] 2023-06-25 07:35:55,336 >>   Number of trainable parameters = 4,194,304
  0%|▎                                                                                                                                                                        | 2/1170 [00:54<8:55:09, 27.49s/it]
  0%|                                                                                                                                                                                   | 0/2343 [00:00<?, ?it/s]
  1%|█▍                                                                                                                                                                      | 20/2343 [05:24<7:37:56, 11.83s/it]

Meanwhile, GPU utilization stays at 100% the whole time.

hiyouga commented 1 year ago

Try turning off DeepSpeed and using a plain accelerate config.

Louis-y-nlp commented 1 year ago

Still stuck at the logging step. The config YAML is as follows:

compute_environment: LOCAL_MACHINE
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_offload_params: true
  fsdp_sharding_strategy: 1
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_transformer_layer_cls_to_wrap: ''
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Also, yesterday after setting logging_steps to infinity, it got stuck at a save step right after successfully saving a checkpoint, and after 7200 s (the timeout I specified) it threw the same error.

RuntimeError: NCCL communicator was aborted on rank 0.  Original reason for failure was: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3056, OpType=ALLGATHER, Timeout(ms)=7200000) ran for 7207694 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
hiyouga commented 1 year ago

Try this config:

compute_environment: LOCAL_MACHINE                                                                                                    
distributed_type: MULTI_GPU                                                                                                           
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: <your number of GPUs>
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
Louis-y-nlp commented 1 year ago

It still gets stuck at the logging step

[INFO|trainer.py:1779] 2023-06-26 02:34:58,141 >> ***** Running training *****
[INFO|trainer.py:1780] 2023-06-26 02:34:58,142 >>   Num examples = 50,000
[INFO|trainer.py:1781] 2023-06-26 02:34:58,142 >>   Num Epochs = 3
[INFO|trainer.py:1782] 2023-06-26 02:34:58,142 >>   Instantaneous batch size per device = 16
[INFO|trainer.py:1783] 2023-06-26 02:34:58,142 >>   Total train batch size (w. parallel, distributed & accumulation) = 128
[INFO|trainer.py:1784] 2023-06-26 02:34:58,142 >>   Gradient Accumulation steps = 4
[INFO|trainer.py:1785] 2023-06-26 02:34:58,142 >>   Total optimization steps = 2,343
[INFO|trainer.py:1786] 2023-06-26 02:34:58,144 >>   Number of trainable parameters = 4,194,304
  0%|▎                                                                                                                                                                        | 2/1170 [00:54<8:54:55, 27.48s/it]
  0%|                                                                                                                                                                                   | 0/2343 [00:00<?, ?it/s[E ProcessGroupNCCL.cpp:828] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=80, OpType=ALLGATHER, Timeout(ms)=7200000) ran for 7205324 milliseconds before timing out.11:15:08, 17.35s/it]
f07b9fe29941:61323:61360 [1] NCCL INFO [Service thread] Connection closed by localRank 1
f07b9fe29941:61323:61344 [0] NCCL INFO comm 0x4724c640 rank 1 nranks 2 cudaDev 1 busId d0 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=80, OpType=BROADCAST, Timeout(ms)=7200000) ran for 7206742 milliseconds before timing out.
f07b9fe29941:61322:61361 [0] NCCL INFO [Service thread] Connection closed by localRank 0
f07b9fe29941:61322:61341 [0] NCCL INFO comm 0x48215190 rank 0 nranks 2 cudaDev 0 busId c0 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
[04:37:26] ERROR    failed (exitcode: -6) local_rank: 0 (pid: 61322) of binary: /root/anaconda3/envs/dolly/bin/python                                                                                  api.py:672
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /root/anaconda3/envs/dolly/bin/accelerate:8 in <module>                                          │
│                                                                                                  │
│   5 from accelerate.commands.accelerate_cli import main                                          │
│   6 if __name__ == '__main__':                                                                   │
│   7 │   sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])                         │
│ ❱ 8 │   sys.exit(main())                                                                         │
│   9                                                                                              │
│                                                                                                  │
│ /root/anaconda3/envs/dolly/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py:45  │
│ in main                                                                                          │
│                                                                                                  │
│   42 │   │   exit(1)                                                                             │
│   43 │                                                                                           │
│   44 │   # Run                                                                                   │
│ ❱ 45 │   args.func(args)                                                                         │
│   46                                                                                             │
│   47                                                                                             │
│   48 if __name__ == "__main__":                                                                  │
│                                                                                                  │
│ /root/anaconda3/envs/dolly/lib/python3.9/site-packages/accelerate/commands/launch.py:928 in      │
│ launch_command                                                                                   │
│                                                                                                  │
│   925 │   │   args.deepspeed_fields_from_accelerate_config = ",".join(args.deepspeed_fields_fr   │
│   926 │   │   deepspeed_launcher(args)                                                           │
│   927 │   elif args.use_fsdp and not args.cpu:                                                   │
│ ❱ 928 │   │   multi_gpu_launcher(args)                                                           │
│   929 │   elif args.use_megatron_lm and not args.cpu:                                            │
│   930 │   │   multi_gpu_launcher(args)                                                           │
│   931 │   elif args.multi_gpu and not args.cpu:                                                  │
│                                                                                                  │
│ /root/anaconda3/envs/dolly/lib/python3.9/site-packages/accelerate/commands/launch.py:627 in      │
│ multi_gpu_launcher                                                                               │
│                                                                                                  │
│   624 │   )                                                                                      │
│   625 │   with patch_environment(**current_env):                                                 │
│   626 │   │   try:                                                                               │
│ ❱ 627 │   │   │   distrib_run.run(args)                                                          │
│   628 │   │   except Exception:                                                                  │
│   629 │   │   │   if is_rich_available() and debug:                                              │
│   630 │   │   │   │   console = get_console()                                                    │
│                                                                                                  │
│ /root/anaconda3/envs/dolly/lib/python3.9/site-packages/torch/distributed/run.py:785 in run       │
│                                                                                                  │
│   782 │   │   )                                                                                  │
│   783 │                                                                                          │
│   784 │   config, cmd, cmd_args = config_from_args(args)                                         │
│ ❱ 785 │   elastic_launch(                                                                        │
│   786 │   │   config=config,                                                                     │
│   787 │   │   entrypoint=cmd,                                                                    │
│   788 │   )(*cmd_args)                                                                           │
│                                                                                                  │
│ /root/anaconda3/envs/dolly/lib/python3.9/site-packages/torch/distributed/launcher/api.py:134 in  │
│ __call__                                                                                         │
│                                                                                                  │
│   131 │   │   self._entrypoint = entrypoint                                                      │
│   132 │                                                                                          │
│   133 │   def __call__(self, *args):                                                             │
│ ❱ 134 │   │   return launch_agent(self._config, self._entrypoint, list(args))                    │
│   135                                                                                            │
│   136                                                                                            │
│   137 def _get_entrypoint_name(                                                                  │
│                                                                                                  │
│ /root/anaconda3/envs/dolly/lib/python3.9/site-packages/torch/distributed/launcher/api.py:250 in  │
│ launch_agent                                                                                     │
│                                                                                                  │
│   247 │   │   │   # if the error files for the failed children exist                             │
│   248 │   │   │   # @record will copy the first error (root cause)                               │
│   249 │   │   │   # to the error file of the launcher process.                                   │
│ ❱ 250 │   │   │   raise ChildFailedError(                                                        │
│   251 │   │   │   │   name=entrypoint_name,                                                      │
│   252 │   │   │   │   failures=result.failures,                                                  │
│   253 │   │   │   )                                                                              │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ChildFailedError: 
======================================================
src/train_sft.py FAILED
------------------------------------------------------
Failures:
[1]:
  time      : 2023-06-26_04:37:26
  host      : f07b9fe29941
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 61323)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 61323
------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-06-26_04:37:26
  host      : f07b9fe29941
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 61322)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 61322
======================================================
shaonianyr commented 1 year ago

Turn off NCCL synchronization.

Louis-y-nlp commented 1 year ago

After adding NCCL_P2P_DISABLE=1 it crashes at the very first step @shaonianyr

wuxiuxiunlp commented 1 year ago

@Louis-y-nlp Did you get multi-GPU fine-tuning to work?

Louis-y-nlp commented 1 year ago

No, it keeps hanging inside Docker.

GitYCC commented 1 year ago

Simply increasing the timeout does not seem to solve the problem. After some testing, it looks like training hangs at the logging step; the other ranks appear to deadlock while waiting for rank 0 to compute the loss, so for now I have set logging_steps to 1e9. The run log also looks strange, with multiple progress bars. With logging_steps set to 20, the progress bars look like this:

  0%|▏                                                                                                                                                                        | 1/1170 [00:18<6:02:19, 18.60s/it]06/25/2023 07:35:18 - INFO - torch.nn.parallel.distributed - Reducer buckets have been rebuilt in this iteration.
06/25/2023 07:35:18 - INFO - torch.nn.parallel.distributed - Reducer buckets have been rebuilt in this iteration.
  0%|▎                                                                                                                                                                        | 2/1170 [00:34<5:36:33, 17.29s/it][INFO|trainer.py:1779] 2023-06-25 07:35:55,332 >> ***** Running training *****
[INFO|trainer.py:1780] 2023-06-25 07:35:55,333 >>   Num examples = 50,000
[INFO|trainer.py:1781] 2023-06-25 07:35:55,333 >>   Num Epochs = 3
[INFO|trainer.py:1782] 2023-06-25 07:35:55,333 >>   Instantaneous batch size per device = 16
[INFO|trainer.py:1783] 2023-06-25 07:35:55,333 >>   Total train batch size (w. parallel, distributed & accumulation) = 128
[INFO|trainer.py:1784] 2023-06-25 07:35:55,334 >>   Gradient Accumulation steps = 4
[INFO|trainer.py:1785] 2023-06-25 07:35:55,334 >>   Total optimization steps = 2,343
[INFO|trainer.py:1786] 2023-06-25 07:35:55,336 >>   Number of trainable parameters = 4,194,304
  0%|▎                                                                                                                                                                        | 2/1170 [00:54<8:55:09, 27.49s/it]
  0%|                                                                                                                                                                                   | 0/2343 [00:00<?, ?it/s]
  1%|█▍                                                                                                                                                                      | 20/2343 [05:24<7:37:56, 11.83s/it]

Meanwhile, GPU utilization stays at 100% the whole time.

@Louis-y-nlp How do you set the timeout value with accelerate launch?

Louis-y-nlp commented 1 year ago

@GitYCC

torch.distributed.init_process_group(backend="nccl", timeout=datetime.timedelta(seconds=xxx))
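
For context, when launching through accelerate one place this timeout can go is the Accelerator setup, which forwards it to torch.distributed.init_process_group. Below is a minimal sketch, assuming accelerate's InitProcessGroupKwargs handler and a placeholder 7200-second value (newer LLaMA-Factory versions expose the same knob as --ddp_timeout); it is an illustration, not the project's own code:

import datetime

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# Raise the NCCL collective timeout (default 30 minutes) so that long but
# finite phases, e.g. rank 0 preprocessing data or saving a checkpoint,
# do not trip the watchdog on the waiting ranks. 7200 s is a placeholder.
ipg_kwargs = InitProcessGroupKwargs(timeout=datetime.timedelta(seconds=7200))

# The handler is applied when the Accelerator initializes the process group.
accelerator = Accelerator(kwargs_handlers=[ipg_kwargs])

Note that a larger timeout only buys time; if a rank is genuinely deadlocked (as in the logging-step hang above), the job will still fail, just later.
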
thugbobby commented 1 year ago

@Louis-y-nlp Did you manage to get multi-GPU fine-tuning working?

Louis-y-nlp commented 1 year ago

No, multi-GPU keeps hanging. The main problem is that there is no error message at all, so I don't know how to debug it; only single-GPU works.

thugbobby commented 1 year ago

only single-GPU works

Have you found any other solution? I've tried several and none of them worked.

Louis-y-nlp commented 1 year ago

I pulled the latest code and it runs now.

Louis-y-nlp commented 1 year ago

Impressively fast, you're basically online 24 hours a day.

TianRuiHe commented 10 months ago

Are you using nohup?

Hi, I ran into the same problem. I used nohup to keep training running in the background; what could be causing this? Specifically, I used nohup to run a DeepSpeed training job in the background, and after roughly 1000+ steps it reported "Connection closed by localRank -1" and then stopped.

homiec commented 9 months ago

Are you using nohup?

I'd like to ask: does using nohup always cause this problem?

etoilestar commented 9 months ago

init_process_group

Where should this be added?

yawzhe commented 8 months ago

torch.distributed.init_process_group(backend="nccl", timeout=datetime.timedelta(seconds=xxx))

Where does this go? I set --ddp_time; the dataset loads fine the first time, but during the run data_tokenizer has to be loaded twice, and the second load fails.

JerryDaHeLian commented 8 months ago

Small datasets are fine, but large datasets hit the timeout. It is most likely stuck at the "tokenizer on dataset" step; if so, setting --preprocessing_num_workers 128 solves it.
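
For reference, in typical Hugging Face training scripts --preprocessing_num_workers ends up as the num_proc argument of datasets.map, which is where the "Running tokenizer on dataset" phase spends its time on large corpora before the first training step. A rough sketch of the equivalent call follows; the model path, dataset file, and tokenize_fn are hypothetical placeholders:

import multiprocessing

from datasets import load_dataset
from transformers import AutoTokenizer

# Placeholder paths for illustration only.
tokenizer = AutoTokenizer.from_pretrained("/path/to/model", trust_remote_code=True)
dataset = load_dataset("json", data_files="/path/to/my_dataset.json", split="train")

def tokenize_fn(examples):
    # Hypothetical preprocessing: tokenize the raw text field.
    return tokenizer(examples["text"], truncation=True, max_length=2048)

# num_proc parallelizes the map; with a single worker, a large dataset can keep
# rank 0 busy long enough for the other ranks to hit the NCCL timeout.
tokenized = dataset.map(
    tokenize_fn,
    batched=True,
    num_proc=min(128, multiprocessing.cpu_count()),
)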

CaiJichang212 commented 1 week ago

I also ran into the NCCL Timeout problem. For qwen2-vl-7b, fine-tuning with LoRA alone runs normally. Below is my command:

export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1
# Select GPU IDs
export CUDA_VISIBLE_DEVICES=0,1
# Single-node multi-GPU training
export FORCE_TORCHRUN=1
# If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  
# See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
# Activate the llamafactory virtual environment
source activate torch2py311cu12lmft # shell starts in the base environment

cur_date=$(date +%Y-%m-%d)
mkdir -p /root/autodl-fs/log/$cur_date

# Select the model
# model=qwen2_vl_2b
model=qwen2_vl_7b
# Select the fine-tuning method
# method=full
method=lora
# Task stage: supervised instruction fine-tuning
task=sft

# Default values
pdbs=1 # per device batch size
gas=5 # gradient accumulation steps
bs=$((gas*pdbs*2)) # total batch size
steps=10 # logging steps
epoch=10 # number of epochs (default 3; +2, +2 = 7)
lr=5e-5 # learning rate
max_grad_norm=1.0 # gradient clipping threshold; 1.0 is the default for most open-source LLMs
cutoff_len=2048 # cutoff length; trying 2048 to see whether it errors

cnt=3
lr=9e-5 
gas=8
bs=$((gas*pdbs*2))
echo ${cnt}-${model}-${method}-${task}-lr${lr}-bs${bs}-epoch${epoch}-cutoff${cutoff_len}-grad_norm${max_grad_norm}
llamafactory-cli train \
    --freeze_vision_tower false \
    --max_grad_norm $max_grad_norm \
    --output_dir /root/autodl-fs/saves/${model}/${method}-${task}/lr${lr}-bs${bs}-epoch${epoch}-cutoff${cutoff_len}-grad_norm${max_grad_norm} \
    --logging_steps $steps \
    --save_strategy epoch \
    --per_device_train_batch_size $pdbs \
    --gradient_accumulation_steps $gas \
    --learning_rate $lr \
    --num_train_epochs $epoch \
    --model_name_or_path /root/autodl-fs/huggingface/Qwen2-VL-7B-Instruct \
    --stage $task \
    --do_train true \
    --finetuning_type $method \
    --lora_target all \
    --dataset mire_train_check \
    --template qwen2_vl \
    --cutoff_len $cutoff_len \
    --max_samples 1000 \
    --overwrite_cache true \
    --preprocessing_num_workers 16 \
    --plot_loss true \
    --overwrite_output_dir true \
    --lr_scheduler_type cosine \
    --warmup_ratio 0.1 \
    --bf16 true \
    --ddp_timeout 180000000 \
    --flash_attn fa2 \
    --enable_liger_kernel true \
    --deepspeed examples/deepspeed/ds_z2_config.json \
    --eval_dataset mire_train_check \
    --per_device_eval_batch_size $((pdbs*2)) \
    --eval_strategy epoch \
    > /root/autodl-fs/log/${cur_date}/${cnt}-${model}-${method}-${task}-lr${lr}-bs${bs}-epoch${epoch}-cutoff${cutoff_len}-grad_norm${max_grad_norm}.log 2>&1 && /usr/bin/shutdown

Error output: it hangs at 13/620 {'loss': 0.8541, 'grad_norm': 6.580702781677246, 'learning_rate': 1.4516129032258065e-05, 'epoch': 0.16}

  2%|█▋ | 10/620 [00:59<58:46, 5.78s/it]  2%|█▊ | 11/620 [01:05<59:23, 5.85s/it]  2%|█▉ | 12/620 [01:11<58:11, 5.74s/it]  2%|██▏ | 13/620 [01:17<58:26, 5.78s/it]
[rank1]:[E1123 16:02:10.083090635 ProcessGroupNCCL.cpp:607] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1517, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600033 milliseconds before timing out.
[rank1]:[E1123 16:02:10.083642868 ProcessGroupNCCL.cpp:1664] [PG 1 Rank 1] Exception (either an error or timeout) detected by watchdog at work: 1517, last enqueued NCCL work: 1517, last completed NCCL work: 1516.
[rank0]:[E1123 16:02:10.114598162 ProcessGroupNCCL.cpp:607] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1516, OpType=ALLREDUCE, NumelIn=25427968, NumelOut=25427968, Timeout(ms)=600000) ran for 600086 milliseconds before timing out.
[rank0]:[E1123 16:02:10.115071453 ProcessGroupNCCL.cpp:1664] [PG 1 Rank 0] Exception (either an error or timeout) detected by watchdog at work: 1516, last enqueued NCCL work: 1516, last completed NCCL work: 1515.
[rank1]:[E1123 16:02:11.535545146 ProcessGroupNCCL.cpp:1709] [PG 1 Rank 1] Timeout at NCCL work: 1517, last enqueued NCCL work: 1517, last completed NCCL work: 1516.
[rank1]:[E1123 16:02:11.535815819 ProcessGroupNCCL.cpp:621] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E1123 16:02:11.535957791 ProcessGroupNCCL.cpp:627] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E1123 16:02:11.538140975 ProcessGroupNCCL.cpp:1515] [PG 1 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1517, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600033 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f5af1f77f86 in /root/miniconda3/envs/torch2py311cu12lmft/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f5aa3f5f8d2 in /root/miniconda3/envs/torch2py311cu12lmft/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f5aa3f66313 in /root/miniconda3/envs/torch2py311cu12lmft/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f5aa3f686fc in /root/miniconda3/envs/torch2py311cu12lmft/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdbbf4 (0x7f5af16c7bf4 in /root/miniconda3/envs/torch2py311cu12lmft/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x94ac3 (0x7f5af2dddac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7f5af2e6ebf4 in /usr/lib/x86_64-linux-gnu/libc.so.6)

W1123 16:02:11.882000 139878429300544 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 1351 closing signal SIGTERM
E1123 16:02:12.398000 139878429300544 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -6) local_rank: 1 (pid: 1352) of binary: /root/miniconda3/envs/torch2py311cu12lmft/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/torch2py311cu12lmft/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/root/miniconda3/envs/torch2py311cu12lmft/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/torch2py311cu12lmft/lib/python3.11/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/root/miniconda3/envs/torch2py311cu12lmft/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/root/miniconda3/envs/torch2py311cu12lmft/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/torch2py311cu12lmft/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/root/LLaMA-Factory/src/llamafactory/launcher.py FAILED

Failures:

-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-11-23_16:02:11
  host      : autodl-container-b33448a49f-90764891
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 1352)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 1352
=======================================================
CaiJichang212 commented 1 week ago

Increasing --ddp_timeout 360000000 also has no effect.