Closed: Louis-y-nlp closed this issue 1 year ago.
Did you use nohup?
No, I didn't run it in the background; I ran it directly inside Docker.
Simply increasing the timeout doesn't seem to solve the problem. After some testing, it is stuck at the logging step; presumably the other ranks hang while waiting for rank 0 to compute the loss. For now I have set logging_steps to 1e9. The training log also looks strange: there are multiple progress bars. With logging_steps set to 20, the progress bars look like this:
0%|▏ | 1/1170 [00:18<6:02:19, 18.60s/it]06/25/2023 07:35:18 - INFO - torch.nn.parallel.distributed - Reducer buckets have been rebuilt in this iteration.
06/25/2023 07:35:18 - INFO - torch.nn.parallel.distributed - Reducer buckets have been rebuilt in this iteration.
0%|▎ | 2/1170 [00:34<5:36:33, 17.29s/it][INFO|trainer.py:1779] 2023-06-25 07:35:55,332 >> ***** Running training *****
[INFO|trainer.py:1780] 2023-06-25 07:35:55,333 >> Num examples = 50,000
[INFO|trainer.py:1781] 2023-06-25 07:35:55,333 >> Num Epochs = 3
[INFO|trainer.py:1782] 2023-06-25 07:35:55,333 >> Instantaneous batch size per device = 16
[INFO|trainer.py:1783] 2023-06-25 07:35:55,333 >> Total train batch size (w. parallel, distributed & accumulation) = 128
[INFO|trainer.py:1784] 2023-06-25 07:35:55,334 >> Gradient Accumulation steps = 4
[INFO|trainer.py:1785] 2023-06-25 07:35:55,334 >> Total optimization steps = 2,343
[INFO|trainer.py:1786] 2023-06-25 07:35:55,336 >> Number of trainable parameters = 4,194,304
0%|▎ | 2/1170 [00:54<8:55:09, 27.49s/it]
0%| | 0/2343 [00:00<?, ?it/s]
1%|█▍ | 20/2343 [05:24<7:37:56, 11.83s/it]
Meanwhile, GPU utilization stays at 100% the whole time.
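For reference, both workarounds mentioned above (a huge logging_steps and a longer DDP timeout) can be expressed through Hugging Face TrainingArguments rather than by patching init_process_group. A minimal sketch; the values simply mirror the run above and are not a recommendation:

```python
from transformers import TrainingArguments

# Sketch only: the values mirror the run in the log above.
args = TrainingArguments(
    output_dir="output",
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,
    logging_steps=int(1e9),  # effectively disables logging, as in the workaround above
    ddp_timeout=7200,        # seconds; forwarded to torch.distributed.init_process_group
)
```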
Try turning off DeepSpeed and using a plain accelerate config.
Still stuck at the logging step. The config YAML is as follows:
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_offload_params: true
  fsdp_sharding_strategy: 1
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_transformer_layer_cls_to_wrap: ''
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
Also, yesterday, after setting the logging step to infinity, it hung at a save step: one checkpoint was saved successfully and then training froze, and after 7200 s (the configured timeout) it reported the same error.
RuntimeError: NCCL communicator was aborted on rank 0. Original reason for failure was: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3056, OpType=ALLGATHER, Timeout(ms)=7200000) ran for 7207694 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
Try this config:
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: <your GPU count>
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
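As a quick sanity check (not part of the original scripts in this thread), one can print the distributed state from inside the training script to confirm which backend and process count accelerate launch actually picked up from the config file. A minimal sketch using Accelerate's PartialState:

```python
from accelerate import PartialState

# Printed once per process when launched with `accelerate launch`.
state = PartialState()
print(f"rank {state.process_index}/{state.num_processes} on {state.device}, "
      f"distributed_type={state.distributed_type}")
```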
It still hangs at the logging step.
[INFO|trainer.py:1779] 2023-06-26 02:34:58,141 >> ***** Running training *****
[INFO|trainer.py:1780] 2023-06-26 02:34:58,142 >> Num examples = 50,000
[INFO|trainer.py:1781] 2023-06-26 02:34:58,142 >> Num Epochs = 3
[INFO|trainer.py:1782] 2023-06-26 02:34:58,142 >> Instantaneous batch size per device = 16
[INFO|trainer.py:1783] 2023-06-26 02:34:58,142 >> Total train batch size (w. parallel, distributed & accumulation) = 128
[INFO|trainer.py:1784] 2023-06-26 02:34:58,142 >> Gradient Accumulation steps = 4
[INFO|trainer.py:1785] 2023-06-26 02:34:58,142 >> Total optimization steps = 2,343
[INFO|trainer.py:1786] 2023-06-26 02:34:58,144 >> Number of trainable parameters = 4,194,304
0%|▎ | 2/1170 [00:54<8:54:55, 27.48s/it]
0%| | 0/2343 [00:00<?, ?it/s[E ProcessGroupNCCL.cpp:828] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=80, OpType=ALLGATHER, Timeout(ms)=7200000) ran for 7205324 milliseconds before timing out.11:15:08, 17.35s/it]
f07b9fe29941:61323:61360 [1] NCCL INFO [Service thread] Connection closed by localRank 1
f07b9fe29941:61323:61344 [0] NCCL INFO comm 0x4724c640 rank 1 nranks 2 cudaDev 1 busId d0 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=80, OpType=BROADCAST, Timeout(ms)=7200000) ran for 7206742 milliseconds before timing out.
f07b9fe29941:61322:61361 [0] NCCL INFO [Service thread] Connection closed by localRank 0
f07b9fe29941:61322:61341 [0] NCCL INFO comm 0x48215190 rank 0 nranks 2 cudaDev 0 busId c0 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
[04:37:26] ERROR failed (exitcode: -6) local_rank: 0 (pid: 61322) of binary: /root/anaconda3/envs/dolly/bin/python api.py:672
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /root/anaconda3/envs/dolly/bin/accelerate:8 in <module> │
│ │
│ 5 from accelerate.commands.accelerate_cli import main │
│ 6 if __name__ == '__main__': │
│ 7 │ sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0]) │
│ ❱ 8 │ sys.exit(main()) │
│ 9 │
│ │
│ /root/anaconda3/envs/dolly/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py:45 │
│ in main │
│ │
│ 42 │ │ exit(1) │
│ 43 │ │
│ 44 │ # Run │
│ ❱ 45 │ args.func(args) │
│ 46 │
│ 47 │
│ 48 if __name__ == "__main__": │
│ │
│ /root/anaconda3/envs/dolly/lib/python3.9/site-packages/accelerate/commands/launch.py:928 in │
│ launch_command │
│ │
│ 925 │ │ args.deepspeed_fields_from_accelerate_config = ",".join(args.deepspeed_fields_fr │
│ 926 │ │ deepspeed_launcher(args) │
│ 927 │ elif args.use_fsdp and not args.cpu: │
│ ❱ 928 │ │ multi_gpu_launcher(args) │
│ 929 │ elif args.use_megatron_lm and not args.cpu: │
│ 930 │ │ multi_gpu_launcher(args) │
│ 931 │ elif args.multi_gpu and not args.cpu: │
│ │
│ /root/anaconda3/envs/dolly/lib/python3.9/site-packages/accelerate/commands/launch.py:627 in │
│ multi_gpu_launcher │
│ │
│ 624 │ ) │
│ 625 │ with patch_environment(**current_env): │
│ 626 │ │ try: │
│ ❱ 627 │ │ │ distrib_run.run(args) │
│ 628 │ │ except Exception: │
│ 629 │ │ │ if is_rich_available() and debug: │
│ 630 │ │ │ │ console = get_console() │
│ │
│ /root/anaconda3/envs/dolly/lib/python3.9/site-packages/torch/distributed/run.py:785 in run │
│ │
│ 782 │ │ ) │
│ 783 │ │
│ 784 │ config, cmd, cmd_args = config_from_args(args) │
│ ❱ 785 │ elastic_launch( │
│ 786 │ │ config=config, │
│ 787 │ │ entrypoint=cmd, │
│ 788 │ )(*cmd_args) │
│ │
│ /root/anaconda3/envs/dolly/lib/python3.9/site-packages/torch/distributed/launcher/api.py:134 in │
│ __call__ │
│ │
│ 131 │ │ self._entrypoint = entrypoint │
│ 132 │ │
│ 133 │ def __call__(self, *args): │
│ ❱ 134 │ │ return launch_agent(self._config, self._entrypoint, list(args)) │
│ 135 │
│ 136 │
│ 137 def _get_entrypoint_name( │
│ │
│ /root/anaconda3/envs/dolly/lib/python3.9/site-packages/torch/distributed/launcher/api.py:250 in │
│ launch_agent │
│ │
│ 247 │ │ │ # if the error files for the failed children exist │
│ 248 │ │ │ # @record will copy the first error (root cause) │
│ 249 │ │ │ # to the error file of the launcher process. │
│ ❱ 250 │ │ │ raise ChildFailedError( │
│ 251 │ │ │ │ name=entrypoint_name, │
│ 252 │ │ │ │ failures=result.failures, │
│ 253 │ │ │ ) │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ChildFailedError:
======================================================
src/train_sft.py FAILED
------------------------------------------------------
Failures:
[1]:
time : 2023-06-26_04:37:26
host : f07b9fe29941
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 61323)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 61323
------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-06-26_04:37:26
host : f07b9fe29941
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 61322)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 61322
======================================================
Try turning off NCCL synchronization.
After adding NCCL_P2P_DISABLE=1, it dies at the very first step. @shaonianyr
@Louis-y-nlp Did you get multi-GPU fine-tuning to work?
No, it keeps hanging inside Docker.
@Louis-y-nlp How do you set the timeout value with accelerate launch?
@GitYCC
torch.distributed.init_process_group(backend="nccl", timeout=datetime.timedelta(seconds=xxx))
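If the script builds its own Accelerator instead of relying on Trainer, the same timeout can also be passed through Accelerate's InitProcessGroupKwargs. A minimal sketch, assuming you can edit the training script:

```python
from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# Raise the NCCL collective timeout from the 30-minute default to 2 hours.
kwargs = InitProcessGroupKwargs(timeout=timedelta(seconds=7200))
accelerator = Accelerator(kwargs_handlers=[kwargs])
```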
@Louis-y-nlp Did you get your multi-GPU fine-tuning to work?
No. Multi-GPU keeps hanging, and since there is no error message at all I don't know how to debug it; only single-GPU runs.
It runs.
Have you found any other solution? I've tried several and none of them work.
I pulled the latest code and it runs now.
That was fast, you're practically online 24/7.
Did you use nohup?
Hi, I ran into the same problem. I used nohup to keep training running in the background; what could be causing this? Specifically, I used nohup to run a DeepSpeed training job in the background, and after roughly 1000+ steps it reported: Connection closed by localRank -1, and then it stopped.
Did you use nohup?
I'd like to ask: does this problem always happen when nohup is used?
init_process_group
Where should this be added?
torch.distributed.init_process_group(backend="nccl", timeout=datetime.timedelta(seconds=xxx))
Where does this go? I set --ddp_timeout; the dataset can be loaded successfully once, but during the run data_tokenizer has to be loaded twice, and the second time it errors out.
With a small dataset there is no problem; with a large dataset it times out, most likely stuck at the tokenizer-on-dataset step. If so, setting --preprocessing_num_workers 128 solves it.
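For context, --preprocessing_num_workers in the HF-style training scripts typically ends up as the num_proc argument of datasets.map, so parallel tokenization looks roughly like the sketch below (the model name, data file, and column name are placeholders, not from this thread):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your-model")      # placeholder
dataset = load_dataset("json", data_files="train.json")      # placeholder

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

# num_proc is what --preprocessing_num_workers controls; more workers shorten
# the tokenization phase, so the other ranks are less likely to hit the timeout.
tokenized = dataset.map(tokenize, batched=True, num_proc=128)
```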
I also ran into the NCCL timeout problem with qwen2-vl-7b; it runs normally when doing LoRA-only fine-tuning. Below is my command:
export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1
# Specify which GPUs to use
export CUDA_VISIBLE_DEVICES=0,1
# Single-node, multi-GPU training
export FORCE_TORCHRUN=1
# If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
# See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
# Activate the llamafactory virtual environment
source activate torch2py311cu12lmft # from the command-line base environment
cur_date=$(date +%Y-%m-%d)
mkdir -p /root/autodl-fs/log/$cur_date
# Select the model
# model=qwen2_vl_2b
model=qwen2_vl_7b
# Select the fine-tuning method
# method=full
method=lora
# Task stage: supervised instruction fine-tuning
task=sft
# Defaults
pdbs=1 # per device batch size
gas=5 # gradient accumulation steps
bs=$((gas*pdbs*2)) # total batch size
steps=10 # logging steps
epoch=10 # number of epochs (default 3; +2, +2 = 7)
lr=5e-5 # learning rate
max_grad_norm=1.0 # gradient clipping threshold; 1.0 is the default for most open-source LLMs
cutoff_len=2048 # truncation length; trying 2048 to see whether it errors out
cnt=3
lr=9e-5
gas=8
bs=$((gas*pdbs*2))
echo ${cnt}-${model}-${method}-${task}-lr${lr}-bs${bs}-epoch${epoch}-cutoff${cutoff_len}-grad_norm${max_grad_norm}
llamafactory-cli train \
--freeze_vision_tower false \
--max_grad_norm $max_grad_norm \
--output_dir /root/autodl-fs/saves/${model}/${method}-${task}/lr${lr}-bs${bs}-epoch${epoch}-cutoff${cutoff_len}-grad_norm${max_grad_norm} \
--logging_steps $steps \
--save_strategy epoch \
--per_device_train_batch_size $pdbs \
--gradient_accumulation_steps $gas \
--learning_rate $lr \
--num_train_epochs $epoch \
--model_name_or_path /root/autodl-fs/huggingface/Qwen2-VL-7B-Instruct \
--stage $task \
--do_train true \
--finetuning_type $method \
--lora_target all \
--dataset mire_train_check \
--template qwen2_vl \
--cutoff_len $cutoff_len \
--max_samples 1000 \
--overwrite_cache true \
--preprocessing_num_workers 16 \
--plot_loss true \
--overwrite_output_dir true \
--lr_scheduler_type cosine \
--warmup_ratio 0.1 \
--bf16 true \
--ddp_timeout 180000000 \
--flash_attn fa2 \
--enable_liger_kernel true \
--deepspeed examples/deepspeed/ds_z2_config.json \
--eval_dataset mire_train_check \
--per_device_eval_batch_size $((pdbs*2)) \
--eval_strategy epoch \
> /root/autodl-fs/log/${cur_date}/${cnt}-${model}-${method}-${task}-lr${lr}-bs${bs}-epoch${epoch}-cutoff${cutoff_len}-grad_norm${max_grad_norm}.log 2>&1 && /usr/bin/shutdown
Error: training gets stuck at step 13/620. {'loss': 0.8541, 'grad_norm': 6.580702781677246, 'learning_rate': 1.4516129032258065e-05, 'epoch': 0.16}
2%|█▋ | 10/620 [00:59<58:46, 5.78s/it]
2%|█▊ | 11/620 [01:05<59:23, 5.85s/it]
2%|█▉ | 12/620 [01:11<58:11, 5.74s/it]
2%|██▏ | 13/620 [01:17<58:26, 5.78s/it][rank1]:[E1123 16:02:10.083090635 ProcessGroupNCCL.cpp:607] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1517, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600033 milliseconds before timing out.
[rank1]:[E1123 16:02:10.083642868 ProcessGroupNCCL.cpp:1664] [PG 1 Rank 1] Exception (either an error or timeout) detected by watchdog at work: 1517, last enqueued NCCL work: 1517, last completed NCCL work: 1516.
[rank0]:[E1123 16:02:10.114598162 ProcessGroupNCCL.cpp:607] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1516, OpType=ALLREDUCE, NumelIn=25427968, NumelOut=25427968, Timeout(ms)=600000) ran for 600086 milliseconds before timing out.
[rank0]:[E1123 16:02:10.115071453 ProcessGroupNCCL.cpp:1664] [PG 1 Rank 0] Exception (either an error or timeout) detected by watchdog at work: 1516, last enqueued NCCL work: 1516, last completed NCCL work: 1515.
[rank1]:[E1123 16:02:11.535545146 ProcessGroupNCCL.cpp:1709] [PG 1 Rank 1] Timeout at NCCL work: 1517, last enqueued NCCL work: 1517, last completed NCCL work: 1516.
[rank1]:[E1123 16:02:11.535815819 ProcessGroupNCCL.cpp:621] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E1123 16:02:11.535957791 ProcessGroupNCCL.cpp:627] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E1123 16:02:11.538140975 ProcessGroupNCCL.cpp:1515] [PG 1 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1517, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600033 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f5af1f77f86 in /root/miniconda3/envs/torch2py311cu12lmft/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f5aa3f5f8d2 in /root/miniconda3/envs/torch2py311cu12lmft/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f5aa3f66313 in /root/miniconda3/envs/torch2py311cu12lmft/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f5aa3f686fc in /root/miniconda3/envs/torch2py311cu12lmft/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4:
Failures:
Increasing it with --ddp_timeout 360000000 doesn't help either.
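When raising --ddp_timeout does not help, it can be worth ruling out the NCCL transport itself before debugging the trainer. A hypothetical standalone smoke test (not from this thread), run with torchrun --nproc_per_node=2:

```python
# nccl_smoke_test.py (hypothetical): torchrun --nproc_per_node=2 nccl_smoke_test.py
import os

import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# A single all_reduce; if this hangs, the problem is NCCL/topology, not the trainer.
x = torch.ones(1, device="cuda") * (dist.get_rank() + 1)
dist.all_reduce(x)
print(f"rank {dist.get_rank()}: all_reduce result = {x.item()}")

dist.destroy_process_group()
```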
Hi, I keep hitting timeout errors when doing multi-GPU training on V100s; both 4-GPU and 2-GPU setups fail. Single-GPU doesn't seem to have this problem but is slow: fine-tuning on 50k samples takes about 12 hours.
Training script:
default_config.yaml