hiyouga / LLaMA-Factory


NCCL watchdog thread terminated with exception #2814

Closed CXLiang123 closed 3 months ago

CXLiang123 commented 3 months ago

Reminder

Reproduction

```bash
CUDA_VISIBLE_DEVICES="0,1,2,3"
export NCCL_P2P_LEVEL=NVL

base_model="/data/models/qwen/Qwen1.5-72B-Chat-GPTQ-Int8"
lora_checkpoint="/data/cxl/saves/Qwen1.5-72B-Chat-GPTQ-Int8/lora/qw72B-1"
output_dir="/data/cxl/saves/Qwen1.5-14B-Chat/merged_lora5_all3"
temp_dir="/data/cxl/saves/Qwen1.5-72B-Chat-GPTQ-Int8/lora/qw72B-1"

deepspeed --include localhost:0,1,2,3 \
    src/train_bash.py \
    --deepspeed fulltune_zero2.json \
    --stage sft \
    --do_train True \
    --model_name_or_path $base_model \
    --finetuning_type lora \
    --template qwen \
    --dataset_dir data \
    --dataset V4_train \
    --cutoff_len 1024 \
    --learning_rate 5e-5 \
    --num_train_epochs 5 \
    --max_samples 100000 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 6 \
    --lr_scheduler_type cosine \
    --adam_beta2 0.95 \
    --warmup_ratio 0.01 \
    --max_grad_norm 1.0 \
    --logging_steps 1 \
    --save_steps 1000 \
    --lora_rank 8 \
    --lora_dropout 0.1 \
    --lora_target q_proj,k_proj,v_proj,gate_proj \
    --output_dir $lora_checkpoint \
    --overwrite_output_dir \
    --bf16 True \
    --gradient_checkpointing \
    --plot_loss True
```

Expected behavior

I read the related issues and found that the earlier reports of this error occurred during the data-processing stage, whereas in my case training had already started when the error appeared; the relevant parameters are shown above. The hardware is four 80 GB A100 GPUs. DeepSpeed ZeRO-3 does not support the quantized (GPTQ) model, which is why I am using ZeRO-2.
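A note for reproducibility: the `fulltune_zero2.json` referenced in the command is not attached to this issue. Purely as an assumption-laden sketch, a minimal ZeRO-2 config consistent with the launch flags above could look like the following, where the `auto` values are filled in by the Hugging Face Trainer from the CLI arguments; the author's actual file may differ.

```bash
# Hypothetical reconstruction of fulltune_zero2.json -- the real file is not
# shown in this issue. "auto" defers each value to the HF Trainer arguments.
cat > fulltune_zero2.json <<'EOF'
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true
  }
}
EOF
```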

System Info

(System info was attached as a screenshot and is not transcribed here.)

```
{'loss': 5.5168, 'learning_rate': 5.882352941176471e-06, 'epoch': 0.01}
  0%|          | 2/1665 [00:57<13:10:21, 28.52s/it]
[E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=825, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800771 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=825, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800771 milliseconds before timing out.
```

Others

No response

hiyouga commented 3 months ago

`--ddp_timeout 1800000000`
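For context: `--ddp_timeout` is the standard Hugging Face `TrainingArguments` field, interpreted in seconds (default 1800 s, which is exactly the 30-minute `Timeout(ms)=1800000` in the log above), so a value this large effectively disables the watchdog. Assuming the launch script from the reproduction, the flag is simply appended to the argument list, e.g.:

```bash
# Raise the torch.distributed collective timeout (seconds; HF default 1800).
# Same deepspeed launch as in the reproduction script; most flags are
# omitted here for brevity and should be kept unchanged.
deepspeed --include localhost:0,1,2,3 src/train_bash.py \
    --deepspeed fulltune_zero2.json \
    --stage sft \
    --do_train True \
    --model_name_or_path "$base_model" \
    --output_dir "$lora_checkpoint" \
    --ddp_timeout 1800000000
```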

JerryDaHeLian commented 3 months ago

Small datasets are fine, but large datasets hit the timeout; it is very likely stuck at the "Running tokenizer on dataset" step. If so, setting `--preprocessing_num_workers 128` solves it.
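For illustration, and assuming the launch script from the reproduction above: as I understand it, LLaMA-Factory forwards `--preprocessing_num_workers` to `datasets.Dataset.map(num_proc=...)`, so tokenization is split across many processes and can finish before the 30-minute collective timeout fires. Applied to the same command (abbreviated):

```bash
# Parallelize the "Running tokenizer on dataset" step across 128 processes.
# Most flags are omitted for brevity; keep the rest of the reproduction
# script's arguments unchanged.
deepspeed --include localhost:0,1,2,3 src/train_bash.py \
    --deepspeed fulltune_zero2.json \
    --stage sft \
    --do_train True \
    --model_name_or_path "$base_model" \
    --dataset_dir data \
    --dataset V4_train \
    --output_dir "$lora_checkpoint" \
    --preprocessing_num_workers 128
```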