modelscope / ms-swift

Use PEFT or Full-parameter to finetune 400+ LLMs or 100+ MLLMs. (LLM: Qwen2.5, Llama3.2, GLM4, Internlm2.5, Yi1.5, Mistral, Baichuan2, DeepSeek, Gemma2, ...; MLLM: Qwen2-VL, Qwen2-Audio, Llama3.2-Vision, Llava, InternVL2, MiniCPM-V-2.6, GLM4v, Xcomposer2.5, Yi-VL, DeepSeek-VL, Phi3.5-Vision, ...)
https://swift.readthedocs.io/zh-cn/latest/Instruction/index.html
Apache License 2.0
4.16k stars 368 forks source link

Training suddenly fails midway with "NCCL watchdog thread terminated with exception" #1817

Open Wuyingwen opened 2 months ago

Wuyingwen commented 2 months ago

Describe the bug
When fine-tuning the MiniCPM-V-2.6 model with the swift sft command, training suddenly fails partway through with the following error:

Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:916] [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1250, OpType=ALLREDUCE, NumelIn=20280320, NumelOut=20280320, Timeout(ms)=1800000) ran for 1800782 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
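When several ranks log this message, it can help to pull out the rank, sequence number, and timeout so the logs are easy to compare. The following is a minimal sketch, assuming the message has exactly the layout quoted above; `parse_watchdog_timeout` is a hypothetical helper name, and other PyTorch versions may word the line differently.

```python
import re

# Pattern for the watchdog line quoted in the report. The field layout is
# taken from that single message and is an assumption for other versions.
WATCHDOG_RE = re.compile(
    r"\[Rank (?P<rank>\d+)\] Watchdog caught collective operation timeout: "
    r"WorkNCCL\(SeqNum=(?P<seq>\d+), OpType=(?P<op>\w+), "
    r".*?Timeout\(ms\)=(?P<timeout_ms>\d+)\) "
    r"ran for (?P<elapsed_ms>\d+) milliseconds"
)


def parse_watchdog_timeout(line):
    """Return the fields of a watchdog timeout message, or None if no match."""
    m = WATCHDOG_RE.search(line)
    if m is None:
        return None
    return {
        "rank": int(m["rank"]),
        "seq_num": int(m["seq"]),
        "op_type": m["op"],
        # Convert the configured timeout from milliseconds to minutes.
        "timeout_min": int(m["timeout_ms"]) / 60000,
        "elapsed_ms": int(m["elapsed_ms"]),
    }


sample = (
    "[Rank 3] Watchdog caught collective operation timeout: "
    "WorkNCCL(SeqNum=1250, OpType=ALLREDUCE, NumelIn=20280320, "
    "NumelOut=20280320, Timeout(ms)=1800000) ran for 1800782 milliseconds "
    "before timing out."
)
info = parse_watchdog_timeout(sample)
print(info)
```

For the message in this report, the configured timeout of 1800000 ms works out to 30 minutes, and the ALLREDUCE with sequence number 1250 on rank 3 is the one that ran past it.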

image

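The error means that the ranks were waiting for one GPU to finish its computation so that the all_reduce could proceed, but that GPU was stuck (its computation never completed), and the operation eventually timed out. If the data were at fault, though, the bad samples should simply have been skipped at the reading stage. How can this kind of hang, where a GPU gets stuck and never finishes computing, be resolved?

My command:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 NPROC_PER_NODE=8 swift sft \
    --model_type minicpm-v-v2_6-chat \
    --model_id_or_path ../checkpoint/openbmb/MiniCPM-V-2_6 \
    --sft_type lora \
    --dataset xxx.json \
    --save_steps 50 \
    --val_dataset xxx.json \
    --deepspeed default-zero2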

torch version: 2.1.2+cu118. Mid-training:

image
tastelikefeet commented 2 months ago

This is rather strange: how could it block for 30 minutes without getting any data? Run py-spy dump --pid xxx to see where each process is blocked.
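With eight ranks plus their DataLoader workers, running the dump by hand for every process gets tedious. The following is a rough sketch of looping over a list of process IDs; the helper names `py_spy_dump_command` and `dump_all` are made up for illustration, and only the `py-spy dump --pid` invocation itself comes from the suggestion above.

```python
import shutil
import subprocess


def py_spy_dump_command(pid):
    """Build the `py-spy dump` command line for one process ID."""
    return ["py-spy", "dump", "--pid", str(pid)]


def dump_all(pids):
    """Run `py-spy dump` for every PID and return {pid: output text}."""
    if shutil.which("py-spy") is None:
        raise RuntimeError("py-spy is not on PATH (pip install py-spy)")
    dumps = {}
    for pid in pids:
        result = subprocess.run(
            py_spy_dump_command(pid), capture_output=True, text=True
        )
        # Keep stderr as well: py-spy reports attach failures there.
        dumps[pid] = result.stdout + result.stderr
    return dumps
```

Calling `dump_all([...])` with the PIDs of the training processes returns one text dump per process. Depending on the system's ptrace settings, py-spy may need elevated privileges to attach.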

yunkchen commented 1 month ago
SIZE_FACTOR=28 MAX_PIXELS=100352 NFRAMES=24 \
swift sft \
    --model_type qwen2-vl-7b-instruct \
    --model_id_or_path Qwen2-VL-7B-Instruct \
    --sft_type full \
    --freeze_vit false \
    --max_length 2048 \
    --lazy_tokenize true \
    --gradient_accumulation_step 2 \
    --batch_size 1 \
    --num_train_epochs 1 \
    --learning_rate 1e-5 \
    --weight_decay 0.1 \
    --lr_scheduler_type cosine \
    --warmup_ratio 0.05 \
    --save_steps 200 \
    --logging_steps 1 \
    --dataloader_num_workers 8 \
    --dataset qwen2-vl-val.jsonl \
    --dataset_test_ratio 0.005 \
    --output_dir qwen2-vl-7b-20240912 \
    --deepspeed default-zero2

I'm running into the same problem.

yunkchen commented 1 month ago

The py-spy dumps of the processes mainly show two kinds of results:

Process 175053: /usr/local/bin/python -u /usr/local/lib/python3.10/site-packages/swift/cli/sft.py --model_type qwen2-vl-7b-instruct --model_id_or_path Qwen2-VL-7B-Instruct --sft_type full --freeze_vit false --max_length 2048 --lazy_tokenize true --gradient_accumulation_step 2 --batch_size 1 --num_train_epochs 1 --learning_rate 1e-5 --weight_decay 0.1 --lr_scheduler_type cosine --warmup_ratio 0.05 --save_steps 200 --logging_steps 1 --dataloader_num_workers 1 --dataset qwen2-vl-val.jsonl --dataset_test_ratio 0.005 --output_dir qwen2-vl-7b-20240912 --deepspeed default-zero2
Python v3.10.14 (/usr/local/bin/python3.10)

Thread 175053 (active): "MainThread"
    synchronize (torch/cuda/__init__.py:792)
    synchronize (deepspeed/accelerator/cuda_accelerator.py:78)
    independent_gradient_partition_epilogue (deepspeed/runtime/zero/stage_1_and_2.py:764)
    overlapping_partition_gradients_reduce_epilogue (deepspeed/runtime/zero/stage_1_and_2.py:863)
    allreduce_gradients (deepspeed/runtime/engine.py:1912)
    wrapped_fn (deepspeed/utils/nvtx.py:15)
    backward (deepspeed/runtime/engine.py:1993)
    wrapped_fn (deepspeed/utils/nvtx.py:15)
    backward (accelerate/utils/deepspeed.py:166)
    backward (accelerate/accelerator.py:2151)
    training_step (transformers/trainer.py:3452)
    _inner_training_loop (transformers/trainer.py:2326)
    train (transformers/trainer.py:1991)
    train (swift/trainers/mixin.py:426)
    llm_sft (swift/llm/sft.py:413)
    x_main (swift/utils/run_utils.py:32)
    <module> (swift/cli/sft.py:5)
Thread 175383 (idle): "Thread-1"
    wait (threading.py:324)
    wait (threading.py:607)
    run (tqdm/_monitor.py:60)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 175888 (idle): "Thread-2"
    wait (threading.py:324)
    wait (threading.py:607)
    run (tqdm/_monitor.py:60)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 176363 (idle): "Thread-3 (_pin_memory_loop)"
    select (selectors.py:416)
    wait (multiprocessing/connection.py:931)
    _poll (multiprocessing/connection.py:424)
    poll (multiprocessing/connection.py:257)
    get (multiprocessing/queues.py:113)
    do_one_step (torch/utils/data/_utils/pin_memory.py:31)
    _pin_memory_loop (torch/utils/data/_utils/pin_memory.py:54)
    run (threading.py:953)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 176488 (idle): "QueueFeederThread"
    wait (threading.py:320)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:953)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Process 177063: /usr/local/bin/python -u /usr/local/lib/python3.10/site-packages/swift/cli/sft.py --model_type qwen2-vl-7b-instruct --model_id_or_path Qwen2-VL-7B-Instruct --sft_type full --freeze_vit false --max_length 2048 --lazy_tokenize true --gradient_accumulation_step 2 --batch_size 1 --num_train_epochs 1 --learning_rate 1e-5 --weight_decay 0.1 --lr_scheduler_type cosine --warmup_ratio 0.05 --save_steps 200 --logging_steps 1 --dataloader_num_workers 1 --dataset qwen2-vl-val.jsonl --dataset_test_ratio 0.005 --output_dir qwen2-vl-7b-20240912 --deepspeed default-zero2
Python v3.10.14 (/usr/local/bin/python3.10)

Thread 177063 (idle): "MainThread"
    select (selectors.py:416)
    wait (multiprocessing/connection.py:931)
    _poll (multiprocessing/connection.py:424)
    poll (multiprocessing/connection.py:257)
    get (multiprocessing/queues.py:113)
    _worker_loop (torch/utils/data/_utils/worker.py:275)
    run (multiprocessing/process.py:108)
    _bootstrap (multiprocessing/process.py:314)
    _launch (multiprocessing/popen_fork.py:71)
    __init__ (multiprocessing/popen_fork.py:19)
    _Popen (multiprocessing/context.py:281)
    _Popen (multiprocessing/context.py:224)
    start (multiprocessing/process.py:121)
    __init__ (torch/utils/data/dataloader.py:1040)
    _get_iterator (torch/utils/data/dataloader.py:387)
    __iter__ (torch/utils/data/dataloader.py:439)
    __iter__ (accelerate/data_loader.py:451)
    _inner_training_loop (transformers/trainer.py:2284)
    train (transformers/trainer.py:1991)
    train (swift/trainers/mixin.py:426)
    llm_sft (swift/llm/sft.py:413)
    x_main (swift/utils/run_utils.py:32)
    <module> (swift/cli/sft.py:5)
Thread 177190 (idle): "QueueFeederThread"
    wait (threading.py:320)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:953)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 177191 (idle): "Thread-3 (_serve)"
    accept (socket.py:293)
    accept (multiprocessing/connection.py:609)
    accept (multiprocessing/connection.py:463)
    _serve (multiprocessing/resource_sharer.py:138)
    run (threading.py:953)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
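To compare many dumps at once, it may be enough to look at the innermost frame of each process's MainThread. Below is a small sketch that extracts it from py-spy text output, assuming the "Process ... / Thread ... / indented frames" layout shown above; `main_thread_top_frames` is a hypothetical helper, not part of py-spy.

```python
import re

PROCESS_RE = re.compile(r"^Process (\d+):")
# The thread id may be decimal or hex, and the state may be e.g. "active+gil".
THREAD_RE = re.compile(r'^Thread (\S+) \(([^)]+)\): "(.+)"')


def main_thread_top_frames(dump_text):
    """Map each process ID to (state, innermost frame) of its MainThread."""
    summary = {}
    pid = None
    state = None
    want_frame = False
    for line in dump_text.splitlines():
        proc = PROCESS_RE.match(line)
        if proc:
            pid = int(proc.group(1))
            want_frame = False
            continue
        thread = THREAD_RE.match(line)
        if thread:
            state = thread.group(2)
            want_frame = thread.group(3) == "MainThread"
            continue
        # The first indented line after the thread header is the innermost frame.
        if want_frame and pid is not None and line.startswith(" ") and line.strip():
            summary[pid] = (state, line.strip())
            want_frame = False
    return summary


sample = """Process 175053: /usr/local/bin/python -u swift/cli/sft.py
Thread 175053 (active): "MainThread"
    synchronize (torch/cuda/__init__.py:792)
    backward (deepspeed/runtime/engine.py:1993)
Thread 175383 (idle): "Thread-1"
    wait (threading.py:324)
Process 177063: /usr/local/bin/python -u swift/cli/sft.py
Thread 177063 (idle): "MainThread"
    select (selectors.py:416)
    _worker_loop (torch/utils/data/_utils/worker.py:275)
"""
summary = main_thread_top_frames(sample)
for pid, (state, frame) in summary.items():
    print(pid, state, frame)
```

In the two dumps above, the first process is blocked in `synchronize` during the DeepSpeed gradient reduction, while the second stack passes through `_worker_loop`, which suggests it is a forked DataLoader worker waiting for work rather than a training rank.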
Nioolek commented 1 month ago

Same problem here. With --freeze_vit false, training hangs; with --freeze_vit true, it trains normally.

Jintao-Huang commented 1 month ago

https://github.com/modelscope/ms-swift/pull/2114

yunkchen commented 1 month ago

After pulling the latest code and updating to transformers==4.45.0 and accelerate==0.34.2, training still hangs:

Train:   0%|          | 0/40340 [00:00<?, ?it/s][WARNING:swift] Current length of row(2130) is larger than the max_length(2048), deleted.
[WARNING:swift] Current length of row(3365) is larger than the max_length(2048), deleted.
[INFO:swift] Using environment variable `NFRAMES`, Setting nframes: 24.
[INFO:swift] Setting fps: None. You can adjust this hyperparameter through the environment variable: `FPS`.
[INFO:swift] Setting min_pixels: 100352. You can adjust this hyperparameter through the environment variable: `MIN_PIXELS`.
[INFO:swift] Setting total_pixels: 19267584. You can adjust this hyperparameter through the environment variable: `TOTAL_PIXELS`.
[INFO:swift] Using environment variable `NFRAMES`, Setting nframes: 24.
[INFO:swift] Setting fps: None. You can adjust this hyperparameter through the environment variable: `FPS`.
[INFO:swift] Setting min_pixels: 100352. You can adjust this hyperparameter through the environment variable: `MIN_PIXELS`.
[INFO:swift] Setting total_pixels: 19267584. You can adjust this hyperparameter through the environment variable: `TOTAL_PIXELS`.
[INFO:swift] Using environment variable `NFRAMES`, Setting nframes: 24.
[INFO:swift] Setting fps: None. You can adjust this hyperparameter through the environment variable: `FPS`.
[INFO:swift] Setting min_pixels: 100352. You can adjust this hyperparameter through the environment variable: `MIN_PIXELS`.
[INFO:swift] Setting total_pixels: 19267584. You can adjust this hyperparameter through the environment variable: `TOTAL_PIXELS`.
[ERROR:swift] Error occurs in lazy tokenize: File not found: /mnt_wg/zhoumo.xjq/TDS1M/video/335337510318.mp4
[INFO:swift] Using environment variable `NFRAMES`, Setting nframes: 24.
[INFO:swift] Setting fps: None. You can adjust this hyperparameter through the environment variable: `FPS`.
[INFO:swift] Setting min_pixels: 100352. You can adjust this hyperparameter through the environment variable: `MIN_PIXELS`.
[INFO:swift] Setting total_pixels: 19267584. You can adjust this hyperparameter through the environment variable: `TOTAL_PIXELS`.
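The log above reports a video file that could not be found during lazy tokenization. Whether that is related to the hang is not established in this thread, but missing files are cheap to rule out before training. Here is a minimal sketch, assuming a JSONL dataset whose rows list local media paths under `images` or `videos`; those field names and the helper `find_missing_media` are assumptions to adjust to the actual dataset layout.

```python
import json
import os


def find_missing_media(jsonl_path, keys=("images", "videos")):
    """Return (line_number, path) pairs for local media files that do not exist.

    `keys` are hypothetical field names; change them to match your dataset.
    """
    missing = []
    with open(jsonl_path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            row = json.loads(line)
            for key in keys:
                value = row.get(key) or []
                # Accept either a single path string or a list of paths.
                paths = [value] if isinstance(value, str) else value
                for path in paths:
                    # Remote URLs cannot be checked on the local filesystem.
                    if path.startswith(("http://", "https://")):
                        continue
                    if not os.path.isfile(path):
                        missing.append((line_no, path))
    return missing
```

Running `find_missing_media("qwen2-vl-val.jsonl")` on the training node lists the rows that reference absent files, so they can be fixed or dropped before the run starts.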
Jintao-Huang commented 1 month ago

Run pip list | grep swift and check the output.

yunkchen commented 1 month ago

root@dlcprsc93a7i8zci-master-0:~# pip show ms-swift
Name: ms-swift
Version: 2.5.0.dev0
Summary: Swift: Scalable lightWeight Infrastructure for Fine-Tuning
Home-page: https://github.com/modelscope/swift
Author: DAMO ModelScope teams
Author-email: contact@modelscope.cn
License: Apache License 2.0
Location: /root/swift
Editable project location: /root/swift
Requires: accelerate, addict, aiohttp, attrdict, binpacking, dacite, datasets, einops, importlib_metadata, jieba, matplotlib, modelscope, nltk, numpy, oss2, pandas, peft, requests, rouge, safetensors, tensorboard, tqdm, transformers, transformers_stream_generator, trl
Required-by:
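Since several package versions come up in this thread (torch, transformers, accelerate, ms-swift), it may be handy to collect them in one go. This is a small sketch using the standard library's `importlib.metadata`; the package list is only an example, and `installed_versions` is a made-up helper name.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions(
    ["ms-swift", "transformers", "accelerate", "deepspeed", "torch"]
)
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

Pasting this output into a report gives maintainers the same information as several separate pip show calls.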