Wuyingwen opened this issue 2 months ago
This is strange; it should not be possible to block for 30 minutes without getting any data. Run
py-spy dump --pid xxx
and check where each process is blocked.
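A minimal sketch for dumping every rank at once, assuming each rank's process can be matched by the swift/cli/sft.py entry point (adjust the pgrep pattern for your launcher):

# Dump the Python stack of every process whose command line contains swift/cli/sft.py.
for pid in $(pgrep -f "swift/cli/sft.py"); do
    echo "=== PID ${pid} ==="
    py-spy dump --pid "${pid}"
done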
SIZE_FACTOR=28 MAX_PIXELS=100352 NFRAMES=24 \
swift sft \
--model_type qwen2-vl-7b-instruct \
--model_id_or_path Qwen2-VL-7B-Instruct \
--sft_type full \
--freeze_vit false \
--max_length 2048 \
--lazy_tokenize true \
--gradient_accumulation_step 2 \
--batch_size 1 \
--num_train_epochs 1 \
--learning_rate 1e-5 \
--weight_decay 0.1 \
--lr_scheduler_type cosine \
--warmup_ratio 0.05 \
--save_steps 200 \
--logging_steps 1 \
--dataloader_num_workers 8 \
--dataset qwen2-vl-val.jsonl \
--dataset_test_ratio 0.005 \
--output_dir qwen2-vl-7b-20240912 \
--deepspeed default-zero2
Same issue here.
Same issue here, with the same command as above.
The py-spy dumps show two main patterns across processes:
Process 175053: /usr/local/bin/python -u /usr/local/lib/python3.10/site-packages/swift/cli/sft.py --model_type qwen2-vl-7b-instruct --model_id_or_path Qwen2-VL-7B-Instruct --sft_type full --freeze_vit false --max_length 2048 --lazy_tokenize true --gradient_accumulation_step 2 --batch_size 1 --num_train_epochs 1 --learning_rate 1e-5 --weight_decay 0.1 --lr_scheduler_type cosine --warmup_ratio 0.05 --save_steps 200 --logging_steps 1 --dataloader_num_workers 1 --dataset qwen2-vl-val.jsonl --dataset_test_ratio 0.005 --output_dir qwen2-vl-7b-20240912 --deepspeed default-zero2
Python v3.10.14 (/usr/local/bin/python3.10)
Thread 175053 (active): "MainThread"
synchronize (torch/cuda/__init__.py:792)
synchronize (deepspeed/accelerator/cuda_accelerator.py:78)
independent_gradient_partition_epilogue (deepspeed/runtime/zero/stage_1_and_2.py:764)
overlapping_partition_gradients_reduce_epilogue (deepspeed/runtime/zero/stage_1_and_2.py:863)
allreduce_gradients (deepspeed/runtime/engine.py:1912)
wrapped_fn (deepspeed/utils/nvtx.py:15)
backward (deepspeed/runtime/engine.py:1993)
wrapped_fn (deepspeed/utils/nvtx.py:15)
backward (accelerate/utils/deepspeed.py:166)
backward (accelerate/accelerator.py:2151)
training_step (transformers/trainer.py:3452)
_inner_training_loop (transformers/trainer.py:2326)
train (transformers/trainer.py:1991)
train (swift/trainers/mixin.py:426)
llm_sft (swift/llm/sft.py:413)
x_main (swift/utils/run_utils.py:32)
<module> (swift/cli/sft.py:5)
Thread 175383 (idle): "Thread-1"
wait (threading.py:324)
wait (threading.py:607)
run (tqdm/_monitor.py:60)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 175888 (idle): "Thread-2"
wait (threading.py:324)
wait (threading.py:607)
run (tqdm/_monitor.py:60)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 176363 (idle): "Thread-3 (_pin_memory_loop)"
select (selectors.py:416)
wait (multiprocessing/connection.py:931)
_poll (multiprocessing/connection.py:424)
poll (multiprocessing/connection.py:257)
get (multiprocessing/queues.py:113)
do_one_step (torch/utils/data/_utils/pin_memory.py:31)
_pin_memory_loop (torch/utils/data/_utils/pin_memory.py:54)
run (threading.py:953)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 176488 (idle): "QueueFeederThread"
wait (threading.py:320)
_feed (multiprocessing/queues.py:231)
run (threading.py:953)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Process 177063: /usr/local/bin/python -u /usr/local/lib/python3.10/site-packages/swift/cli/sft.py --model_type qwen2-vl-7b-instruct --model_id_or_path Qwen2-VL-7B-Instruct --sft_type full --freeze_vit false --max_length 2048 --lazy_tokenize true --gradient_accumulation_step 2 --batch_size 1 --num_train_epochs 1 --learning_rate 1e-5 --weight_decay 0.1 --lr_scheduler_type cosine --warmup_ratio 0.05 --save_steps 200 --logging_steps 1 --dataloader_num_workers 1 --dataset qwen2-vl-val.jsonl --dataset_test_ratio 0.005 --output_dir qwen2-vl-7b-20240912 --deepspeed default-zero2
Python v3.10.14 (/usr/local/bin/python3.10)
Thread 177063 (idle): "MainThread"
select (selectors.py:416)
wait (multiprocessing/connection.py:931)
_poll (multiprocessing/connection.py:424)
poll (multiprocessing/connection.py:257)
get (multiprocessing/queues.py:113)
_worker_loop (torch/utils/data/_utils/worker.py:275)
run (multiprocessing/process.py:108)
_bootstrap (multiprocessing/process.py:314)
_launch (multiprocessing/popen_fork.py:71)
__init__ (multiprocessing/popen_fork.py:19)
_Popen (multiprocessing/context.py:281)
_Popen (multiprocessing/context.py:224)
start (multiprocessing/process.py:121)
__init__ (torch/utils/data/dataloader.py:1040)
_get_iterator (torch/utils/data/dataloader.py:387)
__iter__ (torch/utils/data/dataloader.py:439)
__iter__ (accelerate/data_loader.py:451)
_inner_training_loop (transformers/trainer.py:2284)
train (transformers/trainer.py:1991)
train (swift/trainers/mixin.py:426)
llm_sft (swift/llm/sft.py:413)
x_main (swift/utils/run_utils.py:32)
<module> (swift/cli/sft.py:5)
Thread 177190 (idle): "QueueFeederThread"
wait (threading.py:320)
_feed (multiprocessing/queues.py:231)
run (threading.py:953)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 177191 (idle): "Thread-3 (_serve)"
accept (socket.py:293)
accept (multiprocessing/connection.py:609)
accept (multiprocessing/connection.py:463)
_serve (multiprocessing/resource_sharer.py:138)
run (threading.py:953)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
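Reading the two dumps: the training process (175053) is blocked in torch.cuda.synchronize() inside DeepSpeed's ZeRO-2 gradient-reduce epilogue, which is consistent with it waiting on a collective that some other rank never entered, while process 177063 is just a dataloader worker idling on its index queue. To see what sits underneath the Python frames on the stuck rank, py-spy can also unwind native frames (assuming a Linux host where native unwinding is available):

# Native frames show whether the rank is sitting inside an NCCL/CUDA call rather than Python.
py-spy dump --native --pid 175053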
Same issue here.
Same problem: with --freeze_vit false the training hangs, and with --freeze_vit true it trains normally.
#2114
After pulling the latest code and upgrading to transformers==4.45.0 and accelerate==0.34.2, training still hangs.
Train: 0%| | 0/40340 [00:00<?, ?it/s][WARNING:swift] Current length of row(2130) is larger than the max_length(2048), deleted.
[WARNING:swift] Current length of row(3365) is larger than the max_length(2048), deleted.
[INFO:swift] Using environment variable `NFRAMES`, Setting nframes: 24.
[INFO:swift] Setting fps: None. You can adjust this hyperparameter through the environment variable: `FPS`.
[INFO:swift] Setting min_pixels: 100352. You can adjust this hyperparameter through the environment variable: `MIN_PIXELS`.
[INFO:swift] Setting total_pixels: 19267584. You can adjust this hyperparameter through the environment variable: `TOTAL_PIXELS`.
[INFO:swift] Using environment variable `NFRAMES`, Setting nframes: 24.
[INFO:swift] Setting fps: None. You can adjust this hyperparameter through the environment variable: `FPS`.
[INFO:swift] Setting min_pixels: 100352. You can adjust this hyperparameter through the environment variable: `MIN_PIXELS`.
[INFO:swift] Setting total_pixels: 19267584. You can adjust this hyperparameter through the environment variable: `TOTAL_PIXELS`.
[INFO:swift] Using environment variable `NFRAMES`, Setting nframes: 24.
[INFO:swift] Setting fps: None. You can adjust this hyperparameter through the environment variable: `FPS`.
[INFO:swift] Setting min_pixels: 100352. You can adjust this hyperparameter through the environment variable: `MIN_PIXELS`.
[INFO:swift] Setting total_pixels: 19267584. You can adjust this hyperparameter through the environment variable: `TOTAL_PIXELS`.
[ERROR:swift] Error occurs in lazy tokenize: File not found: /mnt_wg/zhoumo.xjq/TDS1M/video/335337510318.mp4
[INFO:swift] Using environment variable `NFRAMES`, Setting nframes: 24.
[INFO:swift] Setting fps: None. You can adjust this hyperparameter through the environment variable: `FPS`.
[INFO:swift] Setting min_pixels: 100352. You can adjust this hyperparameter through the environment variable: `MIN_PIXELS`.
[INFO:swift] Setting total_pixels: 19267584. You can adjust this hyperparameter through the environment variable: `TOTAL_PIXELS`.
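The lazy-tokenize error above indicates at least one sample references a video that is missing on disk. It may be worth scanning the dataset for missing files before training; a rough sketch, assuming the jsonl stores absolute .mp4 paths in double quotes (adjust the pattern for other extensions or relative paths):

# Print every referenced .mp4 that does not exist on this node.
grep -oE '"/[^"]+\.mp4"' qwen2-vl-val.jsonl | tr -d '"' | sort -u | \
while read -r f; do
    [ -f "$f" ] || echo "missing: $f"
done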
Please run pip list | grep swift and share the output.
root@dlcprsc93a7i8zci-master-0:~# pip show ms-swift
Name: ms-swift
Version: 2.5.0.dev0
Summary: Swift: Scalable lightWeight Infrastructure for Fine-Tuning
Home-page: https://github.com/modelscope/swift
Author: DAMO ModelScope teams
Author-email: contact@modelscope.cn
License: Apache License 2.0
Location: /root/swift
Editable project location: /root/swift
Requires: accelerate, addict, aiohttp, attrdict, binpacking, dacite, datasets, einops, importlib_metadata, jieba, matplotlib, modelscope, nltk, numpy, oss2, pandas, peft, requests, rouge, safetensors, tensorboard, tqdm, transformers, transformers_stream_generator, trl
Required-by:
Describe the bug
When fine-tuning MiniCPM-V-2.6 with the swift sft command, training suddenly fails partway through with: Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [E ProcessGroupNCCL.cpp:916] [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1250, OpType=ALLREDUCE, NumelIn=20280320, NumelOut=20280320, Timeout(ms)=1800000) ran for 1800782 milliseconds before timing out. terminate called after throwing an instance of 'std::runtime_error'
The error means the other ranks were waiting for one GPU to finish its computation so the all_reduce could run, but that GPU got stuck and never finished, so the collective eventually timed out. If the problem were a bad sample, it should be possible to skip it at the data-loading stage, so how do you debug a hang that happens during the GPU computation itself? My command:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 NPROC_PER_NODE=8 swift sft \
  --model_type minicpm-v-v2_6-chat \
  --model_id_or_path ../checkpoint/openbmb/MiniCPM-V-2_6 \
  --sft_type lora \
  --dataset xxx.json \
  --save_steps 50 \
  --val_dataset xxx.json \
  --deepspeed default-zero2
torch version: 2.1.2+cu118. Partway through training:
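One way to narrow this down is to make NCCL log verbosely and surface the failure promptly instead of hanging until the 30-minute watchdog timeout, then inspect the stuck rank with py-spy as shown earlier. A sketch of the environment, assuming the torch 2.1.x variable names (later releases rename NCCL_ASYNC_ERROR_HANDLING with a TORCH_ prefix):

# Verbose NCCL logging plus prompt error propagation; re-run the same swift sft command with these set.
export NCCL_DEBUG=INFO
export NCCL_ASYNC_ERROR_HANDLING=1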