modelscope / swift

ms-swift: Use PEFT or Full-parameter to finetune 300+ LLMs or 40+ MLLMs. (Qwen2, GLM4, Internlm2.5, Yi, Llama3, Llava, MiniCPM-V, Deepseek, Baichuan2, Gemma2, Phi3-Vision, ...)
https://github.com/modelscope/swift/blob/main/docs/source/LLM/index.md
Apache License 2.0
2.24k stars, 215 forks

SimPO fine-tuning error #1101

Open zhangfan-algo opened 4 weeks ago

zhangfan-algo commented 4 weeks ago

Describe the bug

2024-06-07 16:13:25 frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fe5ccc77c62 in /root/anaconda3/envs/swift/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
2024-06-07 16:13:25 frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fe5ccc7ca80 in /root/anaconda3/envs/swift/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
2024-06-07 16:13:25 frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fe5ccc7ddcc in /root/anaconda3/envs/swift/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
2024-06-07 16:13:25 frame #4: + 0xdc2b3 (0x7fe621cdf2b3 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
2024-06-07 16:13:25 frame #5: + 0x94b43 (0x7fe62a7acb43 in /usr/lib/x86_64-linux-gnu/libc.so.6)
2024-06-07 16:13:25 frame #6: + 0x126a00 (0x7fe62a83ea00 in /usr/lib/x86_64-linux-gnu/libc.so.6)
2024-06-07 16:13:25
2024-06-07 16:13:25 terminate called after throwing an instance of 'c10::DistBackendError'
2024-06-07 16:13:25 what(): [PG 1 Rank 6] Process group watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=843, OpType=ALLREDUCE, NumelIn=496797696, NumelOut=496797696, Timeout(ms)=600000) ran for 600071 milliseconds before timing out.
2024-06-07 16:13:25 Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
2024-06-07 16:13:25 frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fe618ccf897 in /root/anaconda3/envs/swift/lib/python3.10/site-packages/torch/lib/libc10.so)
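For readers triaging a similar log, the watchdog line is the informative part: it names the rank, the collective that stalled, the tensor size, and the configured timeout. The following is a minimal sketch, using only the Python standard library, that pulls those fields out of the message quoted above. The function name `parse_watchdog_timeout` and the regular expression are illustrative assumptions written against this one log line, not part of PyTorch or ms-swift, and other PyTorch versions may format the message differently.

```python
import re

# Watchdog message copied from the log in this issue.
LOG_LINE = (
    "[Rank 6] Watchdog caught collective operation timeout: "
    "WorkNCCL(SeqNum=843, OpType=ALLREDUCE, NumelIn=496797696, "
    "NumelOut=496797696, Timeout(ms)=600000) ran for 600071 milliseconds "
    "before timing out."
)

# Hypothetical pattern, written to match the message format shown above.
PATTERN = re.compile(
    r"\[Rank (?P<rank>\d+)\].*?"
    r"WorkNCCL\(SeqNum=(?P<seq>\d+), OpType=(?P<op>\w+), "
    r"NumelIn=(?P<numel_in>\d+), NumelOut=(?P<numel_out>\d+), "
    r"Timeout\(ms\)=(?P<timeout_ms>\d+)\) ran for (?P<elapsed_ms>\d+) milliseconds"
)


def parse_watchdog_timeout(line):
    """Return the fields of a watchdog timeout message, or None if no match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    op = fields.pop("op")  # the only non-numeric field
    parsed = {name: int(value) for name, value in fields.items()}
    parsed["op"] = op
    return parsed


info = parse_watchdog_timeout(LOG_LINE)
timeout_minutes = info["timeout_ms"] / 60_000
print(f"rank {info['rank']}: {info['op']} of {info['numel_in']} elements "
      f"exceeded the {timeout_minutes:.0f}-minute timeout")
```

Read this way, the log says that rank 6 waited the full 600000 ms (10 minutes) on an all-reduce of 496,797,696 elements. The log alone does not show why the other ranks failed to join the collective in time.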

Additional context

torchrun --nproc_per_node ${num_gpu_per_node} --master_port $MASTER_PORT --master_addr $MASTER_ADDR --node_rank $RANK --nnodes $WORLD_SIZE examples/pytorch/llm/llm_simpo.py \
    --model_cache_dir /mnt/cluster/zhangfan/study_info/swift_0522/output/qwen1half-1_8b-chat/v0-20240604-110136/checkpoint-4060 \
    --model_type qwen1half-1_8b-chat \
    --sft_type full \
    --tuner_backend swift \
    --template_type AUTO \
    --ddp_backend nccl \
    --custom_train_dataset_path /mnt/cluster/simPO_data_train_qwen1half5_1_8B_simpo_full_0605 \
    --preprocess_num_proc 60 \
    --dataloader_num_workers 60 \
    --train_dataset_sample -1 \
    --evaluation_strategy steps \
    --eval_steps 50 \
    --eval_batch_size 1 \
    --dataset_test_ratio 0.01 \
    --max_length 19500 \
    --lr_scheduler_type cosine \
    --num_train_epochs 5 \
    --save_total_limit 5 \
    --save_strategy epoch \
    --logging_steps 10 \
    --batch_size 1 \
    --check_dataset_strategy warning \
    --gradient_checkpointing true \
    --gradient_accumulation_steps 8 \
    --weight_decay 0.01 \
    --learning_rate 1e-5 \
    --max_grad_norm 0.5 \
    --warmup_ratio 0.03 \
    --use_flash_attn true \
    --push_to_hub false \
    --lazy_tokenize true \
    --deepspeed_config_path ds_z2_offload_config.json \
    --save_only_model true \
    --save_on_each_node false \
    --neftune_noise_alpha 5 \
    --dtype AUTO

hjh0119 commented 4 weeks ago

The SimPO code is being refactored. Once that is done, I'll test your case.

hjh0119 commented 2 weeks ago

Could you try again now? Download the trl source code (0.9.5.dev0).
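Before re-running the job, it can help to confirm which trl build the training environment actually picks up. This is a small sketch using the standard-library `importlib.metadata`; the helper name `installed_version` is hypothetical, and whether a source checkout reports exactly `0.9.5.dev0` depends on how it was installed.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints None if trl is not installed in the active environment.
print("trl:", installed_version("trl"))
```

Running it inside the same conda environment used for `torchrun` (here, `swift`) avoids checking a different interpreter by mistake.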

zhangfan-algo commented 2 weeks ago

OK, I'll give it a try.
