modelscope / swift

ms-swift: Use PEFT or Full-parameter to finetune 250+ LLMs or 35+ MLLMs. (Qwen2, GLM4, Internlm2, Yi, Llama3, Llava, MiniCPM-V, Deepseek, Baichuan2, Gemma2, Phi3-Vision, ...)
https://github.com/modelscope/swift/blob/main/docs/source/LLM/index.md
Apache License 2.0
2.13k stars 205 forks

Training process hangs #1204

Closed zkyredstart closed 3 hours ago

zkyredstart commented 1 week ago

Describe the bug: What the bug is, and how to reproduce, better with screenshots. When running SFT on Qwen1.5 with LoRA, the training log stops updating and GPU utilization stays at 0.

image

It stays in this state indefinitely. If I take only the first 100 samples from the dataset, training runs. What is the cause?
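For anyone trying to reproduce this, the "first 100 samples" experiment can be scripted so that the subset size is easy to vary. This is only a sketch: it assumes the dataset is a single JSON file whose top level is an array of records, which the report does not state, and `take_first` and `write_subset` are hypothetical helper names, not part of ms-swift.

```python
import json
from pathlib import Path


def take_first(records, n):
    """Return the first n records (or all of them if there are fewer)."""
    return records[:n]


def write_subset(src, dst, n):
    """Copy the first n records of a JSON-array file to a new file.

    Assumption: `src` holds a top-level JSON array; the issue does not
    confirm the actual dataset format.
    """
    records = json.loads(Path(src).read_text(encoding="utf-8"))
    if not isinstance(records, list):
        raise ValueError(f"{src}: expected a top-level JSON array")
    subset = take_first(records, n)
    Path(dst).write_text(
        json.dumps(subset, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return len(subset)
```

Calling `write_subset("train.json", "train_100.json", 100)` and then repeating with larger values of `n` would show roughly at what size the run stops making progress. It does not by itself explain why.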

Your hardware and system info: Write your system info like CUDA version/system/GPU/torch version here.

Additional context: Add any other context about the problem here.

zkyredstart commented 1 week ago

The dataset train.json is a merge of train1.json, train2.json, and train3.json. Each of the three subsets trains normally on its own, but the merged file does not.
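One way to rule out a problem introduced by the merge itself is to rebuild train.json programmatically and check whether all records share the same fields. The sketch below assumes each file is a top-level JSON array of objects, which is not confirmed in this thread. The helper names are hypothetical and nothing here calls ms-swift.

```python
import json
from collections import Counter
from pathlib import Path


def load_records(path):
    """Load one dataset file, assuming a top-level JSON array."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(data, list):
        raise ValueError(f"{path}: expected a top-level JSON array")
    return data


def merge_records(*record_lists):
    """Concatenate several record lists, preserving their order."""
    merged = []
    for records in record_lists:
        merged.extend(records)
    return merged


def key_signatures(records):
    """Count how many records share each sorted tuple of field names."""
    return Counter(
        tuple(sorted(r)) if isinstance(r, dict) else ("<non-object>",)
        for r in records
    )
```

For example, `key_signatures(merge_records(load_records("train1.json"), load_records("train2.json"), load_records("train3.json")))` returns one entry per distinct field layout. More than one entry would mean the three subsets do not use identical fields, which is worth checking before looking at the training side. A single entry only shows the merge is structurally consistent, not that the data is the cause or is ruled out.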