modelscope / ms-swift

Use PEFT or Full-parameter to finetune 400+ LLMs or 100+ MLLMs. (LLM: Qwen2.5, Llama3.2, GLM4, Internlm2.5, Yi1.5, Mistral, Baichuan2, DeepSeek, Gemma2, ...; MLLM: Qwen2-VL, Qwen2-Audio, Llama3.2-Vision, Llava, InternVL2, MiniCPM-V-2.6, GLM4v, Xcomposer2.5, Yi-VL, DeepSeek-VL, Phi3.5-Vision, ...)
https://swift.readthedocs.io/zh-cn/latest/Instruction/index.html
Apache License 2.0
4.21k stars, 370 forks

NCCL timeout when model training reaches a fixed step #2359

Closed samaritan1998 closed 13 hours ago

samaritan1998 commented 2 weeks ago

Describe the bug
NCCL times out when model training reaches a fixed step. (Attached screenshot: 6512a21092e02368ce384707d830cf8b)

Your hardware and system info
CUDA version: 12.2
GPU: 8x A100 40G
torch version: 2.4.0
accelerate: 0.34.0

Additional context
(None provided.)

samaritan1998 commented 1 week ago

Setting the batch size to 1 fixed it :) Strange.
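
The thread does not identify a root cause. For anyone who hits a similar timeout at a fixed step, a minimal sketch of turning on verbose distributed logging before relaunching is shown below. The helper name `nccl_debug_env` is hypothetical and not part of ms-swift. `NCCL_DEBUG` and `TORCH_DISTRIBUTED_DEBUG` are standard NCCL and PyTorch environment variables. The training command itself is not given in the issue, so it is left as a placeholder.

```python
import os


def nccl_debug_env(base_env=None):
    """Return a copy of the environment with verbose distributed logging enabled.

    Hypothetical helper for illustration only; it is not part of ms-swift.
    """
    env = dict(os.environ if base_env is None else base_env)
    # NCCL_DEBUG=INFO makes NCCL print informational logs on each rank.
    env["NCCL_DEBUG"] = "INFO"
    # TORCH_DISTRIBUTED_DEBUG=DETAIL enables PyTorch's most verbose
    # distributed debugging checks and logging.
    env["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"
    return env


if __name__ == "__main__":
    env = nccl_debug_env()
    # Show only the variables this helper sets.
    print({k: env[k] for k in ("NCCL_DEBUG", "TORCH_DISTRIBUTED_DEBUG")})
    # To use it, pass `env=env` to subprocess.run([...]) with your own
    # training command; that command is not shown in this issue.
```

Comparing the per-rank logs around the failing step may show which rank stopped entering the collective. This could help explain why a batch size of 1 avoids the timeout.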