NCCL timeout when model training reaches a fixed step

modelscope / ms-swift

Use PEFT or Full-parameter to finetune 400+ LLMs or 100+ MLLMs. (LLM: Qwen2.5, Llama3.2, GLM4, Internlm2.5, Yi1.5, Mistral, Baichuan2, DeepSeek, Gemma2, ...; MLLM: Qwen2-VL, Qwen2-Audio, Llama3.2-Vision, Llava, InternVL2, MiniCPM-V-2.6, GLM4v, Xcomposer2.5, Yi-VL, DeepSeek-VL, Phi3.5-Vision, ...)

Apache License 2.0

4.21k stars, 370 forks

Describe the bug
When model training reaches a fixed step, NCCL times out. (Screenshot: 6512a21092e02368ce384707d830cf8b)

Your hardware and system info

CUDA version: 12.2
GPU: 8x A100 40G
torch version: 2.4.0
accelerate: 0.34.0
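For anyone reproducing this, the version details above can be gathered with a small script. This is only a sketch, not part of ms-swift: the helper names `package_version`, `gpu_summary`, and `collect_report` are made up for illustration, and it assumes `nvidia-smi` is on the PATH for the GPU line. Note that `torch.version.cuda` reports the CUDA version torch was built against, which may differ from the driver's.

```python
import shutil
import subprocess
from importlib import metadata


def package_version(name: str) -> str:
    """Return the installed version of a package, or 'not installed'."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def torch_cuda_version() -> str:
    """CUDA version torch was built with (not necessarily the driver's)."""
    try:
        import torch
    except ImportError:
        return "torch not installed"
    return str(torch.version.cuda)


def gpu_summary() -> str:
    """Ask nvidia-smi for GPU names and memory; degrade gracefully without it."""
    if shutil.which("nvidia-smi") is None:
        return "nvidia-smi not found"
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            timeout=30,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return "nvidia-smi failed"
    if result.returncode != 0:
        return "nvidia-smi failed"
    lines = [line.strip() for line in result.stdout.splitlines() if line.strip()]
    return "; ".join(lines)


def collect_report() -> dict:
    """Gather the fields the issue template asks for into one dict of strings."""
    report = {name: package_version(name) for name in ("torch", "accelerate")}
    report["cuda"] = torch_cuda_version()
    report["gpu"] = gpu_summary()
    return report


if __name__ == "__main__":
    for key, value in collect_report().items():
        print(f"{key}: {value}")
```

Running it on each node and comparing the output is a quick way to rule out version mismatches between ranks before digging into the timeout itself.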

Additional context


NCCL timeout when model training reaches a fixed step #2359