modelscope / ms-swift

Use PEFT or full-parameter training to finetune 400+ LLMs and 100+ MLLMs. (LLM: Qwen2.5, Llama3.2, GLM4, Internlm2.5, Yi1.5, Mistral, Baichuan2, DeepSeek, Gemma2, ...; MLLM: Qwen2-VL, Qwen2-Audio, Llama3.2-Vision, Llava, InternVL2, MiniCPM-V-2.6, GLM4v, Xcomposer2.5, Yi-VL, DeepSeek-VL, Phi3.5-Vision, ...)
https://swift.readthedocs.io/zh-cn/latest/Instruction/index.html
Apache License 2.0

Resuming training from a checkpoint fails with single-node multi-card training on Ascend 910B #2085

Open freely12 opened 2 months ago

freely12 commented 2 months ago

My accelerators are Ascend 910B NPUs. Single-node multi-card training with the following command works as expected and writes checkpoints:

    NPROC_PER_NODE=2 \
    ASCEND_RT_VISIBLE_DEVICES=0,1 \
    swift sft \
        --model_type qwen1half-7b \
        --model_id_or_path /root/work/filestorage/Qwen-1.5/Qwen1.5-7B/ \
        --dataset /root/work/filestorage/law_model/model_sft/data/law_sft_test4_format.jsonl \
        --num_train_epochs 2 \
        --sft_type lora \
        --output_dir /root/work/filestorage/law_model/model_sft/output \
        --ddp_backend hccl \
        --use_flash_attn False

But resuming from one of those checkpoints with the following command fails:

    NPROC_PER_NODE=2 \
    ASCEND_RT_VISIBLE_DEVICES=0,1 \
    ASCEND_LAUNCH_BLOCKING=1 \
    swift sft \
        --model_type qwen1half-7b \
        --model_id_or_path /root/work/filestorage/Qwen-1.5/Qwen1.5-7B/ \
        --dataset /root/work/filestorage/law_model/model_sft/data/law_sft_test4_format.jsonl \
        --num_train_epochs 2 \
        --sft_type lora \
        --output_dir /root/work/filestorage/law_model/model_sft/output \
        --resume_from_checkpoint /root/work/filestorage/law_model/model_sft/output/qwen1half-7b/v16-20240920-153225/checkpoint-58 \
        --use_flash_attn False \
        --resume_only_model False \
        --ddp_backend hccl

The error message:

[error screenshot]

I don't know what's causing this. Training and inference both work fine; the only thing that fails is resuming from a checkpoint. I'd appreciate an explanation, or a pointer to a best practice for multi-node multi-card checkpoint resuming. Thanks!
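For reference, ms-swift's `swift sft` appears to be built on the Hugging Face transformers Trainer, so `--resume_from_checkpoint` with `--resume_only_model False` corresponds to a full Trainer resume: model weights plus the optimizer, scheduler, trainer state, and per-process RNG state saved under the checkpoint directory. A minimal sketch of that generic mechanism, not of ms-swift internals; the dummy dataset and the `output/checkpoint-58` path are illustrative:

```python
# Sketch of a full Trainer resume, assuming the checkpoint directory
# exists and contains optimizer.pt, scheduler.pt, trainer_state.json,
# and rng_state files written by a previous run.
from datasets import Dataset
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen1.5-7B")

# Tiny pre-tokenized placeholder dataset so the example is self-contained.
ds = Dataset.from_dict({"input_ids": [[1, 2, 3]] * 8,
                        "labels": [[1, 2, 3]] * 8})

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="output",
                           per_device_train_batch_size=1),
    train_dataset=ds,
)

# Full resume: restores weights AND optimizer/scheduler/trainer/RNG state.
trainer.train(resume_from_checkpoint="output/checkpoint-58")
```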

Jintao-Huang commented 2 months ago

Were the previous weights trained with ZeRO-3?

Jintao-Huang commented 2 months ago

--resume_only_model true
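With `--resume_only_model true`, only the weights should be restored while the optimizer, scheduler, and RNG state start fresh, which sidesteps mismatches in saved distributed training state. For a LoRA run, a minimal sketch of what the model-only path amounts to, assuming `checkpoint-58` holds a standard PEFT adapter (`adapter_config.json` plus adapter weights), which is what an `--sft_type lora` run normally saves:

```python
# Model-only resume for a LoRA run: reload just the adapter on top of the
# base model; everything else is rebuilt from scratch by the new run.
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "/root/work/filestorage/Qwen-1.5/Qwen1.5-7B/")

model = PeftModel.from_pretrained(
    base,
    "/root/work/filestorage/law_model/model_sft/output/qwen1half-7b/"
    "v16-20240920-153225/checkpoint-58",
    is_trainable=True,  # keep the LoRA weights trainable for continued SFT
)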

freely12 commented 2 months ago

> --resume_only_model true

Hi, thanks for the reply. The previous run should not have used ZeRO-3. I changed the flag to --resume_only_model true and it still fails with the same error. Resuming from a checkpoint works in single-node single-card mode; it only fails with multiple cards.

[error screenshot]
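One way to narrow this down is to inspect what the checkpoint directory actually contains before attempting a full resume. The transformers Trainer saves one `rng_state_<rank>.pth` file per process when running distributed (a single `rng_state.pth` otherwise), so the count of those files should line up with the `NPROC_PER_NODE` used at resume time. A hedged diagnostic sketch, with the path copied from the commands above:

```python
# List the checkpoint contents and count per-rank RNG state files.
# A full resume expects optimizer.pt, scheduler.pt, trainer_state.json,
# and (under DDP) one rng_state_<rank>.pth per launched process.
import os

ckpt = ("/root/work/filestorage/law_model/model_sft/output/"
        "qwen1half-7b/v16-20240920-153225/checkpoint-58")

for name in sorted(os.listdir(ckpt)):
    print(name)

rng_files = [n for n in os.listdir(ckpt) if n.startswith("rng_state")]
print(f"RNG state saved for {len(rng_files)} process(es); "
      f"this run launches NPROC_PER_NODE=2 processes.")
```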