modelscope / ms-swift

Use PEFT or full-parameter training to finetune 400+ LLMs and 100+ MLLMs. (LLM: Qwen2.5, Llama3.2, GLM4, Internlm2.5, Yi1.5, Mistral, Baichuan2, DeepSeek, Gemma2, ...; MLLM: Qwen2-VL, Qwen2-Audio, Llama3.2-Vision, Llava, InternVL2, MiniCPM-V-2.6, GLM4v, Xcomposer2.5, Yi-VL, DeepSeek-VL, Phi3.5-Vision, ...)
https://swift.readthedocs.io/zh-cn/latest/Instruction/index.html
Apache License 2.0

Resuming training from a checkpoint fails with single-node multi-card training on Ascend 910B #2085

Open freely12 opened 2 months ago

freely12 commented 2 months ago

My accelerators are Ascend 910B NPUs. Single-node multi-card training with the following command works as expected and writes checkpoints:

    NPROC_PER_NODE=2 \
    ASCEND_RT_VISIBLE_DEVICES=0,1 \
    swift sft \
        --model_type qwen1half-7b \
        --model_id_or_path /root/work/filestorage/Qwen-1.5/Qwen1.5-7B/ \
        --dataset /root/work/filestorage/law_model/model_sft/data/law_sft_test4_format.jsonl \
        --num_train_epochs 2 \
        --sft_type lora \
        --output_dir /root/work/filestorage/law_model/model_sft/output \
        --ddp_backend hccl \
        --use_flash_attn False

But resuming from one of those checkpoints with the following command fails:

    NPROC_PER_NODE=2 \
    ASCEND_RT_VISIBLE_DEVICES=0,1 \
    ASCEND_LAUNCH_BLOCKING=1 \
    swift sft \
        --model_type qwen1half-7b \
        --model_id_or_path /root/work/filestorage/Qwen-1.5/Qwen1.5-7B/ \
        --dataset /root/work/filestorage/law_model/model_sft/data/law_sft_test4_format.jsonl \
        --num_train_epochs 2 \
        --sft_type lora \
        --output_dir /root/work/filestorage/law_model/model_sft/output \
        --resume_from_checkpoint /root/work/filestorage/law_model/model_sft/output/qwen1half-7b/v16-20240920-153225/checkpoint-58 \
        --use_flash_attn False \
        --resume_only_model False \
        --ddp_backend hccl

The error message:

[error screenshot]

I don't know what's causing this. Training and inference both work fine; the only thing that fails is resuming from a checkpoint. I'd appreciate an explanation, or a pointer to a best practice for multi-node multi-card checkpoint resuming. Thanks!
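For reference, ms-swift's `swift sft` appears to be built on the Hugging Face transformers Trainer, so `--resume_from_checkpoint` with `--resume_only_model False` corresponds to a full Trainer resume: model weights plus the optimizer, scheduler, trainer state, and per-process RNG state saved under the checkpoint directory. A minimal sketch of that generic mechanism, not of ms-swift internals; the dummy dataset and the `output/checkpoint-58` path are illustrative:

```python
# Sketch of a full Trainer resume, assuming the checkpoint directory
# exists and contains optimizer.pt, scheduler.pt, trainer_state.json,
# and rng_state files written by a previous run.
from datasets import Dataset
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen1.5-7B")

# Tiny pre-tokenized placeholder dataset so the example is self-contained.
ds = Dataset.from_dict({"input_ids": [[1, 2, 3]] * 8,
                        "labels": [[1, 2, 3]] * 8})

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="output",
                           per_device_train_batch_size=1),
    train_dataset=ds,
)

# Full resume: restores weights AND optimizer/scheduler/trainer/RNG state.
trainer.train(resume_from_checkpoint="output/checkpoint-58")
```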

Jintao-Huang commented 2 months ago

Were the previous weights trained with ZeRO-3?

Jintao-Huang commented 2 months ago

--resume_only_model true
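With `--resume_only_model true`, only the weights should be restored while the optimizer, scheduler, and RNG state start fresh, which sidesteps mismatches in saved distributed training state. For a LoRA run, a minimal sketch of what the model-only path amounts to, assuming `checkpoint-58` holds a standard PEFT adapter (`adapter_config.json` plus adapter weights), which is what an `--sft_type lora` run normally saves:

```python
# Model-only resume for a LoRA run: reload just the adapter on top of the
# base model; everything else is rebuilt from scratch by the new run.
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "/root/work/filestorage/Qwen-1.5/Qwen1.5-7B/")

model = PeftModel.from_pretrained(
    base,
    "/root/work/filestorage/law_model/model_sft/output/qwen1half-7b/"
    "v16-20240920-153225/checkpoint-58",
    is_trainable=True,  # keep the LoRA weights trainable for continued SFT
)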

freely12 commented 2 months ago

> --resume_only_model true

Hi, thanks for the reply. The previous run should not have used ZeRO-3. I changed the flag to --resume_only_model true and it still fails with the same error. Resuming from a checkpoint works in single-node single-card mode; it only fails with multiple cards.

[error screenshot]
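One way to narrow this down is to inspect what the checkpoint directory actually contains before attempting a full resume. The transformers Trainer saves one `rng_state_<rank>.pth` file per process when running distributed (a single `rng_state.pth` otherwise), so the count of those files should line up with the `NPROC_PER_NODE` used at resume time. A hedged diagnostic sketch, with the path copied from the commands above:

```python
# List the checkpoint contents and count per-rank RNG state files.
# A full resume expects optimizer.pt, scheduler.pt, trainer_state.json,
# and (under DDP) one rng_state_<rank>.pth per launched process.
import os

ckpt = ("/root/work/filestorage/law_model/model_sft/output/"
        "qwen1half-7b/v16-20240920-153225/checkpoint-58")

for name in sorted(os.listdir(ckpt)):
    print(name)

rng_files = [n for n in os.listdir(ckpt) if n.startswith("rng_state")]
print(f"RNG state saved for {len(rng_files)} process(es); "
      f"this run launches NPROC_PER_NODE=2 processes.")
```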