Use PEFT or full-parameter training to finetune 400+ LLMs or 100+ MLLMs. (LLM: Qwen2.5, Llama3.2, GLM4, Internlm2.5, Yi1.5, Mistral, Baichuan2, DeepSeek, Gemma2, ...; MLLM: Qwen2-VL, Qwen2-Audio, Llama3.2-Vision, Llava, InternVL2, MiniCPM-V-2.6, GLM4v, Xcomposer2.5, Yi-VL, DeepSeek-VL, Phi3.5-Vision, ...)
Ascend 910B single-node multi-card resume-from-checkpoint issue #2085
Open
freely12 opened 2 months ago
My accelerators are Ascend 910B NPUs. I run single-node multi-card training with the following command:

```
NPROC_PER_NODE=2 \
ASCEND_RT_VISIBLE_DEVICES=0,1 \
swift sft \
    --model_type qwen1half-7b \
    --model_id_or_path /root/work/filestorage/Qwen-1.5/Qwen1.5-7B/ \
    --dataset /root/work/filestorage/law_model/model_sft/data/law_sft_test4_format.jsonl \
    --num_train_epochs 2 \
    --sft_type lora \
    --output_dir /root/work/filestorage/law_model/model_sft/output \
    --ddp_backend hccl \
    --use_flash_attn False
```

Training runs normally and checkpoints are saved. I then try to resume from a checkpoint with the following command, and it throws an error:

```
NPROC_PER_NODE=2 \
ASCEND_RT_VISIBLE_DEVICES=0,1 \
ASCEND_LAUNCH_BLOCKING=1 \
swift sft \
    --model_type qwen1half-7b \
    --model_id_or_path /root/work/filestorage/Qwen-1.5/Qwen1.5-7B/ \
    --dataset /root/work/filestorage/law_model/model_sft/data/law_sft_test4_format.jsonl \
    --num_train_epochs 2 \
    --sft_type lora \
    --output_dir /root/work/filestorage/law_model/model_sft/output \
    --resume_from_checkpoint /root/work/filestorage/law_model/model_sft/output/qwen1half-7b/v16-20240920-153225/checkpoint-58 \
    --use_flash_attn False \
    --resume_only_model False \
    --ddp_backend hccl
```

Here is the error message:

I don't know what is causing this. Training and inference both work fine; the only problem is resuming from a checkpoint. I'd appreciate an explanation, or if anyone could share a best practice for multi-node, multi-card checkpoint resumption. Thanks!
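One sanity check that may help (not from the original issue): with `--resume_only_model False`, resumption relies on the full trainer state being present next to the LoRA weights, in the format the underlying `transformers.Trainer` writes. A minimal sketch, assuming the checkpoint path from the commands above; the exact file names can vary with the swift/transformers versions:

```bash
# Hedged sketch: verify the checkpoint directory actually contains the
# trainer-state files a full resume (--resume_only_model False) needs.
# File names below are the usual transformers.Trainer outputs and are an
# assumption, not confirmed by the issue.
CKPT=/root/work/filestorage/law_model/model_sft/output/qwen1half-7b/v16-20240920-153225/checkpoint-58
ls "$CKPT"
# Typically expected for a 2-rank full resume:
#   trainer_state.json   optimizer.pt   scheduler.pt
#   rng_state_0.pth  rng_state_1.pth   (one RNG state per rank)
#   adapter_model.safetensors / adapter_config.json   (LoRA weights)
```

If any of the state files are missing (e.g. only the adapter weights were saved), a full resume will fail even though inference from the same checkpoint works.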