hiyouga / LLaMA-Factory

Unify Efficient Fine-Tuning of 100+ LLMs
Apache License 2.0

Training fails on Huawei NPU using the example training script and the official image #4610

Closed apachemycat closed 3 days ago

apachemycat commented 4 days ago

Reminder

System Info

Reproduction

llamafactory-cli train /models/llama-factory-llama3-train/llama3_lora_sft.yaml

RuntimeError: call aclnnCast failed, detail:EZ9999: Inner Error!
EZ9999: 2024-06-28-13:05:55.631.066 Parse dynamic kernel config fail.
TraceBack (most recent call last):
AclOpKernelInit failed opType
Op Cast does not has any binary.
Kernel Run failed. opType: 3, Cast
launch failed for Cast, errno:561000.

[ERROR] 2024-06-28-13:05:55 (PID:11591, Device:0, RankID:-1) ERR01005 OPS internal error

Expected behavior

Training runs normally.

Others

No response

apachemycat commented 4 days ago

06/28/2024 13:24:49 - INFO - llamafactory.model.model_utils.checkpointing - Gradient checkpointing enabled.
06/28/2024 13:24:49 - INFO - llamafactory.model.model_utils.attention - Using vanilla attention implementation.
06/28/2024 13:24:49 - INFO - llamafactory.model.adapter - Upcasting trainable params to float32.
06/28/2024 13:24:49 - INFO - llamafactory.model.adapter - Fine-tuning method: LoRA
06/28/2024 13:24:49 - INFO - llamafactory.model.model_utils.misc - Found linear modules: o_proj,v_proj,down_proj,up_proj,q_proj,k_proj,gate_proj
Traceback (most recent call last):
  File "/usr/local/bin/llamafactory-cli", line 8, in <module>
    sys.exit(main())
  File "/LLaMA-Factory/src/llamafactory/cli.py", line 111, in main
    run_exp()
  File "/LLaMA-Factory/src/llamafactory/train/tuner.py", line 50, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 49, in run_sft
    model = load_model(tokenizer, model_args, finetuning_args, training_args.do_train)
  File "/LLaMA-Factory/src/llamafactory/model/loader.py", line 161, in load_model
    model = init_adapter(config, model, model_args, finetuning_args, is_trainable)
  File "/LLaMA-Factory/src/llamafactory/model/adapter.py", line 310, in init_adapter
    model = _setup_lora_tuning(
  File "/LLaMA-Factory/src/llamafactory/model/adapter.py", line 265, in _setup_lora_tuning
    param.data = param.data.to(torch.float32)
RuntimeError: call aclnnCast failed, detail:EZ9999: Inner Error!
EZ9999: 2024-06-28-13:24:50.334.697 Parse dynamic kernel config fail.
TraceBack (most recent call last):
AclOpKernelInit failed opType
Op Cast does not has any binary.
Kernel Run failed. opType: 3, Cast
launch failed for Cast, errno:561000.

MengqingCao commented 4 days ago

Which image are you using? This problem is most likely caused by a missing operator (kernel) package.

shink commented 4 days ago

Could you please provide the image information? Thanks.

apachemycat commented 3 days ago

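Thanks, it's solved. I asked Huawei technical staff: drivers in the npu-smi 23.rc series cannot be used with CANN version: 8.0., but after upgrading to npu-smi 24.1.rc1 the driver is compatible with CANN version: 8.0.RC. With that driver, the base image of the NPU image you provide needs to be changed accordingly to FROM cosdt/cann:8.0.rc1-910-openeuler22.03 so that it matches the NPU driver.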
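
For anyone hitting the same error, the pairing rule reported above can be written as a small pre-flight check. This is only a sketch: it encodes the two driver/CANN pairings mentioned in this thread and nothing else, so it is not an official compatibility matrix. The function names `cann_version_from_image` and `check_pairing` are hypothetical, and the assumption that the CANN version is the part of the image tag before the first `-` is inferred from the single tag quoted above.

```python
import re


def cann_version_from_image(image: str) -> str:
    """Extract the CANN version from an image reference.

    Assumes (from the one tag quoted in this thread) that the tag looks like
    '<cann-version>-<chip>-<os>', e.g. 'cosdt/cann:8.0.rc1-910-openeuler22.03'.
    """
    tag = image.split(":", 1)[1]
    return tag.split("-", 1)[0]


def check_pairing(driver_version: str, cann_version: str) -> str:
    """Return 'ok', 'mismatch', or 'unknown' for a driver/CANN pairing.

    Only the pairings reported in this thread are encoded; every other
    combination is reported as 'unknown' rather than guessed.
    """
    driver = driver_version.strip().lower()
    cann = cann_version.strip().lower()

    if not cann.startswith("8.0"):
        return "unknown"
    # Reported as failing: 23.x release-candidate drivers with CANN 8.0.
    if re.match(r"23\.(\d+\.)?rc", driver):
        return "mismatch"
    # Reported as working: driver 24.1.rc1 with CANN 8.0.RC builds.
    if driver == "24.1.rc1" and cann.startswith("8.0.rc"):
        return "ok"
    return "unknown"


if __name__ == "__main__":
    image = "cosdt/cann:8.0.rc1-910-openeuler22.03"
    cann = cann_version_from_image(image)
    for driver in ("23.0.rc3", "24.1.rc1"):
        print(driver, cann, check_pairing(driver, cann))
```

The driver version string has to be supplied by hand here (for example, read off the `npu-smi` output); parsing that output automatically is left out because its exact layout is not shown in this thread.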