Closed apachemycat closed 3 days ago
06/28/2024 13:24:49 - INFO - llamafactory.model.model_utils.checkpointing - Gradient checkpointing enabled.
06/28/2024 13:24:49 - INFO - llamafactory.model.model_utils.attention - Using vanilla attention implementation.
06/28/2024 13:24:49 - INFO - llamafactory.model.adapter - Upcasting trainable params to float32.
06/28/2024 13:24:49 - INFO - llamafactory.model.adapter - Fine-tuning method: LoRA
06/28/2024 13:24:49 - INFO - llamafactory.model.model_utils.misc - Found linear modules: o_proj,v_proj,down_proj,up_proj,q_proj,k_proj,gate_proj
Traceback (most recent call last):
File "/usr/local/bin/llamafactory-cli", line 8, in
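Which image are you using? This problem is most likely caused by a missing operator package.
Could you please provide the image information? Thanks.
Thanks, it's solved. I asked Huawei technical staff: drivers in the npu-smi 23.rc series cannot be used with CANN version: 8.0. After upgrading to npu-smi 24.1.rc1, the driver is compatible with CANN version: 8.0.RC. With that driver, the base image of the NPU image you provide needs to be changed accordingly to FROM cosdt/cann:8.0.rc1-910-openeuler22.03 so that it matches the NPU driver.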
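For anyone hitting the same aclnnCast error, the version rule reported above can be sketched as a quick pre-flight check. This is only an illustration of that one comment, not an official compatibility matrix: the helper names and the (24, 1) threshold are assumptions derived from "23.rc fails, 24.1.rc1 works", and the driver version string is assumed to be read manually (for example from the npu-smi output).

```python
import re
from typing import Tuple

# Assumption taken from the comment above, not from an official matrix:
# 23.x drivers failed with the CANN 8.0.RC image, 24.1.rc1 worked.
MIN_DRIVER_FOR_CANN_8 = (24, 1)


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a string such as '24.1.rc1'.

    A missing numeric minor part (e.g. '23.rc3') is treated as 0.
    """
    match = re.match(r"\s*(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major = int(match.group(1))
    minor = int(match.group(2)) if match.group(2) is not None else 0
    return major, minor


def driver_ok_for_cann_8(driver_version: str) -> bool:
    """Return True if the driver is at least the version reported to work."""
    return parse_major_minor(driver_version) >= MIN_DRIVER_FOR_CANN_8


if __name__ == "__main__":
    for v in ("23.0.rc2", "24.1.rc1"):
        status = "ok" if driver_ok_for_cann_8(v) else "upgrade the driver first"
        print(f"driver {v}: {status}")
```

If the check fails, the fix described above is to upgrade the driver and then point the image's base at the matching CANN tag.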
Reminder
System Info
llamafactory version: 0.8.3.dev0
Reproduction
llamafactory-cli train /models/llama-factory-llama3-train/llama3_lora_sft.yaml
RuntimeError: call aclnnCast failed, detail:EZ9999: Inner Error! EZ9999: 2024-06-28-13:05:55.631.066 Parse dynamic kernel config fail. TraceBack (most recent call last): AclOpKernelInit failed opType Op Cast does not has any binary. Kernel Run failed. opType: 3, Cast launch failed for Cast, errno:561000.
[ERROR] 2024-06-28-13:05:55 (PID:11591, Device:0, RankID:-1) ERR01005 OPS internal error
Expected behavior
Training runs normally.
Others
No response