Closed: feria-tu closed this issue 4 months ago
The README has NPU-related configuration instructions; for further NPU discussion, you can join the group below.
Could you run `env` and paste your environment variables?
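If it helps, here is a minimal sketch for dumping just the NPU-relevant variables instead of the full `env` output (the grep pattern is only an example; `path` is included so `PATH`-like variables show up too):

```shell
# Print only the environment variables relevant to the Ascend/NPU setup.
# Extend the pattern as needed for your machine.
env | sort | grep -Ei 'ascend|cann|path'
```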
CONDA_SHLVL=2
LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64/plugin:/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/opskernel:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/nnengine:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe/op_tiling/lib/linux/aarch64:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64/plugin:/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/opskernel:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/nnengine:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe/op_tiling/lib/linux/aarch64:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64/plugin:/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/opskernel:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/nnengine:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe/op_tiling/lib/linux/aarch64:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64/plugin:/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/opskernel:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/nnengine:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe/op_tiling/lib/linux/aarch64:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64/plugin:/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/opskernel:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/nnengine:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe/op_tiling/lib/linux/aarch64:/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/driver/lib64:/usr/local/Ascend/driver/lib64/common:/usr/local/Ascend/driver/lib64/driver:/usr/local/Ascend/driver/tools/hccn_tool/:/opp/built-in/op_impl/ai_core/tbe/op_tiling/lib/linux/aarch64/:/usr/lib/aarch64-linux-gnu/hdf5/serial:/usr/local/python3.7.5/lib:
CONDA_EXE=/root/miniconda3/bin/conda
TOOLCHAIN_HOME=/usr/local/Ascend/ascend-toolkit/latest/toolkit
HOSTNAME=002
OLDPWD=/home/HwHiAiUser
CONDA_PREFIX=/root/miniconda3/envs/llama
ASCEND_AICPU_PATH=/usr/local/Ascend/ascend-toolkit/latest
_CE_M=
TBE_IMPL_PATH=/usr/local/Ascend/ascend-toolkit/latest/opp/op_impl/built-in/ai_core/tbe
CONDA_PREFIX_1=/root/miniconda3
HF_ENDPOINT=https://hf-mirror.com
PWD=/home/HwHiAiUser/LLaMA-Factory
HOME=/root
CONDA_PYTHON_EXE=/root/miniconda3/bin/python
_CE_CONDA=
ASCEND_TOOLKIT_HOME=/usr/local/Ascend/ascend-toolkit/latest
GLOG_v=2
CONDA_PROMPT_MODIFIER=(llama)
TERM=xterm
ASCEND_OPP_PATH=/usr/local/Ascend/ascend-toolkit/latest/opp
ASCEND_HOME_PATH=/usr/local/Ascend/ascend-toolkit/latest
SHLVL=1
PYTHONPATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe:/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe:/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe:/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe:/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe:/usr/local/Ascend/ascend-toolkit/latest/opp/op_impl/built-in/ai_core/tbe:
PATH=/usr/local/Ascend/ascend-toolkit/latest/bin:/usr/local/Ascend/ascend-toolkit/latest/compiler/ccec_compiler/bin:/usr/local/Ascend/ascend-toolkit/latest/tools/ccec_compiler/bin:/usr/local/Ascend/ascend-toolkit/latest/bin:/usr/local/Ascend/ascend-toolkit/latest/compiler/ccec_compiler/bin:/usr/local/Ascend/ascend-toolkit/latest/tools/ccec_compiler/bin:/usr/local/Ascend/ascend-toolkit/latest/bin:/usr/local/Ascend/ascend-toolkit/latest/compiler/ccec_compiler/bin:/usr/local/Ascend/ascend-toolkit/latest/tools/ccec_compiler/bin:/usr/local/Ascend/ascend-toolkit/latest/bin:/usr/local/Ascend/ascend-toolkit/latest/compiler/ccec_compiler/bin:/usr/local/Ascend/ascend-toolkit/latest/tools/ccec_compiler/bin:/root/miniconda3/envs/llama/bin:/root/miniconda3/condabin:/usr/local/Ascend/ascend-toolkit/latest/bin:/usr/local/Ascend/ascend-toolkit/latest/compiler/ccec_compiler/bin:/usr/local/Ascend/ascend-toolkit/latest/tools/ccec_compiler/bin:/usr/local/Ascend/ascend-toolkit/latest/ccec_compiler/bin:/usr/local/python3.7.5/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
CONDA_DEFAULT_ENV=llama
ASCEND_LAUNCHBLOCKING=1
_=/usr/bin/env
A 910B device cannot use the 32G (910A) image; you need to install the CANN kernels package that matches the 910B.
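For anyone hitting this: a sketch of how to verify the chip model and install the matching kernels. The `.run` filename below is only an example; download the 910B kernels build that matches your installed CANN toolkit version from the Ascend resource site.

```shell
# 1) Confirm the actual chip model (run on the NPU host; should report 910B, not 910A):
if command -v npu-smi >/dev/null 2>&1; then
    npu-smi info | grep -i 'name'
else
    echo "npu-smi not found; run this on the NPU host"
fi

# 2) Install the matching binary kernels package (filename is an example):
#    chmod +x Ascend-cann-kernels-910b_8.0.RC1_linux.run
#    ./Ascend-cann-kernels-910b_8.0.RC1_linux.run --install
```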
The README has NPU-related configuration instructions; for further NPU discussion, you can join the group below.
Hi, the group is full and I can't join; could you add me to it?
+1
+1
+1
+1
Reminder
Reproduction
Run
ASCEND_RT_VISIBLE_DEVICES=0 GRADIO_SHARE=1 llamafactory-cli webui
After launching the web UI and attempting to start training, it fails with the following error:

05/17/2024 01:30:26 - WARNING - llmtuner.model.utils.checkpointing - You are using the old GC format, some features (e.g. BAdam) will be invalid.
05/17/2024 01:30:26 - INFO - llmtuner.model.utils.checkpointing - Gradient checkpointing enabled.
05/17/2024 01:30:26 - INFO - llmtuner.model.utils.attention - Using vanilla attention implementation.
05/17/2024 01:30:26 - INFO - llmtuner.model.adapter - Fine-tuning method: LoRA
[E OpParamMaker.cpp:273] call aclnnCast failed, detail:EZ9999: Inner Error!
EZ9999: 2024-05-17-01:30:26.960.959 Op Cast does not has any binary.
TraceBack (most recent call last):
Kernel Run failed. opType: 53, Cast launch failed for Cast, errno:561000.
[ERROR] 2024-05-17-01:30:26 (PID:41961, Device:0, RankID:-1) ERR01005 OPS internal error
Exception raised from operator() at third_party/op-plugin/op_plugin/ops/base_ops/opapi/CastKernelNpuOpApi.cpp:33 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x68 (0xffffa7858538 in /root/miniconda3/envs/llama/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x6c (0xffffa78058a0 in /root/miniconda3/envs/llama/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x8ddac0 (0xfffdc1d3fac0 in /root/miniconda3/envs/llama/lib/python3.10/site-packages/torch_npu/lib/libtorch_npu.so)
frame #3: <unknown function> + 0xe2696c (0xfffdc228896c in /root/miniconda3/envs/llama/lib/python3.10/site-packages/torch_npu/lib/libtorch_npu.so)
frame #4: <unknown function> + 0x56b9f0 (0xfffdc19cd9f0 in /root/miniconda3/envs/llama/lib/python3.10/site-packages/torch_npu/lib/libtorch_npu.so)
frame #5: <unknown function> + 0x56be18 (0xfffdc19cde18 in /root/miniconda3/envs/llama/lib/python3.10/site-packages/torch_npu/lib/libtorch_npu.so)
frame #6: <unknown function> + 0x569e20 (0xfffdc19cbe20 in /root/miniconda3/envs/llama/lib/python3.10/site-packages/torch_npu/lib/libtorch_npu.so)
frame #7: <unknown function> + 0xafe0c (0xffffa788ae0c in /root/miniconda3/envs/llama/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: <unknown function> + 0x7088 (0xffffb1c12088 in /lib/aarch64-linux-gnu/libpthread.so.0)
Traceback (most recent call last):
  File "/root/miniconda3/envs/llama/bin/llamafactory-cli", line 8, in <module>
    sys.exit(main())
  File "/home/HwHiAiUser/LLaMA-Factory/src/llmtuner/cli.py", line 49, in main
    run_exp()
  File "/home/HwHiAiUser/LLaMA-Factory/src/llmtuner/train/tuner.py", line 33, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/home/HwHiAiUser/LLaMA-Factory/src/llmtuner/train/sft/workflow.py", line 34, in run_sft
    model = load_model(tokenizer, model_args, finetuning_args, training_args.do_train)
  File "/home/HwHiAiUser/LLaMA-Factory/src/llmtuner/model/loader.py", line 137, in load_model
    model = init_adapter(config, model, model_args, finetuning_args, is_trainable)
  File "/home/HwHiAiUser/LLaMA-Factory/src/llmtuner/model/adapter.py", line 196, in init_adapter
    param.data = param.data.to(torch.float32)
RuntimeError: The Inner error is reported as above.
Since the operator is called asynchronously, the stacktrace may be inaccurate. If you want to get the accurate stacktrace, pleace set the environment variable ASCEND_LAUNCH_BLOCKING=1.
[ERROR] 2024-05-17-01:30:26 (PID:41961, Device:0, RankID:-1) ERR00100 PTA call acl api failed
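As the log itself notes, kernels launch asynchronously, so the Python line shown may not be the real failing op. A sketch of re-running with synchronous launches before collecting the trace (note the variable name is `ASCEND_LAUNCH_BLOCKING`, with an underscore between LAUNCH and BLOCKING):

```shell
# Force synchronous kernel launches so the stack trace lands on the
# actual failing op. Debug only: this slows execution noticeably.
export ASCEND_LAUNCH_BLOCKING=1
echo "ASCEND_LAUNCH_BLOCKING=$ASCEND_LAUNCH_BLOCKING"
# Then relaunch the same reproduction command, e.g.:
# ASCEND_RT_VISIBLE_DEVICES=0 GRADIO_SHARE=1 llamafactory-cli webui
```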
Expected behavior
No response
System Info
No response
Others
No response