Command: llamafactory-cli webui
The error output is as follows:
[W VariableFallbackKernel.cpp:51] Warning: CAUTION: The operator 'aten::isin.Tensor_Tensor_out' is not currently supported on the NPU backend and will fall back to run on the CPU. This may have performance implications. (function npu_cpu_fallback)
EZ9999: 2024-07-15-09:43:29.442.049 The error from device(chipId:2, dieId:0), serial number is 7, there is an aivec error exception, core id is 28, error code = 0x800000, dump info: pc start: 0x124200534688, current: 0x1242005348c4, vec error info: 0xf105ff7781, mte error info: 0xd1030000cb, ifu error info: 0x57d3ae946edc0, ccu error info: 0x13f7e72e7cfa86f1, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x12424040e400.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1207]
TraceBack (most recent call last):
The extend info: errcode:(0x800000, 0, 0) errorStr: The DDR address of the MTE instruction is out of range. fixp_error0 info: 0x30000cb, fixp_error1 info: 0xd1 fsmId:0, tslot:0, thread:0, ctxid:0, blk:3, sublk:0, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1219]
The error from device(chipId:2, dieId:0), serial number is 7, there is an aivec error exception, core id is 29, error code = 0x800000, dump info: pc start: 0x124200534688, current: 0x1242005348c4, vec error info: 0xd10bd76ce4, mte error info: 0xd1030000cb, ifu error info: 0x3e9cf7fffec80, ccu error info: 0x824aec4c5d357479, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x12424040e400.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1207]
The extend info: errcode:(0x800000, 0, 0) errorStr: The DDR address of the MTE instruction is out of range. fixp_error0 info: 0x30000cb, fixp_error1 info: 0xd1 fsmId:0, tslot:0, thread:0, ctxid:0, blk:4, sublk:0, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1219]
The error from device(chipId:2, dieId:0), serial number is 7, there is an aivec error exception, core id is 25, error code = 0x800000, dump info: pc start: 0x124200534688, current: 0x1242005348c4, vec error info: 0xdf003c17b4, mte error info: 0xd1030000cb, ifu error info: 0x5a26852088580, ccu error info: 0x71cfe4fc7d3621d6, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x12424040e400.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1207]
The extend info: errcode:(0x800000, 0, 0) errorStr: The DDR address of the MTE instruction is out of range. fixp_error0 info: 0x30000cb, fixp_error1 info: 0xd1 fsmId:1, tslot:0, thread:0, ctxid:0, blk:0, sublk:0, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1219]
The error from device(chipId:2, dieId:0), serial number is 7, there is an aivec error exception, core id is 26, error code = 0x800000, dump info: pc start: 0x124200534688, current: 0x1242005348c4, vec error info: 0xf11035bdae, mte error info: 0xd1030000cb, ifu error info: 0x7860306fb8400, ccu error info: 0x66e3882017adb063, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x12424040e400.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1207]
The extend info: errcode:(0x800000, 0, 0) errorStr: The DDR address of the MTE instruction is out of range. fixp_error0 info: 0x30000cb, fixp_error1 info: 0xd1 fsmId:1, tslot:0, thread:0, ctxid:0, blk:1, sublk:0, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1219]
The error from device(chipId:2, dieId:0), serial number is 7, there is an aivec error exception, core id is 27, error code = 0x800000, dump info: pc start: 0x124200534688, current: 0x1242005348c4, vec error info: 0x3614d67886, mte error info: 0xd1030000cb, ifu error info: 0x704b2712ff0c0, ccu error info: 0x9da618017c750b95, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x12424040e400.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1207]
The extend info: errcode:(0x800000, 0, 0) errorStr: The DDR address of the MTE instruction is out of range. fixp_error0 info: 0x30000cb, fixp_error1 info: 0xd1 fsmId:1, tslot:0, thread:0, ctxid:0, blk:2, sublk:0, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1219]
The error from device(chipId:2, dieId:0), serial number is 7, there is an aivec error exception, core id is 30, error code = 0x800000, dump info: pc start: 0x124200534688, current: 0x1242005348c4, vec error info: 0x591a5ca30a, mte error info: 0xd1030000cb, ifu error info: 0x756aef2ab0fc0, ccu error info: 0x218eab3c508d485a, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x12424040e400.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1207]
The extend info: errcode:(0x800000, 0, 0) errorStr: The DDR address of the MTE instruction is out of range. fixp_error0 info: 0x30000cb, fixp_error1 info: 0xd1 fsmId:1, tslot:0, thread:0, ctxid:0, blk:5, sublk:0, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1219]
Kernel task happen error, retCode=0x31, [vector core exception].[FUNC:PreCheckTaskErr][FILE:davinic_kernel_task.cc][LINE:1201]
AIV Kernel happen error, retCode=0x31.[FUNC:GetError][FILE:stream.cc][LINE:1082]
Aicore kernel execute failed, device_id=2, stream_id=2, report_stream_id=2, task_id=57, flip_num=0, fault kernel_name=Maximum_ee98c6628030785f610b924ab1557b31_high_performance_223000000, fault kernel info ext=none, program id=34, hash=9727548023731030879.[FUNC:GetError][FILE:stream.cc][LINE:1082]
[AIC_INFO] after execute:args print end[FUNC:GetError][FILE:stream.cc][LINE:1082]
Failed to submit kernel task, retCode=0x715005e.[FUNC:LaunchKernelSubmit][FILE:context.cc][LINE:675]
kernel launch submit failed.[FUNC:LaunchKernelWithHandle][FILE:context.cc][LINE:891]
rtKernelLaunchWithHandleV2 execute failed, reason=[vector core exception][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
rtKernelLaunchWithHandleV2 failed: 507035
Exception raised from operator() at third_party/op-plugin/op_plugin/ops/base_ops/opapi/AdaptiveAvgPool2dBackwardKernelNpuOpApi.cpp:532 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x68 (0xfffed936d898 in /root/miniconda3/envs/env_llm_mixed_test/lib/python3.9/site-packages/torch/lib/libc10.so)
RuntimeError: The Inner error is reported as above.
Since the operator is called asynchronously, the stacktrace may be inaccurate. If you want to get the accurate stacktrace, pleace set the environment variable ASCEND_LAUNCH_BLOCKING=1.
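The log above repeats the same fault for several cores, so a short summary may be easier to read when comparing runs. The following is a minimal sketch, not part of any Ascend tooling: `summarize_ascend_log` is a hypothetical helper, and its regular expressions assume the exact line format quoted in this report (other CANN versions may format these lines differently).

```python
import re

# Patterns written against the log lines quoted in this report (assumption:
# the format is stable enough for a quick triage script).
CORE_RE = re.compile(r"core id is (\d+), error code = (0x[0-9a-fA-F]+)")
ERROR_STR_RE = re.compile(r"errorStr: (.*?)\s+fixp_error0")
KERNEL_RE = re.compile(r"fault kernel_name=([^,\s]+)")


def summarize_ascend_log(text: str) -> dict:
    """Collect failing core ids, error codes, error strings and fault kernels."""
    core_matches = list(CORE_RE.finditer(text))
    return {
        "core_ids": sorted({int(m.group(1)) for m in core_matches}),
        "error_codes": sorted({m.group(2) for m in core_matches}),
        "error_strings": sorted(set(ERROR_STR_RE.findall(text))),
        "fault_kernels": sorted(set(KERNEL_RE.findall(text))),
    }
```

Passing the full log text to `summarize_ascend_log` should list the affected cores, the shared `errorStr`, and the `fault kernel_name` value in one dictionary, which can then be pasted into a follow-up comment.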
Reminder
System Info
Hardware: Atlas 800T A2 training server
Driver: Ascend-hdk-910b-npu-driver_23.0.3_linux-aarch64.run
Firmware: Ascend-hdk-910b-npu-firmware_7.1.0.5.220.run
CANN: Ascend-cann-toolkit_8.0.RC2.alpha003_linux-aarch64.run
Binary operator package: Ascend-cann-kernels-910b_8.0.RC2.alpha003_linux.run
Python: 3.9
torch: 2.1.0 (aarch64 build)
torch_npu: 2.1.0 (aarch64 build)
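For anyone trying to compare environments, the Python-side versions listed above can be gathered with a small script. This is a rough sketch using only the standard library; `collect_versions` is a hypothetical helper, and the distribution names passed to it (`torch`, `torch_npu`, `transformers`, `gradio`) are assumptions about how the packages are registered in the environment. Driver, firmware and CANN versions are not covered.

```python
import platform
import sys
from importlib import metadata


def collect_versions(packages=("torch", "torch_npu", "transformers", "gradio")):
    """Return the Python version, CPU architecture and installed package versions."""
    info = {
        "python": sys.version.split()[0],
        "machine": platform.machine(),  # e.g. the CPU architecture of the host
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package missing, or registered under a different distribution name.
            info[name] = "not installed"
    return info


if __name__ == "__main__":
    for key, value in collect_versions().items():
        print(f"{key}: {value}")
```

The script reads package metadata without importing `torch` or `torch_npu`, so it should still run in an environment where importing them fails.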
Reproduction
Command: llamafactory-cli webui
Error output:
[W VariableFallbackKernel.cpp:51] Warning: CAUTION: The operator 'aten::isin.Tensor_Tensor_out' is not currently supported on the NPU backend and will fall back to run on the CPU. This may have performance implications. (function npu_cpu_fallback)
[E OpParamMaker.cpp:273] call failed, detail:EZ9999: Inner Error!
[ERROR] 2024-07-15-09:43:29 (PID:2746749, Device:2, RankID:-1) ERR01005 OPS internal error
Exception raised from operator() at third_party/op-plugin/op_plugin/ops/base_ops/opapi/AdaptiveAvgPool2dBackwardKernelNpuOpApi.cpp:532 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x68 (0xfffed936d898 in /root/miniconda3/envs/env_llm_mixed_test/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const, char const, unsigned int, std::string const&) + 0x6c (0xfffed93262a8 in /root/miniconda3/envs/env_llm_mixed_test/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #2: + 0x82098c (0xfffd87df098c in /root/miniconda3/envs/env_llm_mixed_test/lib/python3.9/site-packages/torch_npu/lib/libtorch_npu.so)
frame #3: + 0xe28ea0 (0xfffd883f8ea0 in /root/miniconda3/envs/env_llm_mixed_test/lib/python3.9/site-packages/torch_npu/lib/libtorch_npu.so)
frame #4: + 0x56a7a0 (0xfffd87b3a7a0 in /root/miniconda3/envs/env_llm_mixed_test/lib/python3.9/site-packages/torch_npu/lib/libtorch_npu.so)
frame #5: + 0x56abc8 (0xfffd87b3abc8 in /root/miniconda3/envs/env_llm_mixed_test/lib/python3.9/site-packages/torch_npu/lib/libtorch_npu.so)
frame #6: + 0x568aa0 (0xfffd87b38aa0 in /root/miniconda3/envs/env_llm_mixed_test/lib/python3.9/site-packages/torch_npu/lib/libtorch_npu.so)
frame #7: + 0x946ec (0xfffed93946ec in /root/miniconda3/envs/env_llm_mixed_test/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #8: + 0x7d5c8 (0xffff7f4cd5c8 in /lib/aarch64-linux-gnu/libc.so.6)
frame #9: + 0xe5edc (0xffff7f535edc in /lib/aarch64-linux-gnu/libc.so.6)
Traceback (most recent call last):
File "/root/miniconda3/envs/env_llm_mixed_test/lib/python3.9/site-packages/gradio/queueing.py", line 536, in process_events
File "/root/miniconda3/envs/env_llm_mixed_test/lib/python3.9/site-packages/gradio/route_utils.py", line 276, in call_process_api
File "/root/miniconda3/envs/env_llm_mixed_test/lib/python3.9/site-packages/gradio/blocks.py", line 1897, in process_api
File "/root/miniconda3/envs/env_llm_mixed_test/lib/python3.9/site-packages/gradio/blocks.py", line 1483, in call_function
File "/root/miniconda3/envs/env_llm_mixed_test/lib/python3.9/site-packages/anyio/to_thread.py", line 56, in run_sync
File "/root/miniconda3/envs/env_llm_mixed_test/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 2177, in run_sync_in_worker_thread
File "/root/miniconda3/envs/env_llm_mixed_test/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 859, in run
File "/root/miniconda3/envs/env_llm_mixed_test/lib/python3.9/site-packages/gradio/utils.py", line 816, in wrapper
File "/home/LLM/llm_projects/BELLE/train/src/entry_point/interface.py", line 71, in evaluate
File "/root/miniconda3/envs/env_llm_mixed_test/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
File "/root/miniconda3/envs/env_llm_mixed_test/lib/python3.9/site-packages/transformers/generation/utils.py", line 1914, in generate
File "/root/miniconda3/envs/env_llm_mixed_test/lib/python3.9/site-packages/transformers/generation/utils.py", line 2666, in _sample
File "/root/miniconda3/envs/env_llm_mixed_test/lib/python3.9/site-packages/transformers/generation/logits_process.py", line 98, in call
File "/root/miniconda3/envs/env_llm_mixed_test/lib/python3.9/site-packages/transformers/generation/logits_process.py", line 157, in call
RuntimeError: The Inner error is reported as above.
Since the operator is called asynchronously, the stacktrace may be inaccurate. If you want to get the accurate stacktrace, pleace set the environment variable ASCEND_LAUNCH_BLOCKING=1.
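The last line of the log suggests setting `ASCEND_LAUNCH_BLOCKING=1` to get an accurate stack trace. One way to do that for a single run, sketched below, is to start the same command with the variable added to its environment. `blocking_env` and `rerun_webui` are hypothetical helper names; the only things taken from this report are the variable name and the `llamafactory-cli webui` command.

```python
import os
import subprocess


def blocking_env(base=None):
    """Copy an environment mapping and request synchronous operator launches."""
    env = dict(os.environ if base is None else base)
    # Variable name taken from the hint printed at the end of the log.
    env["ASCEND_LAUNCH_BLOCKING"] = "1"
    return env


def rerun_webui():
    """Re-run the command from this report with the debugging variable set."""
    # check=False: the run is expected to fail again; the goal is a better trace.
    return subprocess.run(["llamafactory-cli", "webui"], env=blocking_env(), check=False)
```

Calling `rerun_webui()` and triggering the same generation request should, if the hint in the log holds, produce a traceback that points at the operator that actually failed. Synchronous launches are likely to be slower, so this is meant for debugging only.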
Expected behavior
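Has anyone else run into this problem? Any help looking into it would be much appreciated. If there is a chat group for Ascend server users, please share the QR code with me.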
Others
No response