hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0
34.21k stars, 4.21k forks

Ascend 910B inference with the baichuan2-13B model fails: The operator 'aten::isin.Tensor_Tensor_out' is not currently supported on the NPU backend and will [Pending] #4836

Closed. fuqiang-benz closed this issue 3 months ago

fuqiang-benz commented 4 months ago

Reminder

System Info

Hardware: Atlas 800T A2 training server

Driver: Ascend-hdk-910b-npu-driver_23.0.3_linux-aarch64.run

Firmware: Ascend-hdk-910b-npu-firmware_7.1.0.5.220.run

CANN: Ascend-cann-toolkit_8.0.RC2.alpha003_linux-aarch64.run

Binary operator kernels: Ascend-cann-kernels-910b_8.0.RC2.alpha003_linux.run

Python: 3.9

torch: 2.1.0 (aarch64 build)

torch_npu: 2.1.0 (aarch64 build)

Reproduction

Command: llamafactory-cli webui

The error output is as follows:

[W VariableFallbackKernel.cpp:51] Warning: CAUTION: The operator 'aten::isin.Tensor_Tensor_out' is not currently supported on the NPU backend and will fall back to run on the CPU. This may have performance implications. (function npu_cpu_fallback)

[E OpParamMaker.cpp:273] call failed, detail:EZ9999: Inner Error!

EZ9999: 2024-07-15-09:43:29.442.049 The error from device(chipId:2, dieId:0), serial number is 7, there is an aivec error exception, core id is 28, error code = 0x800000, dump info: pc start: 0x124200534688, current: 0x1242005348c4, vec error info: 0xf105ff7781, mte error info: 0xd1030000cb, ifu error info: 0x57d3ae946edc0, ccu error info: 0x13f7e72e7cfa86f1, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x12424040e400.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1207]

    TraceBack (most recent call last): 

    The extend info: errcode:(0x800000, 0, 0) errorStr: The DDR address of the MTE instruction is out of range. fixp_error0 info: 0x30000cb, fixp_error1 info: 0xd1 fsmId:0, tslot:0, thread:0, ctxid:0, blk:3, sublk:0, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1219] 

    The error from device(chipId:2, dieId:0), serial number is 7, there is an aivec error exception, core id is 29, error code = 0x800000, dump info: pc start: 0x124200534688, current: 0x1242005348c4, vec error info: 0xd10bd76ce4, mte error info: 0xd1030000cb, ifu error info: 0x3e9cf7fffec80, ccu error info: 0x824aec4c5d357479, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x12424040e400.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1207] 

    The extend info: errcode:(0x800000, 0, 0) errorStr: The DDR address of the MTE instruction is out of range. fixp_error0 info: 0x30000cb, fixp_error1 info: 0xd1 fsmId:0, tslot:0, thread:0, ctxid:0, blk:4, sublk:0, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1219] 

    The error from device(chipId:2, dieId:0), serial number is 7, there is an aivec error exception, core id is 25, error code = 0x800000, dump info: pc start: 0x124200534688, current: 0x1242005348c4, vec error info: 0xdf003c17b4, mte error info: 0xd1030000cb, ifu error info: 0x5a26852088580, ccu error info: 0x71cfe4fc7d3621d6, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x12424040e400.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1207] 

    The extend info: errcode:(0x800000, 0, 0) errorStr: The DDR address of the MTE instruction is out of range. fixp_error0 info: 0x30000cb, fixp_error1 info: 0xd1 fsmId:1, tslot:0, thread:0, ctxid:0, blk:0, sublk:0, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1219] 

    The error from device(chipId:2, dieId:0), serial number is 7, there is an aivec error exception, core id is 26, error code = 0x800000, dump info: pc start: 0x124200534688, current: 0x1242005348c4, vec error info: 0xf11035bdae, mte error info: 0xd1030000cb, ifu error info: 0x7860306fb8400, ccu error info: 0x66e3882017adb063, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x12424040e400.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1207] 

    The extend info: errcode:(0x800000, 0, 0) errorStr: The DDR address of the MTE instruction is out of range. fixp_error0 info: 0x30000cb, fixp_error1 info: 0xd1 fsmId:1, tslot:0, thread:0, ctxid:0, blk:1, sublk:0, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1219] 

    The error from device(chipId:2, dieId:0), serial number is 7, there is an aivec error exception, core id is 27, error code = 0x800000, dump info: pc start: 0x124200534688, current: 0x1242005348c4, vec error info: 0x3614d67886, mte error info: 0xd1030000cb, ifu error info: 0x704b2712ff0c0, ccu error info: 0x9da618017c750b95, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x12424040e400.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1207] 

    The extend info: errcode:(0x800000, 0, 0) errorStr: The DDR address of the MTE instruction is out of range. fixp_error0 info: 0x30000cb, fixp_error1 info: 0xd1 fsmId:1, tslot:0, thread:0, ctxid:0, blk:2, sublk:0, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1219] 

    The error from device(chipId:2, dieId:0), serial number is 7, there is an aivec error exception, core id is 30, error code = 0x800000, dump info: pc start: 0x124200534688, current: 0x1242005348c4, vec error info: 0x591a5ca30a, mte error info: 0xd1030000cb, ifu error info: 0x756aef2ab0fc0, ccu error info: 0x218eab3c508d485a, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x12424040e400.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1207] 

    The extend info: errcode:(0x800000, 0, 0) errorStr: The DDR address of the MTE instruction is out of range. fixp_error0 info: 0x30000cb, fixp_error1 info: 0xd1 fsmId:1, tslot:0, thread:0, ctxid:0, blk:5, sublk:0, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1219] 

    Kernel task happen error, retCode=0x31, [vector core exception].[FUNC:PreCheckTaskErr][FILE:davinic_kernel_task.cc][LINE:1201] 

    AIV Kernel happen error, retCode=0x31.[FUNC:GetError][FILE:stream.cc][LINE:1082] 

    Aicore kernel execute failed, device_id=2, stream_id=2, report_stream_id=2, task_id=57, flip_num=0, fault kernel_name=Maximum_ee98c6628030785f610b924ab1557b31_high_performance_223000000, fault kernel info ext=none, program id=34, hash=9727548023731030879.[FUNC:GetError][FILE:stream.cc][LINE:1082] 

    [AIC_INFO] after execute:args print end[FUNC:GetError][FILE:stream.cc][LINE:1082] 

    Failed to submit kernel task, retCode=0x715005e.[FUNC:LaunchKernelSubmit][FILE:context.cc][LINE:675] 

    kernel launch submit failed.[FUNC:LaunchKernelWithHandle][FILE:context.cc][LINE:891] 

    rtKernelLaunchWithHandleV2 execute failed, reason=[vector core exception][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53] 

    rtKernelLaunchWithHandleV2 failed: 507035 

[ERROR] 2024-07-15-09:43:29 (PID:2746749, Device:2, RankID:-1) ERR01005 OPS internal error

Exception raised from operator() at third_party/op-plugin/op_plugin/ops/base_ops/opapi/AdaptiveAvgPool2dBackwardKernelNpuOpApi.cpp:532 (most recent call first):

frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x68 (0xfffed936d898 in /root/miniconda3/envs/env_llm_mixed_test/lib/python3.9/site-packages/torch/lib/libc10.so)

frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x6c (0xfffed93262a8 in /root/miniconda3/envs/env_llm_mixed_test/lib/python3.9/site-packages/torch/lib/libc10.so)

frame #2: <unknown function> + 0x82098c (0xfffd87df098c in /root/miniconda3/envs/env_llm_mixed_test/lib/python3.9/site-packages/torch_npu/lib/libtorch_npu.so)

frame #3: <unknown function> + 0xe28ea0 (0xfffd883f8ea0 in /root/miniconda3/envs/env_llm_mixed_test/lib/python3.9/site-packages/torch_npu/lib/libtorch_npu.so)

frame #4: <unknown function> + 0x56a7a0 (0xfffd87b3a7a0 in /root/miniconda3/envs/env_llm_mixed_test/lib/python3.9/site-packages/torch_npu/lib/libtorch_npu.so)

frame #5: <unknown function> + 0x56abc8 (0xfffd87b3abc8 in /root/miniconda3/envs/env_llm_mixed_test/lib/python3.9/site-packages/torch_npu/lib/libtorch_npu.so)

frame #6: <unknown function> + 0x568aa0 (0xfffd87b38aa0 in /root/miniconda3/envs/env_llm_mixed_test/lib/python3.9/site-packages/torch_npu/lib/libtorch_npu.so)

frame #7: <unknown function> + 0x946ec (0xfffed93946ec in /root/miniconda3/envs/env_llm_mixed_test/lib/python3.9/site-packages/torch/lib/libc10.so)

frame #8: <unknown function> + 0x7d5c8 (0xffff7f4cd5c8 in /lib/aarch64-linux-gnu/libc.so.6)

frame #9: <unknown function> + 0xe5edc (0xffff7f535edc in /lib/aarch64-linux-gnu/libc.so.6)

Traceback (most recent call last):

File "/root/miniconda3/envs/env_llm_mixed_test/lib/python3.9/site-packages/gradio/queueing.py", line 536, in process_events

response = await route_utils.call_process_api( 

File "/root/miniconda3/envs/env_llm_mixed_test/lib/python3.9/site-packages/gradio/route_utils.py", line 276, in call_process_api

output = await app.get_blocks().process_api( 

File "/root/miniconda3/envs/env_llm_mixed_test/lib/python3.9/site-packages/gradio/blocks.py", line 1897, in process_api

result = await self.call_function( 

File "/root/miniconda3/envs/env_llm_mixed_test/lib/python3.9/site-packages/gradio/blocks.py", line 1483, in call_function

prediction = await anyio.to_thread.run_sync( 

File "/root/miniconda3/envs/env_llm_mixed_test/lib/python3.9/site-packages/anyio/to_thread.py", line 56, in run_sync

return await get_async_backend().run_sync_in_worker_thread( 

File "/root/miniconda3/envs/env_llm_mixed_test/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 2177, in run_sync_in_worker_thread

return await future 

File "/root/miniconda3/envs/env_llm_mixed_test/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 859, in run

result = context.run(func, *args) 

File "/root/miniconda3/envs/env_llm_mixed_test/lib/python3.9/site-packages/gradio/utils.py", line 816, in wrapper

response = f(*args, **kwargs) 

File "/home/LLM/llm_projects/BELLE/train/src/entry_point/interface.py", line 71, in evaluate

generation_output = model.generate( 

File "/root/miniconda3/envs/env_llm_mixed_test/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context

return func(*args, **kwargs) 

File "/root/miniconda3/envs/env_llm_mixed_test/lib/python3.9/site-packages/transformers/generation/utils.py", line 1914, in generate

result = self._sample( 

File "/root/miniconda3/envs/env_llm_mixed_test/lib/python3.9/site-packages/transformers/generation/utils.py", line 2666, in _sample

next_token_scores = logits_processor(input_ids, next_token_logits) 

File "/root/miniconda3/envs/env_llm_mixed_test/lib/python3.9/site-packages/transformers/generation/logits_process.py", line 98, in __call__

scores = processor(input_ids, scores) 

File "/root/miniconda3/envs/env_llm_mixed_test/lib/python3.9/site-packages/transformers/generation/logits_process.py", line 157, in __call__

eos_token_mask = torch.isin(vocab_tensor, self.eos_token_id) 

RuntimeError: The Inner error is reported as above.

Since the operator is called asynchronously, the stacktrace may be inaccurate. If you want to get the accurate stacktrace, pleace set the environment variable ASCEND_LAUNCH_BLOCKING=1.
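For anyone reproducing this, the note at the end of the log suggests rerunning with ASCEND_LAUNCH_BLOCKING=1 so that the Python stack trace points at the operator that actually failed. A minimal sketch of one way to do that from Python follows. It assumes the variable has to be in the environment before torch and torch_npu are imported, which is not confirmed anywhere in this thread, and `enable_blocking_launch` is a hypothetical helper name, not part of torch_npu or LLaMA-Factory.

```python
import os


def enable_blocking_launch(env=os.environ):
    """Request synchronous operator launches on Ascend NPUs.

    Hypothetical helper: it only sets the environment variable named in
    the error message above. Assumption: it must run before torch and
    torch_npu are imported for the setting to take effect.
    """
    env["ASCEND_LAUNCH_BLOCKING"] = "1"
    return env["ASCEND_LAUNCH_BLOCKING"]


# Call this first, then import torch / torch_npu and rerun the failing step.
enable_blocking_launch()
```

When launching through the CLI, exporting the variable in the shell before running `llamafactory-cli webui` should have the same effect. Synchronous launches are likely to be slower, so this is intended for debugging only.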

Expected behavior

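Has anyone else run into this problem? Any help looking into it would be much appreciated. If there is a chat group for Ascend server users, please leave me a QR code.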

Others

No response

codemayq commented 3 months ago

[Image: wechat_npu]

linzm1007 commented 3 months ago

The QR code has expired.

yimuu commented 2 months ago

How was this problem resolved?

liuyijiang1994 commented 1 month ago

How was this problem resolved?

leoneyar commented 1 week ago

I ran into this as well. Has anyone managed to solve it?