Closed: zkyseu closed this 8 months ago
Install the dependent libraries: CANN, torch, torch_npu, and DeepSpeed. Then add import torch_npu
at the beginning of the model script; all other operations are the same as on other accelerators.
For more information, see: https://www.hiascend.com/zh/document
Can you provide more info about your environment?
First, make sure torch_npu itself runs successfully, for example:
>>> import torch
>>> import torch_npu
>>> a = torch.tensor([1])
>>> a
tensor([1])
>>> a.to('npu:0')
tensor([1], device='npu:0')
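To answer the "more info about your environment" question in one step, the installed package versions can be collected with a short script. This is a minimal sketch using only the Python standard library; the helper name collect_versions is hypothetical, and the distribution names torch, torch_npu, and deepspeed are assumed to match how the packages were installed. It does not report the CANN version, which is not a Python package.

```python
from importlib import metadata

# Distribution names are an assumption; adjust them if your install differs.
PACKAGES = ("torch", "torch_npu", "deepspeed")


def collect_versions(packages=PACKAGES):
    """Return {package name: installed version or 'not installed'}."""
    report = {}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


# Print one line per package so the result can be pasted into an issue.
for name, version in collect_versions().items():
    print(f"{name}: {version}")
```

Pasting this output together with the npu-smi output and the CANN version gives maintainers most of what they need to compare versions.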
Hi, I have installed deepspeed==0.9.2 and torch_npu, but when I run BingBertSquad I get the following error:
EZ9999: Inner Error!
EZ9999 Kernel task happen error, retCode=0x26, [aicore exception].[FUNC:PreCheckTaskErr][FILE:task_info.cc][LINE:1518]
TraceBack (most recent call last):
The error from device(2), serial number is 37, there is an aicore error, core id is 0, error code = 0x10, dump info: pc start: 0x1000124080280000, current: 0x124080280250, vec error info: 0x159fdfdf, mte error info: 0xa5, ifu error info: 0x27b7f3f75a400, ccu error info: 0xffdbd1ff00609e8d, cube error info: 0x72, biu error info: 0, aic error mask: 0x65000200d000288, para base: 0x1240c0411800, errorStr: Illegal instruction, which is usually caused by unaligned UUB addresses.[FUNC:PrintCoreErrorInfo][FILE:device_error_proc.cc][LINE:523]
The extend info from device(2), serial number is 37, there is aicore error, core id is 0, aicore int: 0x1, aicore error2: 0, axi clamp ctrl: 0, axi clamp state: 0x1717, biu status0: 0x101e44800000000, biu status1: 0x940002092a0000, clk gate mask: 0x1000, dbg addr: 0, ecc en: 0, mte ccu ecc 1bit error: 0, vector cube ecc 1bit error: 0, run stall: 0x1, dbg data0: 0, dbg data1: 0, dbg data2: 0, dbg data3: 0, dfx data: 0[FUNC:PrintCoreErrorInfo][FILE:device_error_proc.cc][LINE:554]
The device(2), core list[0-0], error code is:[FUNC:PrintCoreInfoErrMsg][FILE:device_error_proc.cc][LINE:577]
coreId( 0): 0x10 [FUNC:PrintCoreInfoErrMsg][FILE:device_error_proc.cc][LINE:591]
Aicore kernel execute failed, device_id=0, stream_id=13, report_stream_id=13, task_id=2118, flip_num=0, fault kernel_name=53/69_-1_53_NLLLoss_tvmbin, program id=69, hash=690753179031245533.[FUNC:GetError][FILE:stream.cc][LINE:1418]
[AIC_INFO] after execute:args print end[FUNC:GetError][FILE:stream.cc][LINE:1418]
rtStreamSynchronizeWithTimeout execute failed, reason=[aicore exception][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:50]
synchronize stream failed, runtime result = 507015[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
DEVICE[0] PID[33781]:
EXCEPTION STREAM:
Exception info:TGID=33781, model id=65535, stream id=13, stream phase=3
Message info[0]:RTS_HWTS: aicore exception, slot_id=36, stream_id=13
Other info[0]:time=2024-01-04-03:57:13.963.085, function=int_process_hwts_task_exception, line=1704, error code=0x26
Iteration: 0%| | 0/29324 [00:11<?, ?it/s]
Epoch: 0%| | 0/2 [00:11<?, ?it/s]
Traceback (most recent call last):
File "nvidia_run_squad_deepspeed.py", line 1169, in <module>
main()
File "nvidia_run_squad_deepspeed.py", line 1023, in main
1 - args.loss_plot_alpha) * loss.item()
RuntimeError: ACL stream synchronize failed.
How can I solve this problem? My device information is
+------------------------------------------------------------------------------------------------+
| npu-smi 23.0.rc2 Version: 23.0.rc2 |
+---------------------------+---------------+----------------------------------------------------+
| NPU Name | Health | Power(W) Temp(C) Hugepages-Usage(page)|
| Chip | Bus-Id | AICore(%) Memory-Usage(MB) HBM-Usage(MB) |
+===========================+===============+====================================================+
| 2 910B | OK | 62.8 41 0 / 0 |
| 0 | 0000:01:00.0 | 0 2363 / 15039 3 / 32768 |
+===========================+===============+====================================================+
+---------------------------+---------------+----------------------------------------------------+
| NPU Chip | Process id | Process name | Process memory(MB) |
+===========================+===============+====================================================+
| No running processes found in NPU 2 |
+===========================+===============+====================================================+
CANN version is 5.0RC2 and torch==1.11
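When triaging a log like the one above, two fields are usually the most useful: the fault kernel_name and the errorStr. The sketch below pulls both out with regular expressions. The function name summarize_aicore_log is hypothetical, and the field layout is assumed from the log in this thread only; other CANN versions may format these lines differently.

```python
import re


def summarize_aicore_log(log_text):
    """Extract the failing kernel name and error string from an aicore log."""
    # Pattern follows the "fault kernel_name=..." field seen in this thread.
    kernel = re.search(r"fault kernel_name=([^,\s]+)", log_text)
    # The error text runs until the trailing [FUNC:...] tag.
    error = re.search(r"errorStr:\s*([^\[]+)", log_text)
    return {
        "kernel_name": kernel.group(1) if kernel else None,
        "error": error.group(1).strip() if error else None,
    }


# Two lines copied from the log above.
sample = (
    "errorStr: Illegal instruction, which is usually caused by unaligned UUB "
    "addresses.[FUNC:PrintCoreErrorInfo][FILE:device_error_proc.cc][LINE:523]\n"
    "Aicore kernel execute failed, device_id=0, stream_id=13, "
    "report_stream_id=13, task_id=2118, flip_num=0, "
    "fault kernel_name=53/69_-1_53_NLLLoss_tvmbin, program id=69"
)
print(summarize_aicore_log(sample))
```

In this log the kernel name contains NLLLoss, which suggests the failure was raised while executing the loss operator; the Python traceback only surfaces it later, at loss.item(), when the stream is synchronized.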
It looks like your CANN version is too old: 5.0 is two years old. Try updating CANN.
@CurryRice233 Hi, I have updated CANN to 6.3RC2, but the error still occurs.
I followed the deepspeed_npu instructions to install deepspeed and deepspeed_npu.
@CurryRice233 Could you help me solve this problem? I only added import torch_npu at the beginning of the code.
The newest version is 7.0; try it: https://www.hiascend.com/developer/download
If the same problem persists, contact Ascend Technical Support.
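The version strings mentioned in this thread (5.0RC2, 6.3RC2, 7.0) can be compared programmatically, for example to fail early in a launch script. This is a rough sketch: the function name parse_cann_version is hypothetical, the accepted formats are assumed from the strings quoted here plus the dotted form such as 6.3.RC2, and release candidates are assumed to sort before the final release of the same version.

```python
import re

# Accepts forms like "5.0RC2", "6.3.RC2", "7.0", "7.0.0" (assumed formats).
_CANN_RE = re.compile(r"^(\d+)\.(\d+)(?:\.(\d+))?(?:\.?RC(\d+))?$", re.IGNORECASE)


def parse_cann_version(text):
    """Turn a CANN version string into a tuple that sorts chronologically."""
    match = _CANN_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized CANN version: {text!r}")
    major, minor, patch, rc = match.groups()
    # Final releases (no RC suffix) sort after release candidates.
    is_final = 1 if rc is None else 0
    return (int(major), int(minor), int(patch or 0), is_final, int(rc or 0))


versions = ["7.0", "5.0RC2", "6.3RC2"]
print(sorted(versions, key=parse_cann_version))
```

A script could then compare parse_cann_version(installed) against a chosen minimum and stop with a clear message instead of failing later with an aicore exception.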
Hi, how can I train my model on Ascend NPU? Could you provide me with some examples? Thanks!