microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

Train model with NPU #4892

Closed. zkyseu closed this issue 8 months ago.

zkyseu commented 8 months ago

Hi, how can I train my model on an Ascend NPU? Could you provide some examples? Thanks!

CurryRice233 commented 8 months ago

Install the required libraries: CANN, torch, torch_npu, and DeepSpeed. Then add import torch_npu at the top of your model script; everything else works the same as on other accelerators.

For more information, see: https://www.hiascend.com/zh/document
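As a minimal sketch of the step described above, the script can import torch_npu once near the top and then pick the device string it will pass to .to(...). The helper name pick_device is hypothetical, and the fallback to "cpu" is an assumption made here so that the same script still starts on a machine without an NPU.

```python
import importlib.util


def pick_device() -> str:
    """Return "npu:0" when torch_npu is usable, otherwise "cpu" (hypothetical helper)."""
    # Skip the imports entirely if either package is not installed.
    for name in ("torch", "torch_npu"):
        if importlib.util.find_spec(name) is None:
            return "cpu"
    try:
        import torch
        import torch_npu  # noqa: F401  (importing it registers the "npu" device with torch)
    except (ImportError, OSError):
        # Installed but not loadable, for example when the CANN runtime libraries are missing.
        return "cpu"
    return "npu:0" if torch.npu.is_available() else "cpu"


if __name__ == "__main__":
    print("training device:", pick_device())
```

The returned string can then be used wherever the script would otherwise hard-code a device, for example model.to(pick_device()).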

minchao-sun commented 8 months ago

Can you provide more info about your environment?

First, make sure torch_npu itself runs successfully, for example:

>>> import torch
>>> import torch_npu
>>> a = torch.tensor([1])
>>> a
tensor([1])
>>> a.to('npu:0')
tensor([1], device='npu:0')
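
When answering a request for environment details like the one above, the installed package versions can be gathered in one step. The following is a sketch only: collect_versions is a hypothetical helper, and the package names are assumed to match the names they were installed under. CANN is not a Python package, so it is not covered here; its version has to be reported separately.

```python
import platform
from importlib import metadata


def collect_versions(packages=("torch", "torch_npu", "deepspeed")):
    """Return a dict of package name -> installed version (hypothetical helper)."""
    report = {"python": platform.python_version()}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Keep the entry so a missing package is visible in the report.
            report[name] = "not installed"
    return report


if __name__ == "__main__":
    for name, version in collect_versions().items():
        print(f"{name}: {version}")
```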

zkyseu commented 8 months ago

Hi, I have installed deepspeed==0.9.2 and torch_npu, but when I run the BingBertSquad example I get the following error:

EZ9999: Inner Error!
EZ9999  Kernel task happen error, retCode=0x26, [aicore exception].[FUNC:PreCheckTaskErr][FILE:task_info.cc][LINE:1518]
        TraceBack (most recent call last):
        The error from device(2), serial number is 37, there is an aicore error, core id is 0, error code = 0x10, dump info: pc start: 0x1000124080280000, current: 0x124080280250, vec error info: 0x159fdfdf, mte error info: 0xa5, ifu error info: 0x27b7f3f75a400, ccu error info: 0xffdbd1ff00609e8d, cube error info: 0x72, biu error info: 0, aic error mask: 0x65000200d000288, para base: 0x1240c0411800, errorStr: Illegal instruction, which is usually caused by unaligned UUB addresses.[FUNC:PrintCoreErrorInfo][FILE:device_error_proc.cc][LINE:523]
        The extend info from device(2), serial number is 37, there is aicore error, core id is 0, aicore int: 0x1, aicore error2: 0, axi clamp ctrl: 0, axi clamp state: 0x1717, biu status0: 0x101e44800000000, biu status1: 0x940002092a0000, clk gate mask: 0x1000, dbg addr: 0, ecc en: 0, mte ccu ecc 1bit error: 0, vector cube ecc 1bit error: 0, run stall: 0x1, dbg data0: 0, dbg data1: 0, dbg data2: 0, dbg data3: 0, dfx data: 0[FUNC:PrintCoreErrorInfo][FILE:device_error_proc.cc][LINE:554]
        The device(2), core list[0-0], error code is:[FUNC:PrintCoreInfoErrMsg][FILE:device_error_proc.cc][LINE:577]
        coreId( 0):            0x10    [FUNC:PrintCoreInfoErrMsg][FILE:device_error_proc.cc][LINE:591]
        Aicore kernel execute failed, device_id=0, stream_id=13, report_stream_id=13, task_id=2118, flip_num=0, fault kernel_name=53/69_-1_53_NLLLoss_tvmbin, program id=69, hash=690753179031245533.[FUNC:GetError][FILE:stream.cc][LINE:1418]
        [AIC_INFO] after execute:args print end[FUNC:GetError][FILE:stream.cc][LINE:1418]
        rtStreamSynchronizeWithTimeout execute failed, reason=[aicore exception][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:50]
        synchronize stream failed, runtime result = 507015[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]

DEVICE[0] PID[33781]: 
EXCEPTION STREAM:
  Exception info:TGID=33781, model id=65535, stream id=13, stream phase=3
  Message info[0]:RTS_HWTS: aicore exception, slot_id=36, stream_id=13
    Other info[0]:time=2024-01-04-03:57:13.963.085, function=int_process_hwts_task_exception, line=1704, error code=0x26
Iteration:   0%|          | 0/29324 [00:11<?, ?it/s]
Epoch:   0%|          | 0/2 [00:11<?, ?it/s]
Traceback (most recent call last):
  File "nvidia_run_squad_deepspeed.py", line 1169, in <module>
    main()
  File "nvidia_run_squad_deepspeed.py", line 1023, in main
    1 - args.loss_plot_alpha) * loss.item()
RuntimeError: ACL stream synchronize failed.

How can I solve this problem? My device information is:

+------------------------------------------------------------------------------------------------+
| npu-smi 23.0.rc2                 Version: 23.0.rc2                                             |
+---------------------------+---------------+----------------------------------------------------+
| NPU   Name                | Health        | Power(W)    Temp(C)           Hugepages-Usage(page)|
| Chip                      | Bus-Id        | AICore(%)   Memory-Usage(MB)  HBM-Usage(MB)        |
+===========================+===============+====================================================+
| 2     910B                | OK            | 62.8        41                0    / 0             |
| 0                         | 0000:01:00.0  | 0           2363 / 15039      3    / 32768         |
+===========================+===============+====================================================+
+---------------------------+---------------+----------------------------------------------------+
| NPU     Chip              | Process id    | Process name             | Process memory(MB)      |
+===========================+===============+====================================================+
| No running processes found in NPU 2                                                            |
+===========================+===============+====================================================+

The CANN version is 5.0RC2 and torch==1.11.
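
The log above names NLLLoss as the faulting kernel, and the Python traceback only points at loss.item() because that is where the stream is synchronized. One thing that could be ruled out on the host side is a target index outside the class range. This is an assumption offered as a diagnostic, not a confirmed cause in this thread. The sketch below uses plain Python lists, and find_bad_targets is a hypothetical helper.

```python
def find_bad_targets(targets, num_classes, ignore_index=-100):
    """Return (position, value) pairs for targets outside [0, num_classes).

    Values equal to ignore_index are skipped, mirroring how NLLLoss
    treats its ignore_index argument.
    """
    return [
        (position, value)
        for position, value in enumerate(targets)
        if value != ignore_index and not (0 <= value < num_classes)
    ]


if __name__ == "__main__":
    # With 4 classes, the value 5 at position 2 is out of range.
    print(find_bad_targets([0, 3, 5, -100], num_classes=4))
```

With tensors, the same check can be run on labels.tolist() before the loss is computed.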

CurryRice233 commented 8 months ago

It looks like your CANN version is too old; 5.0 is about two years old. Try updating CANN.
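
To compare CANN version strings such as the ones quoted in this thread (5.0RC2, 6.3RC2, 7.0), a small parser is enough. This is a sketch under assumptions: parse_cann_version is a hypothetical helper, not part of CANN or DeepSpeed, and the accepted format is inferred only from the strings that appear here, with a release candidate ordered before the final release of the same version.

```python
import re

# major.minor, an optional patch number, and an optional RC suffix ("RC2" or ".RC2").
_VERSION_RE = re.compile(r"(\d+)\.(\d+)(?:\.(\d+))?(?:\.?RC(\d+))?", re.IGNORECASE)


def parse_cann_version(text):
    """Turn a version string into a tuple that sorts in release order."""
    match = _VERSION_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    major, minor, patch, rc = match.groups()
    # A final release has no RC number and must sort after every RC.
    rc_rank = int(rc) if rc is not None else float("inf")
    return (int(major), int(minor), int(patch or 0), rc_rank)


if __name__ == "__main__":
    versions = ["7.0", "5.0RC2", "6.3RC2"]
    print(sorted(versions, key=parse_cann_version))
```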

zkyseu commented 8 months ago

@CurryRice233 Hi, I have updated CANN to 6.3RC2, but the error still exists.

zkyseu commented 8 months ago

I followed the deepspeed_npu instructions to install deepspeed and deepspeed_npu.

zkyseu commented 8 months ago

@CurryRice233 Could you help me solve this problem? I only added the torch_npu import at the beginning of the code.

CurryRice233 commented 8 months ago

The newest version is 7.0; try that one: https://www.hiascend.com/developer/download

If the same problem persists, contact Ascend Technical Support.