intel / intel-npu-acceleration-library

Intel® NPU Acceleration Library
Apache License 2.0

Using `intel-extension-for-pytorch` and `intel-npu-acceleration-library` #132

Open Wetitpig opened 1 month ago

Wetitpig commented 1 month ago

Describe the bug: Different errors occur when intel-extension-for-pytorch and intel-npu-acceleration-library are imported simultaneously, depending on the import order.

To Reproduce: Two different import orders produce two different failures. First, intel-extension-for-pytorch followed by intel-npu-acceleration-library:

>>> import torch, intel_extension_for_pytorch
>>> torch._C._xpu_init()
>>> import intel_npu_acceleration_library
C:\Users\*****\miniconda3\envs\*****\Lib\site-packages\intel_npu_acceleration_library\backend\__init__.py:18: UserWarning: NPU is not available in your system. Library will fallback to AUTO device selection mode

Second, intel-npu-acceleration-library followed by intel-extension-for-pytorch:

>>> import torch, intel_npu_acceleration_library
>>> import intel_extension_for_pytorch
>>> x = torch.zeros(5, device="xpu")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\*****\miniconda3\envs\*****\Lib\site-packages\intel_npu_acceleration_library\device.py", line 66, in __torch_function__
    return super_fn(*args, **kwargs or {})
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\*****\miniconda3\envs\*****\Lib\site-packages\intel_npu_acceleration_library\device.py", line 60, in super_fn
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\*****\miniconda3\envs\*****\Lib\site-packages\torch\xpu\__init__.py", line 117, in _lazy_init
    torch._C._xpu_init()
RuntimeError: Native API failed. Native API returns: -30 (PI_ERROR_INVALID_VALUE) -30 (PI_ERROR_INVALID_VALUE)

Expected behavior: Both modules can be imported and used successfully, in either order.

Additional context: Task manager shows both the GPU and the NPU.
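Since the failure depends purely on import order, here is a hedged sketch of a guard that imports whichever of the two packages is installed, in the order that (per the report above) degrades gracefully rather than crashing XPU. `safe_import` is a hypothetical helper, not part of either library, and it does not fix the underlying driver clash:

```python
import importlib.util


def safe_import(order):
    """Import the named modules in the given order, skipping any that are
    not installed. Returns the list of names actually imported.

    Illustration only: it guards against hard import failures, but the
    import-order sensitivity itself is a bug in the libraries/drivers.
    """
    loaded = []
    for name in order:
        if importlib.util.find_spec(name) is None:
            continue  # package not installed in this environment
        importlib.import_module(name)
        loaded.append(name)
    return loaded


# Per the report, loading intel_extension_for_pytorch first is the order
# that only warns (NPU falls back to AUTO) instead of breaking XPU init.
loaded = safe_import(
    ["intel_extension_for_pytorch", "intel_npu_acceleration_library"]
)
print(loaded)
```

On a machine without either package this simply prints an empty list; with both installed it loads them in the less destructive order.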

GUZZ07 commented 3 weeks ago

Same issue here. After hitting a bunch of NaNs while trying train_mnist.py, I wanted to confirm whether this appears only on the NPU, so I used model = model.to('xpu') instead of model = intel_npu_acceleration_library.compile(model, dtype=torch.float32, training=True), and then hit the same error: Native API failed. Native API returns: -30 (PI_ERROR_INVALID_VALUE) -30 (PI_ERROR_INVALID_VALUE). Commenting out import intel_npu_acceleration_library makes the code work as designed.

pioto1225 commented 2 days ago

The same problem happens with the upstream XPU PyTorch (https://pytorch.org/docs/2.5/notes/get_start_xpu.html) instead of intel-extension-for-pytorch when installed together with the Intel NPU software stack.

The actual issue seems to be a mismatch between the NPU's Level Zero provider (intel-level-zero-npu) and the Level Zero library used by PyTorch; see the ze_driver.cpp:186 error:

$ export ZE_INTEL_NPU_LOGLEVEL=INFO
$ python
Python 3.12.7 (main, Nov  8 2024, 17:55:36) [GCC 14.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.xpu.is_available()
NPU_LOG: [DRIVER][driver.cpp:91] OS interface updated
NPU_LOG: [DEVICE][os_interface.cpp:21] OS interface set.
NPU_LOG: [CACHE][disk_cache.cpp:85] Cache is initialized, path: /home/pioto/.cache/ze_intel_npu_cache, max size: 1073741824
...
NPU_LOG: [DRIVER][driver.cpp:70] Current driver init status is 0
NPU_LOG: [DRIVER][driver.cpp:70] Current driver init status is 0
NPU_LOG: [DRIVER][driver_handle.cpp:71] Driver properties returned successfully
NPU_LOG: *ERROR* [ze_driver.cpp:186] The name of extension is unknown: zexDriverImportExternalPointer
NPU_LOG: [DEVICE][device.cpp:167] Returning device properties
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/pioto/projects/ai/lib/python3.12/site-packages/torch/xpu/__init__.py", line 66, in is_available
    return device_count() > 0
           ^^^^^^^^^^^^^^
  File "/home/pioto/projects/ai/lib/python3.12/site-packages/torch/xpu/__init__.py", line 60, in device_count
    return torch._C._xpu_getDeviceCount()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Native API failed. Native API returns: -30 (PI_ERROR_INVALID_VALUE) -30 (PI_ERROR_INVALID_VALUE)
>>>
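The ze_driver.cpp:186 line suggests PyTorch's SYCL runtime queries an extension (zexDriverImportExternalPointer) that the NPU's Level Zero driver does not implement. To see how many Level Zero drivers the loader enumerates on a given machine, here is a minimal ctypes sketch against the real zeInit/zeDriverGet loader entry points; it is diagnostic only, with minimal error handling, and returns None when libze_loader is not installed:

```python
import ctypes
import ctypes.util


def list_level_zero_drivers():
    """Return the number of Level Zero drivers the loader enumerates,
    or None if the loader library (libze_loader) is not present.

    zeInit and zeDriverGet are standard Level Zero C APIs; both the GPU
    and the NPU stack register a driver here, which is why a broken NPU
    driver can take down unrelated XPU initialization.
    """
    path = ctypes.util.find_library("ze_loader")
    if path is None:
        return None  # no Level Zero loader on this system
    ze = ctypes.CDLL(path)
    if ze.zeInit(0) != 0:  # flags=0: initialize all driver types
        return None
    count = ctypes.c_uint32(0)
    # First call with a NULL driver array just fills in the count.
    if ze.zeDriverGet(ctypes.byref(count), None) != 0:
        return None
    return count.value


print(list_level_zero_drivers())
```

On an affected system one would expect to see both drivers in the count; removing intel-level-zero-npu (the workaround below) should reduce it by one.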

On the same system Openvino works fine with both the NPU and GPU:

$ export ZE_INTEL_NPU_LOGLEVEL=INFO
$ python
Python 3.12.7 (main, Nov  8 2024, 17:55:36) [GCC 14.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import openvino as ov
>>> core = ov.Core()
>>> core.available_devices
NPU_LOG: [DRIVER][driver.cpp:91] OS interface updated
NPU_LOG: [DEVICE][os_interface.cpp:21] OS interface set.
NPU_LOG: [CACHE][disk_cache.cpp:85] Cache is initialized, path: /home/pioto/.cache/ze_intel_npu_cache, max size: 1073741824
NPU_LOG: [IOCTL][vpu_driver_api.cpp:90] DRM_IOCTL_VERSION
NPU_LOG: [IOCTL][vpu_driver_api.cpp:90] DRM_IOCTL_VERSION
...
NPU_LOG: [DRIVER][ze_driver.cpp:125] Return DDI table for extension: ZE_extension_profiling_data
NPU_LOG: [DEVICE][vpu_device_context.cpp:34] VPUDeviceContext is created
NPU_LOG: [DEVICE][device.cpp:167] Returning device properties
NPU_LOG: [DEVICE][vpu_driver_api.cpp:366] Device path: /sys/dev/char/261:0
NPU_LOG: [DEVICE][vpu_driver_api.cpp:367] Device path link: ../../devices/pci0000:00/0000:00:0b.0/accel/accel0
NPU_LOG: [DEVICE][device.cpp:528] Device BDF: 0000:00:0b.0
['CPU', 'GPU.0', 'GPU.1', 'NPU']
>>>

A clumsy workaround to temporarily enable PyTorch/XPU is to uninstall the intel-level-zero-npu package:

$ sudo apt remove intel-level-zero-npu
...
$ python
Python 3.12.7 (main, Nov  8 2024, 17:55:36) [GCC 14.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.xpu.is_available()
True
>>> 

However, that breaks NPU support. Reinstalling the intel-level-zero-npu package from https://github.com/intel/linux-npu-driver/releases restores the NPU (but breaks PyTorch/XPU again).

Having to install/uninstall the package to flip between the two is far from ideal, but it works. Hopefully this gets fixed soon.
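An alternative worth trying before uninstalling anything, assuming the oneAPI runtime on the system honors the standard ONEAPI_DEVICE_SELECTOR variable: restrict which devices SYCL enumerates so the NPU's Level Zero driver is never queried. Whether this avoids this particular crash is untested here; it is a sketch, not a confirmed fix:

```shell
# Limit SYCL/Level Zero enumeration to GPU devices only, so PyTorch's XPU
# backend (hopefully) never touches the NPU's Level Zero driver.
export ONEAPI_DEVICE_SELECTOR="level_zero:gpu"

# Then re-check XPU visibility as before (requires torch; shown for context):
# python -c "import torch; print(torch.xpu.is_available())"
echo "$ONEAPI_DEVICE_SELECTOR"
```

If this works, it has the advantage of being per-process: unsetting the variable restores NPU enumeration without touching any packages.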