Wetitpig opened 1 month ago
Same issue here: after hitting a bunch of NaNs when trying train_mnist.py, I wanted to confirm whether this appears only on the NPU, so I used model = model.to('xpu') instead of model = intel_npu_acceleration_library.compile(model, dtype=torch.float32, training=True), and then met the same error: Native API failed. Native API returns: -30 (PI_ERROR_INVALID_VALUE) -30 (PI_ERROR_INVALID_VALUE).
Commenting out import intel_npu_acceleration_library does make the code work as designed.
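For context, the swap looks roughly like this (a minimal sketch, not the actual train_mnist.py; the model below is just a stand-in):

import torch
from torch import nn
# import intel_npu_acceleration_library  # with this import commented out, the XPU path runs as designed

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))  # stand-in for the real model

# NPU path from train_mnist.py (produces NaNs for me):
# model = intel_npu_acceleration_library.compile(model, dtype=torch.float32, training=True)

# XPU path used to check whether the NaNs are NPU-specific:
model = model.to('xpu')  # hits the PI_ERROR_INVALID_VALUE error while both stacks are installed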
Platform info
The same problem happens with upstream XPU PyTorch (https://pytorch.org/docs/2.5/notes/get_start_xpu.html) instead of intel-extension-for-pytorch when it is installed together with the Intel NPU software stack.
The actual issue seems to be a mismatch between the NPU's Level Zero provider (intel-level-zero-npu) and the Level Zero library used by PyTorch; see the ze_driver.cpp:186 error below:
$ export ZE_INTEL_NPU_LOGLEVEL=INFO
$ python
Python 3.12.7 (main, Nov 8 2024, 17:55:36) [GCC 14.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.xpu.is_available()
NPU_LOG: [DRIVER][driver.cpp:91] OS interface updated
NPU_LOG: [DEVICE][os_interface.cpp:21] OS interface set.
NPU_LOG: [CACHE][disk_cache.cpp:85] Cache is initialized, path: /home/pioto/.cache/ze_intel_npu_cache, max size: 1073741824
...
NPU_LOG: [DRIVER][driver.cpp:70] Current driver init status is 0
NPU_LOG: [DRIVER][driver.cpp:70] Current driver init status is 0
NPU_LOG: [DRIVER][driver_handle.cpp:71] Driver properties returned successfully
NPU_LOG: *ERROR* [ze_driver.cpp:186] The name of extension is unknown: zexDriverImportExternalPointer
NPU_LOG: [DEVICE][device.cpp:167] Returning device properties
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/pioto/projects/ai/lib/python3.12/site-packages/torch/xpu/__init__.py", line 66, in is_available
return device_count() > 0
^^^^^^^^^^^^^^
File "/home/pioto/projects/ai/lib/python3.12/site-packages/torch/xpu/__init__.py", line 60, in device_count
return torch._C._xpu_getDeviceCount()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Native API failed. Native API returns: -30 (PI_ERROR_INVALID_VALUE) -30 (PI_ERROR_INVALID_VALUE)
>>>
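Until the driver mismatch is resolved, scripts that merely probe for an XPU can fail soft instead of crashing, because the failed enumeration surfaces as a RuntimeError. A defensive sketch, not a fix for the underlying issue:

import torch

def xpu_usable() -> bool:
    # torch.xpu.is_available() raises RuntimeError (PI_ERROR_INVALID_VALUE) here
    # instead of returning False, so treat the exception as "no usable XPU".
    try:
        return torch.xpu.is_available()
    except RuntimeError:
        return False

device = torch.device('xpu' if xpu_usable() else 'cpu')
print(device)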
On the same system, OpenVINO works fine with both the NPU and the GPU:
$ export ZE_INTEL_NPU_LOGLEVEL=INFO
$ python
Python 3.12.7 (main, Nov 8 2024, 17:55:36) [GCC 14.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import openvino as ov
>>> core = ov.Core()
>>> core.available_devices
NPU_LOG: [DRIVER][driver.cpp:91] OS interface updated
NPU_LOG: [DEVICE][os_interface.cpp:21] OS interface set.
NPU_LOG: [CACHE][disk_cache.cpp:85] Cache is initialized, path: /home/pioto/.cache/ze_intel_npu_cache, max size: 1073741824
NPU_LOG: [IOCTL][vpu_driver_api.cpp:90] DRM_IOCTL_VERSION
NPU_LOG: [IOCTL][vpu_driver_api.cpp:90] DRM_IOCTL_VERSION
...
NPU_LOG: [DRIVER][ze_driver.cpp:125] Return DDI table for extension: ZE_extension_profiling_data
NPU_LOG: [DEVICE][vpu_device_context.cpp:34] VPUDeviceContext is created
NPU_LOG: [DEVICE][device.cpp:167] Returning device properties
NPU_LOG: [DEVICE][vpu_driver_api.cpp:366] Device path: /sys/dev/char/261:0
NPU_LOG: [DEVICE][vpu_driver_api.cpp:367] Device path link: ../../devices/pci0000:00/0000:00:0b.0/accel/accel0
NPU_LOG: [DEVICE][device.cpp:528] Device BDF: 0000:00:0b.0
['CPU', 'GPU.0', 'GPU.1', 'NPU']
>>>
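For completeness, the Level Zero loader can also be queried directly, bypassing both PyTorch and OpenVINO. A rough ctypes sketch (the libze_loader.so.1 name is an assumption and may differ per distribution) that only asks the loader how many drivers it sees:

import ctypes

# Load the Level Zero loader directly (library name is an assumption; adjust if needed).
ze = ctypes.CDLL("libze_loader.so.1")

rc = ze.zeInit(0)  # 0 = no flags: initialize all driver types
print("zeInit:", hex(rc & 0xFFFFFFFF))

count = ctypes.c_uint32(0)
rc = ze.zeDriverGet(ctypes.byref(count), None)  # NULL handle array -> query driver count only
print("zeDriverGet:", hex(rc & 0xFFFFFFFF), "drivers:", count.value)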
A clumsy workaround to temporarily enable PyTorch/XPU is to uninstall the intel-level-zero-npu package:
$ sudo apt remove intel-level-zero-npu
...
$ python
Python 3.12.7 (main, Nov 8 2024, 17:55:36) [GCC 14.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.xpu.is_available()
True
>>>
However, that breaks NPU support. Reinstalling the NPU package (intel-level-zero-npu) from https://github.com/intel/linux-npu-driver/releases fixes the NPU again (but breaks PyTorch/XPU).
Having to install/uninstall the package to flip between the two is not ideal, but it seems to work. Hopefully this gets fixed soon.
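In the meantime, it helps to probe both stacks up front so a script knows which configuration is currently active. A small sketch, assuming both torch and openvino are installed in the same environment:

import torch
import openvino as ov

# PyTorch/XPU side: device enumeration raises RuntimeError while
# intel-level-zero-npu is installed, so treat that as "not usable".
try:
    xpu_ok = torch.xpu.is_available()
except RuntimeError:
    xpu_ok = False

# NPU side via OpenVINO, which enumerates devices through its own plugin.
npu_ok = 'NPU' in ov.Core().available_devices

print(f"PyTorch/XPU usable: {xpu_ok} | OpenVINO sees NPU: {npu_ok}")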
Describe the bug
Different errors occur when intel-extension-for-pytorch and intel-npu-acceleration-library are imported simultaneously in different orders.

To Reproduce
2 different orders of importing packages (see the repro sketch at the end of this report):
1. intel-extension-for-pytorch followed by intel-npu-acceleration-library
2. intel-npu-acceleration-library followed by intel-extension-for-pytorch

Expected behavior
Both modules can be successfully imported and used.
Desktop (please complete the following information):
intel_npu_acceleration_library==1.3.0
intel_extension_for_pytorch==2.3.110+xpu
torch==2.3.1+cxx11.abi
Additional context
Task manager shows both GPU and NPU.
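Minimal repro sketch for the two import orders (each order needs a fresh interpreter; the exact error differs between the two, and the module names are taken from the version list above):

# Order A: intel-extension-for-pytorch first, then the NPU library
import torch
import intel_extension_for_pytorch as ipex
import intel_npu_acceleration_library

# Order B (run in a separate, fresh interpreter): NPU library first, then IPEX
# import torch
# import intel_npu_acceleration_library
# import intel_extension_for_pytorch as ipex

print(torch.__version__, ipex.__version__)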