`RuntimeError: could not create a primitive` for `torch.matmul` on Arc A730M and Arc A750 for Windows

Oscilloscope98 commented 8 months ago

Describe the bug

Machine: Arc A730M (also reproduced on Arc A750)
OS: Windows 11
Driver: 31.0.101.5081 (also reproduced with 31.0.101.5084)
oneAPI: 2024.0

Code to reproduce:

call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
python test.py

test.py:

import torch
import intel_extension_for_pytorch as ipex

tensor1 = torch.randn(1, 1, 40, 128).to('xpu')
tensor2 = torch.randn(1, 1, 128, 40).to('xpu')
print(tensor1.dtype)

torch.matmul(tensor1, tensor2).size()

Error message:

Traceback (most recent call last):
  File "D:\yuwen\test.py", line 8, in <module>
    torch.matmul(tensor1, tensor2).size()
RuntimeError: could not create a primitive

The error persists even after running `set ONEAPI_DEVICE_SELECTOR=level_zero:0` so that only the A730M is visible to the environment.

sycl-ls output on machine with A730M:

[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.10.0.17_160000]
[opencl:cpu:1] Intel(R) OpenCL, 12th Gen Intel(R) Core(TM) i7-12700H OpenCL 3.0 (Build 0) [2023.16.10.0.17_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A730M Graphics OpenCL 3.0 NEO  [31.0.101.5081]
[opencl:gpu:3] Intel(R) OpenCL Graphics, Intel(R) Iris(R) Xe Graphics OpenCL 3.0 NEO  [31.0.101.5081]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A730M Graphics 1.3 [1.3.27616]
[ext_oneapi_level_zero:gpu:1] Intel(R) Level-Zero, Intel(R) Iris(R) Xe Graphics 1.3 [1.3.27616]
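Because the Level Zero index of the Arc card matters for the device-selection attempts discussed in this thread, it can help to extract it from the `sycl-ls` listing programmatically. The following is a minimal sketch, assuming the line format shown above; `level_zero_gpus` and `find_index` are hypothetical helper names, and `SAMPLE` simply repeats the two Level Zero lines from this report.

```python
import re

# Matches lines such as:
# [ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A730M Graphics 1.3 [1.3.27616]
_LINE = re.compile(
    r"^\[(?P<backend>\w+):(?P<kind>\w+):(?P<index>\d+)\]\s+"
    r"(?P<platform>[^,]+),\s+(?P<rest>.+)$"
)


def level_zero_gpus(sycl_ls_output):
    """Return (index, description) pairs for Level Zero GPU entries."""
    found = []
    for line in sycl_ls_output.splitlines():
        m = _LINE.match(line.strip())
        if m and m["backend"].endswith("level_zero") and m["kind"] == "gpu":
            found.append((int(m["index"]), m["rest"]))
    return found


def find_index(sycl_ls_output, keyword):
    """Index of the first Level Zero GPU whose description contains keyword."""
    for index, desc in level_zero_gpus(sycl_ls_output):
        if keyword.lower() in desc.lower():
            return index
    return None


# The two Level Zero lines from the sycl-ls output in this report.
SAMPLE = """\
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A730M Graphics 1.3 [1.3.27616]
[ext_oneapi_level_zero:gpu:1] Intel(R) Level-Zero, Intel(R) Iris(R) Xe Graphics 1.3 [1.3.27616]
"""

print(find_index(SAMPLE, "Arc"))   # index of the Arc entry in the sample
print(find_index(SAMPLE, "Iris"))  # index of the Iris Xe entry in the sample
```

On this machine the Arc A730M is Level Zero device 0 and the Iris Xe iGPU is device 1, which matches the `level_zero:0` selector used above.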

Versions

PyTorch version: 2.1.0a0+cxx11.abi
PyTorch CXX11 ABI: No
IPEX version: 2.1.10+xpu
IPEX commit: a12f9f650
Build type: Release

OS: Microsoft Windows 11 Pro
GCC version: N/A
Clang version: N/A
IGC version: 2024.0.0 (2024.0.0.20231017)
CMake version: version 3.27.2-msvc1
Libc version: N/A

Python version: 3.9.18 (main, Sep 11 2023, 14:09:26) [MSC v.1916 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.22621-SP0
Is XPU available: True
DPCPP runtime version: N/A
MKL version: N/A
GPU models and configuration:
[0] _DeviceProperties(name='Intel(R) Arc(TM) A730M Graphics', platform_name='Intel(R) Level-Zero', dev_type='gpu, support_fp64=0, total_memory=11934MB, max_compute_units=384, gpu_eu_count=384)
[1] _DeviceProperties(name='Intel(R) Iris(R) Xe Graphics', platform_name='Intel(R) Level-Zero', dev_type='gpu, support_fp64=0, total_memory=14751MB, max_compute_units=96, gpu_eu_count=96)
Intel OpenCL ICD version: N/A
Level Zero version: N/A

CPU:
Architecture=9
CurrentClockSpeed=2300
DeviceID=CPU0
Family=198
L2CacheSize=7680
L2CacheSpeed=
Manufacturer=GenuineIntel
MaxClockSpeed=2300
Name=12th Gen Intel(R) Core(TM) i7-12700H
ProcessorType=3
Revision=

Versions of relevant libraries:
[pip3] intel-extension-for-pytorch==2.1.10+xpu
[pip3] numpy==1.26.3
[pip3] torch==2.1.0a0+cxx11.abi
[pip3] torchaudio==2.1.0a0+cxx11.abi
[pip3] torchvision==0.16.0a0+cxx11.abi
[conda] intel-extension-for-pytorch 2.1.10+xpu pypi_0 pypi
[conda] numpy 1.26.3 pypi_0 pypi
[conda] torch 2.1.0a0+cxx11.abi pypi_0 pypi
[conda] torchaudio 2.1.0a0+cxx11.abi pypi_0 pypi
[conda] torchvision 0.16.0a0+cxx11.abi pypi_0 pypi

jingxu10 commented 8 months ago

@ashokei @min-jean-cho FYI. issue on Windows

min-jean-cho commented 8 months ago

@Oscilloscope98, is the error reproducible with different input sizes (e.g., smaller input sizes)?

Oscilloscope98 commented 8 months ago

Hi @min-jean-cho,

The same problem occurs with much smaller inputs:

import torch
import intel_extension_for_pytorch as ipex

tensor1 = torch.randn(1, 1, 1, 2).to('xpu')
tensor2 = torch.randn(1, 1, 2, 1).to('xpu')

torch.matmul(tensor1, tensor2).size()

P.S. Tested with driver 31.0.101.5081 on the Arc A730M machine.

jingxu10 commented 6 months ago

Working on triage.

jingxu10 commented 5 months ago

Is the graphics card attached to a monitor, and is the iGPU disabled in BIOS?

Oscilloscope98 commented 5 months ago

> Is the graphics card attached to a monitor, and is the iGPU disabled in BIOS?

Hi @jingxu10,

For the Arc A730M, the machine is a NUC, and we did not disable the iGPU in BIOS.

For Arc A750, I am not sure whether the graphics card was attached to a monitor, but the iGPU was also not disabled in BIOS.

jingxu10 commented 5 months ago

We found that the card has to be attached to a monitor, and the iGPU has to be disabled, for the dGPU to work. We are still triaging this issue.

NeoZhangJianyu commented 5 months ago

@Oscilloscope98 Please use the following command to select the GPU you want. For example, to use the second GPU: export ZE_AFFINITY_MASK=1

liu-shaojun commented 5 months ago

I followed https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/install_windows_gpu.html#verify-installation to reproduce this issue, after setting

set ZE_AFFINITY_MASK=0
set ONEAPI_DEVICE_SELECTOR=level_zero:1

the same error still occurs.

Oscilloscope98 commented 5 months ago

Hi @NeoZhangJianyu,

We tried again on a Windows machine (with Intel(R) UHD Graphics 770 and Intel(R) Arc(TM) A770 Graphics available, driver 31.0.101.5382). It seems that `set ZE_AFFINITY_MASK=1` makes all XPU devices unavailable.

blaz-r commented 4 months ago

Hi, I had a similar problem on my NUC as well. When you have two GPUs (I have one Iris and one Arc), you also need to specify which one to use, like this: to("xpu:1"), since in my case the Arc was at index 1. Hope this helps.
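The index-based selection described above can be made less fragile by choosing the device by name rather than hard-coding `xpu:1`. This is only a sketch: `pick_xpu_device` is a hypothetical helper, and the name list below is typed in by hand from the device names reported in this thread rather than queried from the runtime.

```python
def pick_xpu_device(device_names, keyword="Arc"):
    """Return 'xpu:<i>' for the first device name containing keyword, else None."""
    for i, name in enumerate(device_names):
        if keyword.lower() in name.lower():
            return f"xpu:{i}"
    return None


# Hand-written example list: an Iris Xe iGPU at index 0 and an Arc dGPU at index 1.
names = ["Intel(R) Iris(R) Xe Graphics", "Intel(R) Arc(TM) A730M Graphics"]
device = pick_xpu_device(names)
print(device)  # xpu:1
```

In a real session the names would presumably come from the XPU runtime (for example via `torch.xpu.device_count()` and `torch.xpu.get_device_name(i)`, assuming the installed IPEX build exposes them), and the result would be passed to `.to(device)`.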

vpirogov commented 4 months ago

This issue has been root-caused to the GPU hardware detection logic for multi-GPU systems in oneDNN. The fix is available in oneDNN v3.4.3.

@jingxu10, @min-jean-cho, it would be awesome to have IPEX patch release with this fix.
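To check whether a given build already carries the fix, one option is to compare the oneDNN version against v3.4.3. The sketch below assumes the version appears in text of the form `oneDNN v<major>.<minor>.<patch>`, as in the info line oneDNN prints when `ONEDNN_VERBOSE=1` is set; `onednn_version` and `has_fix` are hypothetical helpers, and the sample strings are illustrative rather than taken from this thread.

```python
import re

FIXED_IN = (3, 4, 3)  # oneDNN release that contains the fix, per the comment above


def onednn_version(text):
    """Extract (major, minor, patch) from text containing 'oneDNN vX.Y.Z'."""
    m = re.search(r"oneDNN v(\d+)\.(\d+)\.(\d+)", text)
    return tuple(int(part) for part in m.groups()) if m else None


def has_fix(text):
    """True if the oneDNN version found in text is at least FIXED_IN."""
    version = onednn_version(text)
    return version is not None and version >= FIXED_IN


# Illustrative verbose-style lines (not captured from a real run).
old_line = "onednn_verbose,info,oneDNN v3.3.0 (commit <hash>)"
new_line = "onednn_verbose,info,oneDNN v3.4.3 (commit <hash>)"

print(has_fix(old_line))  # False
print(has_fix(new_line))  # True
```

Comparing integer tuples rather than strings keeps the ordering correct for versions such as 3.10.0.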

jingxu10 commented 3 months ago

yeah, WIP.

huangrui666 commented 1 month ago

> yeah, WIP.

Hi @jingxu10, I hit this issue on a Linux machine with Intel(R) UHD Graphics 770 and Intel(R) Arc(TM) A750 Graphics available. May I ask how the fix is progressing? Will it be ported to Linux as well? Thanks!

jingxu10 commented 1 month ago

Please try disabling the iGPU as a workaround for now. We are still working on the solution.

Shengqi-Kong commented 3 weeks ago

I have the same problem. It happened when I called conv2d(). At first I thought something might be wrong with the GPU, so I changed the device from 'cuda:0' to 'cpu'. Unfortunately, it still happened.

x = F.relu(self.conv1(inputs))

Traceback:

  File "/home/vic/miniconda3/envs/new3.8/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 454, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: could not create a primitive

intel / intel-extension-for-pytorch

`RuntimeError: could not create a primitive` for `torch.matmul` on Arc A730M and Arc A750 for Windows #508
