Closed ghost closed 1 year ago
I noticed you're running with Ubuntu 20.04. However, to run with Arc A770 or any Intel GPU, you need Ubuntu 22.04: https://intel.github.io/intel-extension-for-pytorch/xpu/latest/tutorials/installation.html.
I created an environment using the software versions you mentioned, and I'm able to reproduce Example 1 (issue escalated to team) but not Examples 3 or 4. Example 2 could be related to your OS.
Please update your OS and retry all your examples.
> I noticed you're running with Ubuntu 20.04. However, to run with Arc A770 or any Intel GPU, you need Ubuntu 22.04: https://intel.github.io/intel-extension-for-pytorch/xpu/latest/tutorials/installation.html.
I'm running Arch Linux (I know, it's not an officially supported distro). I already have an Ubuntu 22.04 partition with the latest oneAPI and pytorch-gpu packages. Everything works correctly there, except examples 1 and 2 (same bug?).
I opened this ticket because I used to run example 3 on 2023.0.0 on Arch Linux and it worked fine (or maybe it was the version before that one? I can't remember exactly). So it seems to me like there is some regression going on.
...
Regarding Native API errors: it would be nice if they contained more info. What I know is that PI_ERROR_DEVICE_NOT_FOUND is a faux error; my device is always available. It works with OpenAI's Whisper, and sycl-ls returns:
```
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2023.15.3.0.20_160000]
[opencl:cpu:1] Intel(R) OpenCL, AMD Ryzen 5 5600 6-Core Processor 3.0 [2023.15.3.0.20_160000]
[opencl:gpu:2] Intel(R) OpenCL HD Graphics, Intel(R) Arc(TM) A770 Graphics 3.0 [23.09.25812]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.25812]
```
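For anyone comparing their own output against this, the check reduces to "is a Level Zero GPU backend listed?". A small helper along these lines (hypothetical, not part of any Intel tooling) can parse sycl-ls output for that:

```python
def has_level_zero_gpu(sycl_ls_output: str) -> bool:
    """Return True if any listed device is a Level Zero GPU backend."""
    return any(
        line.strip().startswith("[ext_oneapi_level_zero:gpu")
        for line in sycl_ls_output.splitlines()
    )

# Sample trimmed from the output above.
sample = (
    "[opencl:gpu:2] Intel(R) OpenCL HD Graphics, Intel(R) Arc(TM) A770 Graphics 3.0 [23.09.25812]\n"
    "[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.25812]\n"
)
print(has_level_zero_gpu(sample))  # → True
```

If this returns False, IPEX's XPU backend has nothing to bind to, which matches a genuine PI_ERROR_DEVICE_NOT_FOUND rather than the faux one described here.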
I also compiled this example with dpcpp, and it found my device and worked fine.
This seems like a bug in Intel's stack somewhere related to timeout/hanging/locking.
Bottom line: it works on Ubuntu 22.04, that's for certain (minus examples 1 & 2, of course). In any case, I will wait for the next release and report back. (PI_ERROR_DEVICE_NOT_FOUND/PI_ERROR_COMMAND_EXECUTION_FAILURE appear to be llvm/dpcpp related [?]; maybe they should be transferred there if not fixed by the next release?)
@stacksmash76 This issue has the same root cause as https://github.com/intel/intel-extension-for-pytorch/issues/358. The MKL routine gesvd crashes with certain inputs. We are working with MKL team for fixing this.
@stacksmash76 Our team has triaged and fixed this issue. It will be available in the next IPEX release.
@stacksmash76 IPEX v2.0.110+xpu is now available: https://github.com/intel/intel-extension-for-pytorch/releases/tag/v2.0.110%2Bxpu
Let us know if your issue is resolved using this version.
Thanks for the heads up. Currently I'm on vacation and will be able to revisit this in 2 weeks at the earliest.
> @stacksmash76 IPEX v2.0.110+xpu is now available: https://github.com/intel/intel-extension-for-pytorch/releases/tag/v2.0.110%2Bxpu
> Let us know if your issue is resolved using this version.
I can confirm that this issue has been fixed. Thank you for your work.
Describe the bug
I've been getting native API errors on certain compute workloads. While this has been mentioned in other tickets (to some extent), I'm not sure it's a duplicate. If it is, please close.
Examples
Trying code from issue ticket #358 gives me:
RuntimeError: Native API failed. Native API returns: -5 (PI_ERROR_OUT_OF_RESOURCES) -5 (PI_ERROR_OUT_OF_RESOURCES)
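For quick triage, the numeric codes in these messages can be mapped back to PI (Plugin Interface) error names. This is a partial, hand-collected mapping covering only the codes that appear in this thread; the authoritative list lives in the DPC++ (intel/llvm) headers:

```python
# Partial mapping of PI error codes seen in this thread.
# Source of truth: the PI headers in the intel/llvm (DPC++) repository.
PI_ERRORS = {
    -1: "PI_ERROR_DEVICE_NOT_FOUND",
    -5: "PI_ERROR_OUT_OF_RESOURCES",
    -997: "PI_ERROR_PLUGIN_SPECIFIC_ERROR",  # "plugin has emitted a backend specific error"
}

def describe(code: int) -> str:
    """Map a raw PI error code to its symbolic name, if known."""
    return PI_ERRORS.get(code, f"unknown PI error ({code})")

print(describe(-5))  # → PI_ERROR_OUT_OF_RESOURCES
```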
Trying https://github.com/vladmandic/automatic (Stable Diffusion), gives me:
Please see gist for entire error message: https://gist.github.com/stacksmash76/1f75061d5749dcb9fe67f96db49a5dfe
In the dmesg output (even with hangcheck disabled), I get:
Running code from https://github.com/intel/intel-extension-for-pytorch/issues/296#issuecomment-1430940262 gives:
Running code from https://github.com/intel/intel-extension-for-pytorch/issues/296#issuecomment-1562260804 gives:
RuntimeError: Native API failed. Native API returns: -997 (The plugin has emitted a backend specific error) -997 (The plugin has emitted a backend specific error)
for the first 2 array statements, but the last 3 statements do not throw an exception. Other workloads execute successfully, though. For example, OpenAI's Whisper (large-v2 model, which is 2.9 GB) runs on the GPU, confirmed with intel_gpu_top: Blitter goes to about 70+% and [unknown] floats around 20%.
Versions
clinfo output: https://gist.github.com/stacksmash76/082a9c71a599286a6f0248a63cb308b8
collect_env.py output: https://gist.github.com/stacksmash76/cb507dd4de89d947bd3e5c4d14eb6813
Packages (I'm on Arch):
```
$ lspci -v | grep -A8 VGA
$ hwinfo --display
$ cat /sys/module/i915/parameters/enable_hangcheck
N
```