intel / intel-extension-for-pytorch

A Python package extending the official PyTorch to obtain extra performance on Intel platforms
Apache License 2.0
1.57k stars · 241 forks

Native API failed #359

Closed ghost closed 1 year ago

ghost commented 1 year ago

Describe the bug

I've been getting native API errors on certain compute workloads. While this has been mentioned in other tickets (to some extent), I'm not sure it's a duplicate. If it is, please close.

Examples

  1. Trying code from issue ticket #358 gives me: RuntimeError: Native API failed. Native API returns: -5 (PI_ERROR_OUT_OF_RESOURCES) -5 (PI_ERROR_OUT_OF_RESOURCES)

  2. Trying https://github.com/vladmandic/automatic (Stable Diffusion), gives me:

    Exception: Native API failed. Native API returns: -1 (PI_ERROR_DEVICE_NOT_FOUND) -1 (PI_ERROR_DEVICE_NOT_FOUND)
    RuntimeError: Native API failed. Native API returns: -997 (The plugin has emitted a backend specific error) -997 (The plugin has emitted a backend specific error)

    Please see gist for entire error message: https://gist.github.com/stacksmash76/1f75061d5749dcb9fe67f96db49a5dfe

In the dmesg output (even with hangcheck disabled), I get:

[  106.297075 <   92.070733>] i915 0000:09:00.0: [drm] GPU HANG: ecode 12:10:00000000
[  106.297079 <    0.000004>] i915 0000:09:00.0: [drm] python[3084] context reset due to GPU hang
[  118.576118 <   12.279039>] Fence expiration time out i915-0000:09:00.0:python[3084]:a22!
  3. Running code from https://github.com/intel/intel-extension-for-pytorch/issues/296#issuecomment-1430940262 gives:

    ...
    ==========
    Transfering 3.2 GB
    Bandwidth 105.1245177207754 GB/s
    ==========
    Traceback (most recent call last):
    File "<stdin>", line 6, in <module>
    RuntimeError: Native API failed. Native API returns: -997 (The plugin has emitted a backend specific error) -997 (The plugin has emitted a backend specific error)
  4. Running code from https://github.com/intel/intel-extension-for-pytorch/issues/296#issuecomment-1562260804 gives: RuntimeError: Native API failed. Native API returns: -997 (The plugin has emitted a backend specific error) -997 (The plugin has emitted a backend specific error) for the first 2 array statements, but the last 3 statements do not throw an exception.

  5. However, other workloads execute successfully, for example OpenAI's Whisper (the large-v2 model, which is 2.9 GB). It runs on the GPU, confirmed with intel_gpu_top: Blitter goes to about 70+% and [unknown] floats around 20%.
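Since some workloads hit these Native API errors while others run fine, a defensive wrapper can keep a job alive by retrying on CPU. This is only a sketch — it assumes torch and intel_extension_for_pytorch are importable (it falls back to CPU if they are not), and the helper name run_with_cpu_fallback is mine, not part of IPEX:

```python
def run_with_cpu_fallback(fn, *args):
    """Try fn on the xpu device; on a Native API RuntimeError, retry on CPU.

    Sketch only: assumes torch + intel_extension_for_pytorch; if either is
    missing, or no xpu device is visible, it simply runs fn on the CPU inputs.
    """
    try:
        import torch
        import intel_extension_for_pytorch  # noqa: F401  (registers the "xpu" backend)
        if torch.xpu.is_available():
            try:
                # Move tensor arguments to the GPU; leave non-tensors alone.
                moved = [a.to("xpu") if hasattr(a, "to") else a for a in args]
                out = fn(*moved)
                return out.cpu() if hasattr(out, "cpu") else out
            except RuntimeError as e:
                if "Native API failed" not in str(e):
                    raise  # a different error; do not mask it
                # Fall through to the CPU path below.
    except ImportError:
        pass  # XPU stack not installed; use the CPU path
    return fn(*args)
```

The wrapper deliberately re-raises RuntimeErrors that do not mention "Native API failed", so genuine bugs are not silently hidden.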

Versions

clinfo output: https://gist.github.com/stacksmash76/082a9c71a599286a6f0248a63cb308b8
collect_env.py output: https://gist.github.com/stacksmash76/cb507dd4de89d947bd3e5c4d14eb6813

Packages (I'm on Arch):

intel-compute-runtime 23.09.25812.14-1
intel-gmmlib 22.3.3-1
intel-gpu-tools 1.27-1
intel-graphics-compiler 1:1.0.13463.18-1
intel-media-driver 23.1.0-1
intel-media-sdk 23.2.2-1
intel-metee 3.1.5-1
intel-oneapi-basekit 2023.1.0.46401-1
intel-opencl-clang 15.0.0-1

lspci -v | grep -A8 VGA

09:00.0 VGA compatible controller: Intel Corporation DG2 [Arc A770] (rev 08) (prog-if 00 [VGA controller])
        Subsystem: Intel Corporation DG2 [Arc A770]
        Flags: bus master, fast devsel, latency 0, IRQ 72, IOMMU group 17
        Memory at fb000000 (64-bit, non-prefetchable) [size=16M]
        Memory at 7800000000 (64-bit, prefetchable) [size=16G]
        Expansion ROM at fc000000 [disabled] [size=2M]
        Capabilities: [40] Vendor Specific Information: Len=0c <?>
        Capabilities: [70] Express Endpoint, MSI 00
        Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable+ 64bit+

hwinfo --display

11: PCI 900.0: 0300 VGA compatible controller (VGA)             
  [Created at pci.386]
  Unique ID: x1VA.nDfmnmUHKH4
  Parent ID: KILV.mr2N3fBJq5F
  SysFS ID: /devices/pci0000:00/0000:00:03.1/0000:07:00.0/0000:08:01.0/0000:09:00.0
  SysFS BusID: 0000:09:00.0
  Hardware Class: graphics card
  Model: "Intel VGA compatible controller"
  Vendor: pci 0x8086 "Intel Corporation"
  Device: pci 0x56a0 
  SubVendor: pci 0x8086 "Intel Corporation"
  SubDevice: pci 0x1020 
  Revision: 0x08
  Driver: "i915"
  Driver Modules: "i915"
  Memory Range: 0xfb000000-0xfbffffff (rw,non-prefetchable)
  Memory Range: 0x7800000000-0x7bffffffff (ro,non-prefetchable)
  Memory Range: 0xfc000000-0xfc1fffff (ro,non-prefetchable,disabled)
  IRQ: 72 (1222354 events)
  Module Alias: "pci:v00008086d000056A0sv00008086sd00001020bc03sc00i00"
  Driver Info #0:
    Driver Status: i915 is active
    Driver Activation Cmd: "modprobe i915"
  Config Status: cfg=new, avail=yes, need=no, active=unknown
  Attached to: #43 (PCI bridge)

Primary display adapter: #11

cat /sys/module/i915/parameters/enable_hangcheck
N

alexsin368 commented 1 year ago

I noticed you're running with Ubuntu 20.04. However, to run with Arc A770 or any Intel GPU, you need Ubuntu 22.04: https://intel.github.io/intel-extension-for-pytorch/xpu/latest/tutorials/installation.html.

I created an environment using the software versions you mentioned, and I'm able to reproduce Example 1 (issue escalated to team) but not Examples 3 or 4. Example 2 could be related to your OS.

Please update your OS and retry all your examples.

ghost commented 1 year ago

> I noticed you're running with Ubuntu 20.04. However, to run with Arc A770 or any Intel GPU, you need Ubuntu 22.04: https://intel.github.io/intel-extension-for-pytorch/xpu/latest/tutorials/installation.html.

I'm running Arch Linux (I know, it's not an officially supported distro). I already have an Ubuntu 22.04 partition with the latest oneAPI and pytorch-gpu packages. Everything works correctly there, except examples 1 and 2 (same bug?).

I opened this ticket because example 3 used to run fine on 2023.0.0 (or maybe it was the version before that one? I can't remember exactly) on Arch Linux. So it seems to me there is some regression going on.

...

Regarding Native API errors, it would be nice if they contained more info. What I know is that PI_ERROR_DEVICE_NOT_FOUND is a faux error: my device is always available. It works with OpenAI's Whisper, and sycl-ls returns:

[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2023.15.3.0.20_160000]
[opencl:cpu:1] Intel(R) OpenCL, AMD Ryzen 5 5600 6-Core Processor               3.0 [2023.15.3.0.20_160000]
[opencl:gpu:2] Intel(R) OpenCL HD Graphics, Intel(R) Arc(TM) A770 Graphics 3.0 [23.09.25812]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.25812]

I also compiled this example with dpcpp, and it found my device and worked fine.
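For checking the same thing from the Python side — whether the device that sycl-ls and dpcpp can see is also visible to the PyTorch stack — something like the following works. A minimal sketch, assuming torch plus intel_extension_for_pytorch (which registers the "xpu" backend); it reports what is missing instead of raising:

```python
def xpu_status():
    """Report whether an xpu device is visible from Python.

    Sketch only: returns a human-readable string rather than raising,
    so it can be run on machines without the Intel GPU stack installed.
    """
    try:
        import torch
    except ImportError:
        return "torch not installed"
    try:
        import intel_extension_for_pytorch  # noqa: F401  (registers "xpu")
    except ImportError:
        return "intel_extension_for_pytorch not installed"
    if not torch.xpu.is_available():
        return "no xpu device visible"
    # e.g. "1 xpu device(s): Intel(R) Arc(TM) A770 Graphics"
    return f"{torch.xpu.device_count()} xpu device(s): {torch.xpu.get_device_name(0)}"

print(xpu_status())
```

If this prints a device name while a workload still raises PI_ERROR_DEVICE_NOT_FOUND, that supports the "faux error" reading above.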

This seems like a bug in Intel's stack somewhere related to timeout/hanging/locking.

Bottom line: it works on Ubuntu 22.04, that's for certain (minus examples 1 & 2, of course). In any case, I will wait for the next release and report back (PI_ERROR_DEVICE_NOT_FOUND/PI_ERROR_COMMAND_EXECUTION_FAILURE are llvm/dpcpp related [?]; maybe they should be transferred there if not fixed by the next release?).

tye1 commented 1 year ago

@stacksmash76 This issue has the same root cause as https://github.com/intel/intel-extension-for-pytorch/issues/358. The MKL routine gesvd crashes with certain inputs. We are working with MKL team for fixing this.
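Since the root cause is the MKL gesvd routine, a quick way to check a given stack is to run a small SVD (torch.linalg.svd is backed by LAPACK/MKL SVD routines) and look at the reconstruction error. This is a generic smoke test, not the exact failing input from #358 — that input is in the linked issue; the sketch assumes torch is importable and degrades gracefully if it is not:

```python
def svd_smoke_test(device="cpu"):
    """Run a small SVD and return the max reconstruction error.

    Sketch only: pass device="xpu" (with intel_extension_for_pytorch
    loaded) to exercise the GPU path; returns None if torch is missing.
    """
    try:
        import torch
    except ImportError:
        return None  # torch not installed; nothing to check
    a = torch.randn(64, 32, device=device)
    u, s, vh = torch.linalg.svd(a, full_matrices=False)
    # A healthy SVD reconstructs a to within float32 rounding error.
    return (u @ torch.diag(s) @ vh - a).abs().max().item()
```

On the affected versions the xpu path would raise the Native API error instead of returning; on a fixed stack it should return a tiny value on both devices.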

alexsin368 commented 1 year ago

@stacksmash76 Our team has triaged and fixed this issue. It will be available in the next IPEX release.

alexsin368 commented 1 year ago

@stacksmash76 IPEX v2.0.110+xpu is now available: https://github.com/intel/intel-extension-for-pytorch/releases/tag/v2.0.110%2Bxpu

Let us know if your issue is resolved using this version.

ghost commented 1 year ago

Thanks for the heads up. Currently I'm on vacation and will be able to revisit this in 2 weeks at the earliest.

ghost commented 1 year ago

> @stacksmash76 IPEX v2.0.110+xpu is now available: https://github.com/intel/intel-extension-for-pytorch/releases/tag/v2.0.110%2Bxpu
>
> Let us know if your issue is resolved using this version.

I can confirm that this issue has been fixed. Thank you for your work.