Floating point exception (core dumped) in OpenCL and Level Zero in Ubuntu

intel / compute-runtime

Intel® Graphics Compute Runtime for oneAPI Level Zero and OpenCL™ Driver

MIT License


Floating point exception (core dumped) in OpenCL and Level Zero in Ubuntu #613

Open jjfumero opened 1 year ago

jjfumero commented 1 year ago

Floating point exception (core dumped) in OpenCL and Level Zero when using both iGPU and discrete ARC GPU.

How to reproduce? I have enabled both Intel integrated GPU (Intel UHD Graphics 770) and Intel ARC 750 GPU.

When running level zero example: https://github.com/oneapi-src/level-zero/blob/master/samples/zello_world/zello_world.cpp

./bin/zello_world 
Driver initialized.
zelLoaderGetVersions number of components found: 1
... 
Congratulations, the device completed execution!
Floating point exception (core dumped)              <<<<

The error from DMESG:

[  375.561593] traps: zello_world[6097] trap divide error ip:7ffaf9c0edc3 sp:7fffd6ed12d0 error:0 in libze_intel_gpu.so.1.3.24595.35[7ffaf971f000+501000]

The same happens with OpenCL (e.g., running the typical clinfo program):

$ dmesg
[  643.015502] traps: clinfo[6344] trap divide error ip:7f89d5071103 sp:7ffcc7a26b40 error:0 in libigdrcl.so[7f89d4987000+6fb000]

I am using:

Compute runtime: 22.43.24595.35
OS: Ubuntu 22.04.1 LTS
Kernel: 5.17.0-1019-oem

JablonskiMateusz commented 1 year ago

Hi @jjfumero, could you try a newer kernel as described here: https://dgpu-docs.intel.com/installation-guides/ubuntu/ubuntu-jammy-arc.html#step-2-install-linux-oem-kernel ?

jjfumero commented 1 year ago

Did not work. I could install the kernel linux-image-5.17.0-1020-oem, but it did not recognize the ARC GPU; clinfo lists only the HD Graphics. I do have Secure Boot enabled, and I noticed it did not prompt me again to enroll the new key. Is there any way to force it?

JablonskiMateusz commented 1 year ago

please check in dmesg if i915 initialized the device properly

jjfumero commented 1 year ago

Kernel linux-image-5.17.0-1020-oem

Nothing related to i915 is displayed in dmesg:

$ sudo dmesg | grep i915

Clinfo:

$ clinfo 
Number of platforms                               0

Kernel linux-image-5.17.0-1019-oem

$ sudo dmesg | grep i915

[    6.587944] i915 0000:00:02.0: [drm] GT count: 1, enabled: 1
[    6.588626] i915 0000:00:02.0: [drm] VT-d active for gfx access
[    6.588632] fb0: switching to i915 from EFI VGA
[    6.588978] i915 0000:00:02.0: vgaarb: deactivate vga console
[    6.589040] i915 0000:00:02.0: [drm] Using Transparent Hugepages
[    6.589742] i915 0000:00:02.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[    6.590532] mei_hdcp 0000:00:16.0-b638ab7e-94e2-4ea2-a552-d1c54b627f04: bound 0000:00:02.0 (ops i915_hdcp_component_ops [i915])
[    6.592042] i915 0000:00:02.0: [drm] Finished loading DMC firmware i915/adls_dmc_ver2_01.bin (v2.1)
[    7.258638] i915 0000:00:02.0: [drm] [ENCODER:235:DDI A/PHY A] failed to retrieve link info, disabling eDP
[    7.268136] i915 0000:00:02.0: [drm] GuC firmware i915/tgl_guc_70.5.4.bin version 70.5.4
[    7.268141] i915 0000:00:02.0: [drm] HuC firmware i915/tgl_huc_7.9.3.bin version 7.9.3
[    7.283496] i915 0000:00:02.0: [drm] HuC authenticated
[    7.283508] i915 0000:00:02.0: [drm] GuC submission disabled
[    7.283512] i915 0000:00:02.0: [drm] GuC SLPC disabled
[    7.324260] [drm] Initialized i915 1.6.0 20201103 for 0000:00:02.0 on minor 0
[    7.366002] snd_hda_intel 0000:00:1f.3: bound 0000:00:02.0 (ops i915_audio_component_bind_ops [i915])
[    7.366345] i915 0000:03:00.0: [drm] GT count: 1, enabled: 1
[    7.367252] i915 0000:03:00.0: [drm] VT-d active for gfx access
[    7.367274] i915 0000:03:00.0: [drm] Using Transparent Hugepages
[    7.367320] i915 0000:03:00.0: [drm] Local memory IO size: 0x00000001fc000000
[    7.367321] i915 0000:03:00.0: [drm] Local memory available: 0x00000001fc000000
[    7.378074] fbcon: i915drmfb (fb0) is primary device
[    7.414962] i915 0000:03:00.0: [drm] Finished loading DMC firmware i915/dg2_dmc_ver2_07.bin (v2.7)
[    7.430491] i915 0000:03:00.0: [drm] GuC firmware i915/dg2_guc_70.5.4.bin version 70.5.4
[    7.430493] i915 0000:03:00.0: [drm] HuC firmware i915/dg2_huc_7.10.3_gsc.bin version 7.10.3
[    7.443165] i915 0000:03:00.0: [drm] GuC submission enabled
[    7.443166] i915 0000:03:00.0: [drm] GuC SLPC enabled
[    7.443547] i915 0000:03:00.0: [drm] GuC RC: enabled
[    7.447578] i915 0000:00:02.0: [drm] fb0: i915drmfb frame buffer device
[    7.474784] [drm] Initialized i915 1.6.0 20201103 for 0000:03:00.0 on minor 1
[    7.476878] snd_hda_intel 0000:04:00.0: bound 0000:03:00.0 (ops i915_audio_component_bind_ops [i915])
[    7.477131] i915 0000:03:00.0: Could not add device for DVSEC id 2
[    8.835210] i915 0000:03:00.0: [drm] fb1: i915drmfb frame buffer device
[    8.878570] mei_gsc i915.mei-gscfi.768: FW not ready: resetting: dev_state = 2 pxp = 0
[    8.878595] mei_gsc i915.mei-gscfi.768: unexpected reset: dev_state = ENABLED fw status = 00000345 84670000 00000000 00000000 E0020002 00000000
[    8.879358] mei_gsc i915.mei-gsc.768: FW not ready: resetting: dev_state = 2 pxp = 2
[    8.879378] mei_gsc i915.mei-gsc.768: unexpected reset: dev_state = ENABLED fw status = 00000345 84670000 00000000 00000000 E0020002 00000000
[    8.881855] Creating 4 MTD partitions on "i915.spi.768":
[    8.881857] 0x000000000000-0x000000001000 : "i915.spi.768.DESCRIPTOR"
[    8.882761] 0x000000001000-0x0000005f0000 : "i915.spi.768.GSC"
[    8.885096] 0x0000005f0000-0x0000007f0000 : "i915.spi.768.OptionROM"
[    8.887271] 0x0000007f0000-0x000000800000 : "i915.spi.768.DAM"
[    9.282972] i915 0000:03:00.0: [drm] HuC authenticated
[    9.282977] mei_pxp i915.mei-gsc.768-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1: bound 0000:03:00.0 (ops i915_pxp_tee_component_ops [i915])

I can see both GPUs with clinfo but with the floating point exception error already described.
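A quick way to answer the "did i915 initialize the device" question from a saved dmesg log is to look for the "[drm] Initialized i915 ... for <PCI address> on minor N" lines, as in the output above. The following is only a minimal sketch: the function name `initialized_devices` and the regular expression are illustrative assumptions based on the line format seen in this thread, not part of any driver tooling.

```python
import re

# Matches lines of the form seen in this thread, e.g.
# [    7.474784] [drm] Initialized i915 1.6.0 20201103 for 0000:03:00.0 on minor 1
INIT = re.compile(r"\[drm\] Initialized i915 \S+ \S+ for (\S+) on minor (\d+)")


def initialized_devices(dmesg_text):
    """Return {pci_address: drm_minor} for each device i915 reports as initialized."""
    return {m.group(1): int(m.group(2)) for m in INIT.finditer(dmesg_text)}


# Sample lines copied from the 5.17.0-1019-oem log above.
sample = """\
[    7.324260] [drm] Initialized i915 1.6.0 20201103 for 0000:00:02.0 on minor 0
[    7.443165] i915 0000:03:00.0: [drm] GuC submission enabled
[    7.474784] [drm] Initialized i915 1.6.0 20201103 for 0000:03:00.0 on minor 1
"""

if __name__ == "__main__":
    for pci, minor in initialized_devices(sample).items():
        print(f"i915 initialized {pci} (DRM minor {minor})")
```

An empty result, as with the 5.17.0-1020-oem boot above where `dmesg | grep i915` printed nothing, would mean the driver never bound to either GPU.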

carlewis commented 1 year ago

I am facing the same issue. For my purposes I'm focusing on the output of the sycl-ls command. I followed the installation guide, including the updated kernel 5.17.0-1020-oem.

One thing to note: if I run a Docker container with access to the GPU, the issue does not happen. Running the command with strace and comparing the output in the container and on the host shows that the host run is killed by SIGFPE at the point where the container run goes on to call munmap. This is the output for the failing command:

[pid 27187] munmap(0x7f97abae9000, 4096) = 0
[pid 27187] ioctl(5, DRM_IOCTL_GEM_CLOSE, 0x7fff2ca2cd20) = 0
[pid 27187] ioctl(5, DRM_IOCTL_I915_GEM_CONTEXT_DESTROY, 0x7fff2ca2cd80) = 0
[pid 27187] munmap(0x7f97abae7000, 4096) = 0
[pid 27187] ioctl(5, DRM_IOCTL_GEM_CLOSE, 0x7fff2ca2cd20) = 0
[pid 27187] ioctl(5, DRM_IOCTL_I915_GEM_CONTEXT_DESTROY, 0x7fff2ca2cd80) = 0
[pid 27187] ioctl(6, DRM_IOCTL_I915_GEM_WAIT or DRM_IOCTL_RADEON_GEM_OP, 0x7fff2ca2cd50) = 0
[pid 27187] --- SIGFPE {si_signo=SIGFPE, si_code=FPE_INTDIV, si_addr=0x7f97aa04f2b3} ---
[pid 27188] <... futex resumed>)        = ?
[pid 27189] <... futex resumed>)        = ?
[pid 27189] +++ killed by SIGFPE (core dumped) +++
[pid 27188] +++ killed by SIGFPE (core dumped) +++
+++ killed by SIGFPE (core dumped) +++
Floating point exception (core dumped)

And this for the process running in a container:

[pid   268] munmap(0x7fe101806000, 4096) = 0
[pid   268] ioctl(5, DRM_IOCTL_GEM_CLOSE, 0x7ffde8462fe0) = 0
[pid   268] ioctl(5, DRM_IOCTL_I915_GEM_CONTEXT_DESTROY, 0x7ffde8463030) = 0
[pid   268] munmap(0x7fe101804000, 4096) = 0
[pid   268] ioctl(5, DRM_IOCTL_GEM_CLOSE, 0x7ffde8462fe0) = 0
[pid   268] ioctl(5, DRM_IOCTL_I915_GEM_CONTEXT_DESTROY, 0x7ffde8463030) = 0
[pid   268] ioctl(6, DRM_IOCTL_I915_GEM_WAIT or DRM_IOCTL_RADEON_GEM_OP, 0x7ffde8463000) = 0
[pid   268] munmap(0x7fe101802000, 4096) = 0                                                      <<<-----
[pid   268] ioctl(6, DRM_IOCTL_GEM_CLOSE, 0x7ffde8462fe0) = 0
[pid   268] ioctl(6, DRM_IOCTL_I915_GEM_CONTEXT_DESTROY, 0x7ffde8463030) = 0
[pid   268] ioctl(6, DRM_IOCTL_I915_GEM_WAIT or DRM_IOCTL_RADEON_GEM_OP, 0x7ffde8463000) = 0
[pid   268] munmap(0x7fe101800000, 4096) = 0
[pid   268] ioctl(6, DRM_IOCTL_GEM_CLOSE, 0x7ffde8462fe0) = 0
[pid   268] ioctl(6, DRM_IOCTL_I915_GEM_WAIT or DRM_IOCTL_RADEON_GEM_OP, 0x7ffde8463000) = 0
[pid   268] munmap(0x7fe1017fe000, 4096) = 0
[pid   268] ioctl(6, DRM_IOCTL_GEM_CLOSE, 0x7ffde8462fe0) = 0

I can provide more details if it helps.
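The host/container comparison described above can be automated once pids and addresses are masked out, since those differ between any two runs. This is a rough sketch under that assumption; `normalize` and `first_divergence` are hypothetical helper names, and the sample lines are the tails of the two traces quoted above.

```python
import re

HEX = re.compile(r"0x[0-9a-f]+")
PID = re.compile(r"^\[pid\s+\d+\]\s*")


def normalize(line):
    """Drop the [pid N] prefix and mask hex addresses so two runs are comparable."""
    return HEX.sub("ADDR", PID.sub("", line.strip()))


def first_divergence(trace_a, trace_b):
    """Return (index, line_a, line_b) for the first differing line, or None."""
    a = [normalize(l) for l in trace_a if l.strip()]
    b = [normalize(l) for l in trace_b if l.strip()]
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i, x, y
    return None


host = [
    "[pid 27187] munmap(0x7f97abae7000, 4096) = 0",
    "[pid 27187] ioctl(5, DRM_IOCTL_GEM_CLOSE, 0x7fff2ca2cd20) = 0",
    "[pid 27187] ioctl(5, DRM_IOCTL_I915_GEM_CONTEXT_DESTROY, 0x7fff2ca2cd80) = 0",
    "[pid 27187] ioctl(6, DRM_IOCTL_I915_GEM_WAIT or DRM_IOCTL_RADEON_GEM_OP, 0x7fff2ca2cd50) = 0",
    "[pid 27187] --- SIGFPE {si_signo=SIGFPE, si_code=FPE_INTDIV, si_addr=0x7f97aa04f2b3} ---",
]
container = [
    "[pid   268] munmap(0x7fe101804000, 4096) = 0",
    "[pid   268] ioctl(5, DRM_IOCTL_GEM_CLOSE, 0x7ffde8462fe0) = 0",
    "[pid   268] ioctl(5, DRM_IOCTL_I915_GEM_CONTEXT_DESTROY, 0x7ffde8463030) = 0",
    "[pid   268] ioctl(6, DRM_IOCTL_I915_GEM_WAIT or DRM_IOCTL_RADEON_GEM_OP, 0x7ffde8463000) = 0",
    "[pid   268] munmap(0x7fe101802000, 4096) = 0",
]

if __name__ == "__main__":
    print(first_divergence(host, container))
```

On these excerpts the two runs agree up to the GEM_WAIT ioctl on fd 6 and differ on the next line, where the host shows SIGFPE and the container shows munmap.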

eero-t commented 1 year ago

Are both the container and your host using the same version of the compute driver package, from the same repository (what does apt policy say for them)?

And if you run the command under GDB (gdb ./bin/zello_world), what backtrace does the bt command give?

jjfumero commented 1 year ago

Just an update. The floating point exception seems to be gone with the latest Compute Runtime: 22.53.25242.13

The Linux kernel I am using is still 5.17.0-1019-oem. The 5.17.0-1020-oem does not even boot in my case. I might do a fresh install in the near future though so I can report back if I still have the issue.

dkalpakchi commented 1 year ago

Hi!

I've just come across the very same bug (Floating point exception (core dumped)) using 5.17.0-1020-oem kernel and Intel Arc A770. I've installed everything following the guide here: https://dgpu-docs.intel.com/installation-guides/ubuntu/ubuntu-jammy-arc.html

clinfo reports the driver version 22.49.25018.23, so it's probably not the latest. Should I compile the latest from sources? If so, is there any documentation to follow?

dmesg reports the following

[  247.968269] traps: clinfo[4442] trap divide error ip:7fd49f70a2b3 sp:7ffc92dfd400 error:0 in libigdrcl.so[7fd49f08e000+68e000]

I've also tried running intel-extension-for-pytorch for a simple inference on a Transformer-based model, and the fans were ramping like crazy (I really thought they would jump out of the case!). In that instance I got similar trap divide errors in dmesg, which I also include here as they might originate from the same source as the clinfo problem.

[  981.122019] traps: python3[6234] trap divide error ip:7f7716a668d3 sp:7ffd39358bb0 error:0 in libze_intel_gpu.so.1.3.25018.23[7f7716593000+4e4000]
[ 1033.949548] traps: xpu-smi[6323] trap divide error ip:7f5aff2168d3 sp:7ffe96a070b0 error:0 in libze_intel_gpu.so.1.3.25018.23[7f5afed43000+4e4000]
[ 1046.942337] traps: xpu-smi[6384] trap divide error ip:7fdcd41318d3 sp:7ffd19657e30 error:0 in libze_intel_gpu.so.1.3.25018.23[7fdcd3c5e000+4e4000]
[ 1070.651746] traps: python3[6426] trap divide error ip:7fdfdd0fc8d3 sp:7fff158b8a30 error:0 in libze_intel_gpu.so.1.3.25018.23[7fdfdcc29000+4e4000]
[ 1088.444323] traps: xpu-smi[6489] trap divide error ip:7faaf0a3d8d3 sp:7ffdd252a130 error:0 in libze_intel_gpu.so.1.3.25018.23[7faaf056a000+4e4000]
[ 1317.335139] traps: clinfo[7361] trap divide error ip:7f57491962b3 sp:7ffede219240 error:0 in libigdrcl.so[7f5748b1a000+68e000]

I'm very new to Intel's computing ecosystem, so do let me know if I made some absolutely obvious mistakes or if you need more information. Looking forward to finding a solution for this!
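One thing the trap lines above allow without any debug symbols is checking whether every crash hits the same instruction: subtracting the mapping start (the first number in the square brackets) from `ip` gives an offset that is independent of address randomization. The sketch below assumes the "traps:" line format shown in this thread; the regular expression and the name `trap_offsets` are illustrative. Note that the offset is relative to the start of the mapped region reported by the kernel, which is not necessarily the library's load base.

```python
import re

# Matches the kernel "traps:" lines quoted in this thread, e.g.
# traps: clinfo[4442] trap divide error ip:7fd49f70a2b3 sp:... error:0 in libigdrcl.so[7fd49f08e000+68e000]
TRAP = re.compile(
    r"traps: (?P<prog>\S+)\[\d+\] trap divide error "
    r"ip:(?P<ip>[0-9a-f]+) .* in (?P<lib>\S+)\[(?P<base>[0-9a-f]+)\+[0-9a-f]+\]"
)


def trap_offsets(dmesg_text):
    """Yield (program, library, offset) with offset = ip - start of the mapping."""
    for m in TRAP.finditer(dmesg_text):
        offset = int(m.group("ip"), 16) - int(m.group("base"), 16)
        yield m.group("prog"), m.group("lib"), offset


# Four of the lines from the dmesg excerpt above.
sample = """\
[  247.968269] traps: clinfo[4442] trap divide error ip:7fd49f70a2b3 sp:7ffc92dfd400 error:0 in libigdrcl.so[7fd49f08e000+68e000]
[  981.122019] traps: python3[6234] trap divide error ip:7f7716a668d3 sp:7ffd39358bb0 error:0 in libze_intel_gpu.so.1.3.25018.23[7f7716593000+4e4000]
[ 1033.949548] traps: xpu-smi[6323] trap divide error ip:7f5aff2168d3 sp:7ffe96a070b0 error:0 in libze_intel_gpu.so.1.3.25018.23[7f5afed43000+4e4000]
[ 1317.335139] traps: clinfo[7361] trap divide error ip:7f57491962b3 sp:7ffede219240 error:0 in libigdrcl.so[7f5748b1a000+68e000]
"""

if __name__ == "__main__":
    for prog, lib, offset in trap_offsets(sample):
        print(f"{prog:8s} {lib:36s} {offset:#x}")
```

For the lines in this comment, every libigdrcl.so trap reduces to offset 0x67c2b3 and every libze_intel_gpu.so trap to 0x4d38d3, which suggests a single faulting division in each library rather than several unrelated crashes.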

eero-t commented 1 year ago

@dkalpakchi What does the kernel report for the i915 GPU driver (is it successfully loaded): sudo dmesg | grep i915 ?

dkalpakchi commented 1 year ago

@eero-t Thanks for a swift reply! Here is the output of the command you suggested:

[    3.904252] i915 0000:00:02.0: [drm] GT count: 1, enabled: 1
[    3.904654] i915 0000:00:02.0: [drm] VT-d active for gfx access
[    3.904659] fb0: switching to i915 from EFI VGA
[    3.904761] i915 0000:00:02.0: vgaarb: deactivate vga console
[    3.904820] i915 0000:00:02.0: [drm] Using Transparent Hugepages
[    3.905288] i915 0000:00:02.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[    3.906901] mei_hdcp 0000:00:16.0-b638ab7e-94e2-4ea2-a552-d1c54b627f04: bound 0000:00:02.0 (ops i915_hdcp_component_ops [i915])
[    3.907225] i915 0000:00:02.0: [drm] Finished loading DMC firmware i915/tgl_dmc_ver2_12.bin (v2.12)
[    3.913383] mei_pxp 0000:00:16.0-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1: bound 0000:00:02.0 (ops i915_pxp_tee_component_ops [i915])
[    3.913466] i915 0000:00:02.0: [drm] Protected Xe Path (PXP) protected content support initialized
[    3.941214] [drm] Initialized i915 1.6.0 20201103 for 0000:00:02.0 on minor 0
[    3.975035] snd_hda_intel 0000:00:1f.3: bound 0000:00:02.0 (ops i915_audio_component_bind_ops [i915])
[    3.975601] i915 0000:03:00.0: enabling device (0000 -> 0002)
[    3.975755] i915 0000:03:00.0: [drm] GT count: 1, enabled: 1
[    3.977079] i915 0000:03:00.0: [drm] VT-d active for gfx access
[    3.977103] i915 0000:03:00.0: [drm] Using Transparent Hugepages
[    3.977148] i915 0000:03:00.0: [drm] Local memory IO size: 0x00000003fa000000
[    3.977149] i915 0000:03:00.0: [drm] Local memory available: 0x00000003fa000000
[    4.003704] fbcon: i915drmfb (fb0) is primary device
[    4.046327] i915 0000:03:00.0: [drm] Finished loading DMC firmware i915/dg2_dmc_ver2_07.bin (v2.7)
[    4.052618] i915 0000:03:00.0: [drm] GuC firmware i915/dg2_guc_70.6.2.bin version 70.6.2
[    4.052620] i915 0000:03:00.0: [drm] HuC firmware i915/dg2_huc_7.10.3_gsc.bin version 7.10.3
[    4.062876] i915 0000:00:02.0: [drm] fb0: i915drmfb frame buffer device
[    4.066770] i915 0000:03:00.0: [drm] GuC submission enabled
[    4.066788] i915 0000:03:00.0: [drm] GuC SLPC enabled
[    4.067109] i915 0000:03:00.0: [drm] GuC RC: enabled
[    4.089113] [drm] Initialized i915 1.6.0 20201103 for 0000:03:00.0 on minor 1
[    4.090379] snd_hda_intel 0000:04:00.0: bound 0000:03:00.0 (ops i915_audio_component_bind_ops [i915])
[    4.090899] i915 0000:03:00.0: [drm] Cannot find any crtc or sizes
[    4.090948] i915 0000:03:00.0: [drm] Cannot find any crtc or sizes
[    4.090971] i915 0000:03:00.0: [drm] Cannot find any crtc or sizes
[    4.101007] Creating 4 MTD partitions on "i915.spi.768":
[    4.101017] 0x000000000000-0x000000001000 : "i915.spi.768.DESCRIPTOR"
[    4.102032] 0x000000001000-0x0000005f0000 : "i915.spi.768.GSC"
[    4.102799] 0x0000005f0000-0x0000007f0000 : "i915.spi.768.OptionROM"
[    4.103605] 0x0000007f0000-0x000000800000 : "i915.spi.768.DAM"
[    4.107650] mei_gsc i915.mei-gscfi.768: cl:host=01 me=33 fw disconnect request received
[    4.107672] mei i915.mei-gscfi.768-e2c2afa2-3817-4d19-9d95-06b16b588a5d: Could not read FW version ret = -19
[    4.107673] mei i915.mei-gscfi.768-e2c2afa2-3817-4d19-9d95-06b16b588a5d: FW version command failed -5
[    4.514431] i915 0000:03:00.0: [drm] HuC authenticated
[    4.514435] mei_pxp i915.mei-gsc.768-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1: bound 0000:03:00.0 (ops i915_pxp_tee_component_ops [i915])

I don't see any immediate signs that the driver has failed to load, but maybe [ 4.107650] mei_gsc i915.mei-gscfi.768: cl:host=01 me=33 fw disconnect request received is an indication that it didn't?

eero-t commented 1 year ago

I don't think that's a problem (at least not one related to this), and the traps could be caused by something unrelated to the GPU side.

What's your host CPU?

dkalpakchi commented 1 year ago

I'm using Intel Nuc 11 extreme kit with i7-11700B CPU

eero-t commented 1 year ago

> I'm using Intel Nuc 11 extreme kit with i7-11700B CPU

TGL with integrated GPU should be fine...

> 22.49.25018.23

I tried the following stack (which I built a few months ago):

GMMlib: intel-gmmlib-22.3.3
SPIRV-SDK: sdk-1.3.236.0/sdk-1.3.236.0 (headers/tools)
SPIRV-LLVM: libllvmspirvlib-12-dev:amd64:12.0.0-3 (Ubuntu package)
OpenCL-Clang: libopencl-clang-12-dev:amd64:12.0.0-3 (Ubuntu package)
VC-intrinsics: v0.10.1
Graphics Compiler: igc-1.0.12662.1 (IGC)
Level-Zero API: v1.8.12
compute-runtime: 22.49.25018.21

With the latest drm-tip kernel (6.3.0-rc3) on a TGL i7-11800H, clinfo works fine.

=> issue may be specific to your setup.

Things that you could try next...

Check what valgrind[1] reports:

$ sudo apt install valgrind
$ valgrind clinfo

For PyTorch, and in case the issue does not reproduce under valgrind, try GDB:

$ sudo apt install gdb
$ gdb clinfo
...
(gdb) run
...
(gdb) bt

[1] https://valgrind.org/docs/manual/manual-intro.html#manual-intro.overview
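The interactive GDB session above can also be run non-interactively, which makes it easier to attach the output to a report. The sketch below uses gdb's `-batch`, `-ex`, and `--args` options; the helper names `gdb_backtrace_command` and `capture_backtrace` are made up for illustration.

```python
import shutil
import subprocess


def gdb_backtrace_command(program, *args):
    """Build a gdb invocation that runs the program, then prints a backtrace."""
    return ["gdb", "-batch", "-ex", "run", "-ex", "bt", "--args", program, *args]


def capture_backtrace(program, *args):
    """Run gdb and return its combined output, or None if gdb is not installed."""
    if shutil.which("gdb") is None:
        return None
    result = subprocess.run(
        gdb_backtrace_command(program, *args),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    return result.stdout


if __name__ == "__main__":
    # clinfo is the reproducer used in this thread.
    print(capture_backtrace("clinfo"))
```

Without debug symbols for the driver libraries the backtrace will most likely show only raw addresses, but the frame order and library names can still be useful.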

dkalpakchi commented 1 year ago

Thanks for your suggestions! I have now tried running valgrind and it returns Integer divide by zero in intel-opencl/libigdrcl.so (full logs: valgrind_2023_03_30.txt).

Currently I'm running the following setup:

Level Zero -- 1.8.8+i524~u22.04
Level Zero GPU -- 1.3.25018.23+i554~22.04
intel-opencl-icd -- 22.49.25018.23+i554~22.04 (which I assume is the package where the compute-runtime comes from? but maybe I'm wrong?)
libopencl-clang-12-dev -- 12.0.0-3
libllvmspirvlib-12-dev -- 12.0.0-3
libigdgmm-dev -- 22.3.3+i550~22.04

All of the above are installed on Ubuntu 22.04 with 5.17.0-1020-oem kernel and the apt manager doesn't suggest that there are any updates to these packages.
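A version list like the one above can be collected in one step with `dpkg-query`, which avoids copy errors when comparing a host against a container. This is a sketch only: the package names are the ones that appear verbatim in this comment, and `parse_versions` / `installed_versions` are hypothetical helper names.

```python
import subprocess


def parse_versions(dpkg_output):
    """Parse 'package version' lines into a {package: version} dict."""
    versions = {}
    for line in dpkg_output.splitlines():
        parts = line.split()
        if len(parts) == 2:
            versions[parts[0]] = parts[1]
    return versions


def installed_versions(packages):
    """Ask dpkg-query for the installed version of each package name."""
    out = subprocess.run(
        ["dpkg-query", "-W", "--showformat=${Package} ${Version}\\n", *packages],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,  # packages that are not installed are skipped
        text=True,
        check=False,
    ).stdout
    return parse_versions(out)


if __name__ == "__main__":
    names = [
        "intel-opencl-icd",
        "libigdgmm-dev",
        "libopencl-clang-12-dev",
        "libllvmspirvlib-12-dev",
    ]
    for pkg, version in installed_versions(names).items():
        print(f"{pkg} -- {version}")
```

Running the same script on the host and inside the container and diffing the two outputs would show directly whether the compute driver packages differ.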

Regarding 6.3.0-rc3 kernel, I haven't seen any mention of any newer kernel on the guide page (https://dgpu-docs.intel.com/installation-guides/ubuntu/ubuntu-jammy-arc.html), so should I compile that from scratch? Is it also Intel's oem kernel, or is it just a regular one?

I don't have SPIRV packages installed, because I don't know what they are for.

It might be that this issue is specific to the Level Zero version or to the version of the compute-runtime that I'm using, but as far as I understand, those versions seem to be recommended for Intel's extensions for PyTorch (https://dgpu-docs.intel.com/releases/stable_540_20221205.html)

Please let me know if you have any thoughts on how I could proceed further.

eero-t commented 1 year ago

> intel-opencl-icd -- 22.49.25018.23+i554~22.04 (which I assume is the package where the compute-runtime comes from? but maybe I'm wrong?)

Yes, that's the OpenCL GPU backend. But the compute-runtime project also provides the level-zero-gpu backend.

> Regarding 6.3.0-rc3 kernel, I haven't seen any mention of any newer kernel on the guide page (https://dgpu-docs.intel.com/installation-guides/ubuntu/ubuntu-jammy-arc.html), so should I compile that from scratch? Is it also Intel's oem kernel, or is it just a regular one?

The DKMS GPU modules support (are API-compatible with) only specific kernel versions.

Which ones are supported is documented here: https://github.com/intel-gpu/intel-gpu-i915-backports

(Latest kernel support comes through upstream.)

> I don't have SPIRV packages installed, because I don't know what they are for.

> It might be that this issue is specific to the Level Zero version or to the version of the compute-runtime that I'm using, but as far as I got, those versions seems to be recommended for Intel's extensions for PyTorch (https://dgpu-docs.intel.com/releases/stable_540_20221205.html)

FYI: Strictly speaking, clinfo does not use Level-Zero. While the OpenCL and Level-Zero API backend implementations (coming from the IGC and compute-runtime projects) share a lot of code, they are separate stacks with their own frontend and backend libraries.

> Thanks for your suggestions! I have now tried running valgrind and it returns Integer divide by zero in intel-opencl/libigdrcl.so (full logs: valgrind_2023_03_30.txt).

Dang, that backtrace does not list any symbols.

> Please let me know if you have any thoughts on how I could proceed further.

The compute-runtime devs need to look into that now; I don't think I can do more for this (I'm not a dev in this project)...