Open jjfumero opened 1 year ago
Hi @jjfumero, could you try a newer kernel as described here: https://dgpu-docs.intel.com/installation-guides/ubuntu/ubuntu-jammy-arc.html#step-2-install-linux-oem-kernel ?
Did not work. I could install the kernel `linux-image-5.17.0-1020-oem`, but it did not recognize the ARC GPU. `clinfo` lists only the HD Graphics.
I do have Secure Boot enabled, and I noticed it did not prompt again to enroll the new key. Is there any way to force it?
please check in dmesg if i915 initialized the device properly
Nothing related to `i915` is displayed in `dmesg`:
$ sudo dmesg | grep i915
Clinfo:
$ clinfo
Number of platforms 0
$ sudo dmesg | grep i915
[ 6.587944] i915 0000:00:02.0: [drm] GT count: 1, enabled: 1
[ 6.588626] i915 0000:00:02.0: [drm] VT-d active for gfx access
[ 6.588632] fb0: switching to i915 from EFI VGA
[ 6.588978] i915 0000:00:02.0: vgaarb: deactivate vga console
[ 6.589040] i915 0000:00:02.0: [drm] Using Transparent Hugepages
[ 6.589742] i915 0000:00:02.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[ 6.590532] mei_hdcp 0000:00:16.0-b638ab7e-94e2-4ea2-a552-d1c54b627f04: bound 0000:00:02.0 (ops i915_hdcp_component_ops [i915])
[ 6.592042] i915 0000:00:02.0: [drm] Finished loading DMC firmware i915/adls_dmc_ver2_01.bin (v2.1)
[ 7.258638] i915 0000:00:02.0: [drm] [ENCODER:235:DDI A/PHY A] failed to retrieve link info, disabling eDP
[ 7.268136] i915 0000:00:02.0: [drm] GuC firmware i915/tgl_guc_70.5.4.bin version 70.5.4
[ 7.268141] i915 0000:00:02.0: [drm] HuC firmware i915/tgl_huc_7.9.3.bin version 7.9.3
[ 7.283496] i915 0000:00:02.0: [drm] HuC authenticated
[ 7.283508] i915 0000:00:02.0: [drm] GuC submission disabled
[ 7.283512] i915 0000:00:02.0: [drm] GuC SLPC disabled
[ 7.324260] [drm] Initialized i915 1.6.0 20201103 for 0000:00:02.0 on minor 0
[ 7.366002] snd_hda_intel 0000:00:1f.3: bound 0000:00:02.0 (ops i915_audio_component_bind_ops [i915])
[ 7.366345] i915 0000:03:00.0: [drm] GT count: 1, enabled: 1
[ 7.367252] i915 0000:03:00.0: [drm] VT-d active for gfx access
[ 7.367274] i915 0000:03:00.0: [drm] Using Transparent Hugepages
[ 7.367320] i915 0000:03:00.0: [drm] Local memory IO size: 0x00000001fc000000
[ 7.367321] i915 0000:03:00.0: [drm] Local memory available: 0x00000001fc000000
[ 7.378074] fbcon: i915drmfb (fb0) is primary device
[ 7.414962] i915 0000:03:00.0: [drm] Finished loading DMC firmware i915/dg2_dmc_ver2_07.bin (v2.7)
[ 7.430491] i915 0000:03:00.0: [drm] GuC firmware i915/dg2_guc_70.5.4.bin version 70.5.4
[ 7.430493] i915 0000:03:00.0: [drm] HuC firmware i915/dg2_huc_7.10.3_gsc.bin version 7.10.3
[ 7.443165] i915 0000:03:00.0: [drm] GuC submission enabled
[ 7.443166] i915 0000:03:00.0: [drm] GuC SLPC enabled
[ 7.443547] i915 0000:03:00.0: [drm] GuC RC: enabled
[ 7.447578] i915 0000:00:02.0: [drm] fb0: i915drmfb frame buffer device
[ 7.474784] [drm] Initialized i915 1.6.0 20201103 for 0000:03:00.0 on minor 1
[ 7.476878] snd_hda_intel 0000:04:00.0: bound 0000:03:00.0 (ops i915_audio_component_bind_ops [i915])
[ 7.477131] i915 0000:03:00.0: Could not add device for DVSEC id 2
[ 8.835210] i915 0000:03:00.0: [drm] fb1: i915drmfb frame buffer device
[ 8.878570] mei_gsc i915.mei-gscfi.768: FW not ready: resetting: dev_state = 2 pxp = 0
[ 8.878595] mei_gsc i915.mei-gscfi.768: unexpected reset: dev_state = ENABLED fw status = 00000345 84670000 00000000 00000000 E0020002 00000000
[ 8.879358] mei_gsc i915.mei-gsc.768: FW not ready: resetting: dev_state = 2 pxp = 2
[ 8.879378] mei_gsc i915.mei-gsc.768: unexpected reset: dev_state = ENABLED fw status = 00000345 84670000 00000000 00000000 E0020002 00000000
[ 8.881855] Creating 4 MTD partitions on "i915.spi.768":
[ 8.881857] 0x000000000000-0x000000001000 : "i915.spi.768.DESCRIPTOR"
[ 8.882761] 0x000000001000-0x0000005f0000 : "i915.spi.768.GSC"
[ 8.885096] 0x0000005f0000-0x0000007f0000 : "i915.spi.768.OptionROM"
[ 8.887271] 0x0000007f0000-0x000000800000 : "i915.spi.768.DAM"
[ 9.282972] i915 0000:03:00.0: [drm] HuC authenticated
[ 9.282977] mei_pxp i915.mei-gsc.768-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1: bound 0000:03:00.0 (ops i915_pxp_tee_component_ops [i915])
I can see both GPUs with `clinfo`, but with the floating point exception error already described.
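In the log above, each GPU gets its own `Initialized i915` line: `0000:00:02.0` is the integrated GPU and `0000:03:00.0` is the discrete card (the one reporting local memory and `dg2` firmware). A minimal sketch for pulling just those lines out of a long kernel log follows; `i915_devices` is a hypothetical helper name, and the sample lines are copied from the log above.

```shell
# i915_devices: hypothetical helper. Reads kernel log text on stdin and prints
# one "<PCI address> minor=<N>" line per device that i915 finished initializing.
i915_devices() {
    sed -nE 's/.*Initialized i915 .* for ([0-9a-f:.]+) on minor ([0-9]+).*/\1 minor=\2/p'
}

# Sample input taken from the dmesg output in this thread.
printf '%s\n' \
    '[ 7.324260] [drm] Initialized i915 1.6.0 20201103 for 0000:00:02.0 on minor 0' \
    '[ 7.443165] i915 0000:03:00.0: [drm] GuC submission enabled' \
    '[ 7.474784] [drm] Initialized i915 1.6.0 20201103 for 0000:03:00.0 on minor 1' | i915_devices
# prints:
#   0000:00:02.0 minor=0
#   0000:03:00.0 minor=1

# On a live system: sudo dmesg | i915_devices
```

Two lines of output mean the kernel driver bound both devices, which points at user space (the compute runtime) rather than the kernel module when `clinfo` still crashes.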
I am facing the same issue. For my purposes I'm focusing on the output of the `sycl-ls` command. I followed the installation guide, including the updated kernel 5.17.0-1020-oem.
One thing to note: if I run a Docker container with access to the GPU, the issue does not happen. Running the command under `strace` and comparing the output in the container and on the host shows that the error is triggered when execution reaches a `munmap` syscall. This is the output for the failing command:
[pid 27187] munmap(0x7f97abae9000, 4096) = 0
[pid 27187] ioctl(5, DRM_IOCTL_GEM_CLOSE, 0x7fff2ca2cd20) = 0
[pid 27187] ioctl(5, DRM_IOCTL_I915_GEM_CONTEXT_DESTROY, 0x7fff2ca2cd80) = 0
[pid 27187] munmap(0x7f97abae7000, 4096) = 0
[pid 27187] ioctl(5, DRM_IOCTL_GEM_CLOSE, 0x7fff2ca2cd20) = 0
[pid 27187] ioctl(5, DRM_IOCTL_I915_GEM_CONTEXT_DESTROY, 0x7fff2ca2cd80) = 0
[pid 27187] ioctl(6, DRM_IOCTL_I915_GEM_WAIT or DRM_IOCTL_RADEON_GEM_OP, 0x7fff2ca2cd50) = 0
[pid 27187] --- SIGFPE {si_signo=SIGFPE, si_code=FPE_INTDIV, si_addr=0x7f97aa04f2b3} ---
[pid 27188] <... futex resumed>) = ?
[pid 27189] <... futex resumed>) = ?
[pid 27189] +++ killed by SIGFPE (core dumped) +++
[pid 27188] +++ killed by SIGFPE (core dumped) +++
+++ killed by SIGFPE (core dumped) +++
Floating point exception (core dumped)
And this for the process running in a container:
[pid 268] munmap(0x7fe101806000, 4096) = 0
[pid 268] ioctl(5, DRM_IOCTL_GEM_CLOSE, 0x7ffde8462fe0) = 0
[pid 268] ioctl(5, DRM_IOCTL_I915_GEM_CONTEXT_DESTROY, 0x7ffde8463030) = 0
[pid 268] munmap(0x7fe101804000, 4096) = 0
[pid 268] ioctl(5, DRM_IOCTL_GEM_CLOSE, 0x7ffde8462fe0) = 0
[pid 268] ioctl(5, DRM_IOCTL_I915_GEM_CONTEXT_DESTROY, 0x7ffde8463030) = 0
[pid 268] ioctl(6, DRM_IOCTL_I915_GEM_WAIT or DRM_IOCTL_RADEON_GEM_OP, 0x7ffde8463000) = 0
[pid 268] munmap(0x7fe101802000, 4096) = 0 <<<-----
[pid 268] ioctl(6, DRM_IOCTL_GEM_CLOSE, 0x7ffde8462fe0) = 0
[pid 268] ioctl(6, DRM_IOCTL_I915_GEM_CONTEXT_DESTROY, 0x7ffde8463030) = 0
[pid 268] ioctl(6, DRM_IOCTL_I915_GEM_WAIT or DRM_IOCTL_RADEON_GEM_OP, 0x7ffde8463000) = 0
[pid 268] munmap(0x7fe101800000, 4096) = 0
[pid 268] ioctl(6, DRM_IOCTL_GEM_CLOSE, 0x7ffde8462fe0) = 0
[pid 268] ioctl(6, DRM_IOCTL_I915_GEM_WAIT or DRM_IOCTL_RADEON_GEM_OP, 0x7ffde8463000) = 0
[pid 268] munmap(0x7fe1017fe000, 4096) = 0
[pid 268] ioctl(6, DRM_IOCTL_GEM_CLOSE, 0x7ffde8462fe0) = 0
I can provide more details if it helps.
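Comparing two `strace` logs by eye is easier once process IDs and pointer values are masked, since those differ on every run. A minimal sketch, assuming the logs use the `[pid N]` prefix shown above (what `strace -f` writes to stderr); `normalize_trace` and the file names are hypothetical.

```shell
# normalize_trace: hypothetical helper. Strips the "[pid N]" prefix and replaces
# every hex value with ADDR so that two strace logs can be compared line by line.
normalize_trace() {
    sed -E -e 's/^\[pid +[0-9]+\] //' -e 's/0x[0-9a-f]+/ADDR/g'
}

# One line from the host trace and one from the container trace above.
printf '%s\n' \
    '[pid 27187] munmap(0x7f97abae9000, 4096) = 0' \
    '[pid 268] munmap(0x7fe101806000, 4096) = 0' | normalize_trace
# Both lines print as: munmap(ADDR, 4096) = 0

# Usage against real logs (file names are placeholders, bash syntax):
#   strace -f sycl-ls 2> host.trace        # on the host
#   strace -f sycl-ls 2> container.trace   # inside the container
#   diff <(normalize_trace < host.trace) <(normalize_trace < container.trace)
```

With the noise masked, the first differing line in the `diff` output is the point where the host run diverges from the container run.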
Are both the container and your host using the same version of the compute driver package, from the same repository (what does `apt policy` say for them)?
And if you run the command under GDB (`gdb ./bin/zello_world`), what backtrace does the `bt` command give?
Just an update. The floating point exception seems to be gone with the latest Compute Runtime: 22.53.25242.13
The Linux kernel I am using is still `5.17.0-1019-oem`. The `5.17.0-1020-oem` does not even boot in my case. I might do a fresh install in the near future though so I can report back if I still have the issue.
Hi!
I've just come across the very same bug (`Floating point exception (core dumped)`) using the `5.17.0-1020-oem` kernel and an Intel Arc A770. I've installed everything following the guide here: https://dgpu-docs.intel.com/installation-guides/ubuntu/ubuntu-jammy-arc.html
`clinfo` reports the driver version `22.49.25018.23`, so it's probably not the latest. Should I compile the latest from sources? If so, is there any documentation to follow?
`dmesg` reports the following:
[ 247.968269] traps: clinfo[4442] trap divide error ip:7fd49f70a2b3 sp:7ffc92dfd400 error:0 in libigdrcl.so[7fd49f08e000+68e000]
I've also tried running `intel-extension-for-pytorch` for a simple inference on a Transformer-based model, and the fans were ramping like crazy (I really thought they would jump out of the case!). In that instance I got similar trap divide errors in `dmesg`, which I also include here as they might originate from the same source as the `clinfo` problem.
[ 981.122019] traps: python3[6234] trap divide error ip:7f7716a668d3 sp:7ffd39358bb0 error:0 in libze_intel_gpu.so.1.3.25018.23[7f7716593000+4e4000]
[ 1033.949548] traps: xpu-smi[6323] trap divide error ip:7f5aff2168d3 sp:7ffe96a070b0 error:0 in libze_intel_gpu.so.1.3.25018.23[7f5afed43000+4e4000]
[ 1046.942337] traps: xpu-smi[6384] trap divide error ip:7fdcd41318d3 sp:7ffd19657e30 error:0 in libze_intel_gpu.so.1.3.25018.23[7fdcd3c5e000+4e4000]
[ 1070.651746] traps: python3[6426] trap divide error ip:7fdfdd0fc8d3 sp:7fff158b8a30 error:0 in libze_intel_gpu.so.1.3.25018.23[7fdfdcc29000+4e4000]
[ 1088.444323] traps: xpu-smi[6489] trap divide error ip:7faaf0a3d8d3 sp:7ffdd252a130 error:0 in libze_intel_gpu.so.1.3.25018.23[7faaf056a000+4e4000]
[ 1317.335139] traps: clinfo[7361] trap divide error ip:7f57491962b3 sp:7ffede219240 error:0 in libigdrcl.so[7f5748b1a000+68e000]
I'm very new to Intel's computing ecosystem, so do let me know if I made some absolutely obvious mistakes or if you need more information. Looking forward to finding a solution for this!
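Each `trap divide error` line above carries the faulting instruction pointer (`ip:`) and, in brackets, the start address and size of the mapping that contains it. Subtracting the start address from `ip` gives an offset that does not change between runs, so it shows whether separate crashes hit the same instruction. A minimal sketch in POSIX shell; `fault_offset` is a hypothetical helper name, not part of any Intel tooling.

```shell
# fault_offset: hypothetical helper. Takes one kernel "traps:" line and prints
# "<library> 0x<offset>", where offset = ip - start of the reported mapping.
fault_offset() {
    # ip:<hex> is the faulting instruction pointer
    ip=$(printf '%s\n' "$1" | sed -nE 's/.* ip:([0-9a-f]+) .*/\1/p')
    # "in <lib>[<start>+<size>]" names the mapping that contains ip
    lib=$(printf '%s\n' "$1" | sed -nE 's/.* in ([^[]+)\[[0-9a-f]+\+[0-9a-f]+\].*/\1/p')
    base=$(printf '%s\n' "$1" | sed -nE 's/.* in [^[]+\[([0-9a-f]+)\+[0-9a-f]+\].*/\1/p')
    [ -n "$ip" ] && [ -n "$lib" ] && [ -n "$base" ] || return 1
    printf '%s 0x%x\n' "$lib" "$(( 0x$ip - 0x$base ))"
}

# Two lines copied from the dmesg output above.
fault_offset '[ 247.968269] traps: clinfo[4442] trap divide error ip:7fd49f70a2b3 sp:7ffc92dfd400 error:0 in libigdrcl.so[7fd49f08e000+68e000]'
# prints: libigdrcl.so 0x67c2b3
fault_offset '[ 981.122019] traps: python3[6234] trap divide error ip:7f7716a668d3 sp:7ffd39358bb0 error:0 in libze_intel_gpu.so.1.3.25018.23[7f7716593000+4e4000]'
# prints: libze_intel_gpu.so.1.3.25018.23 0x4d38d3
```

Applied to the log above, both `clinfo` traps give `0x67c2b3` in `libigdrcl.so`, and all five `python3` and `xpu-smi` traps give `0x4d38d3` in `libze_intel_gpu.so`, so each library fails at one fixed location. The offset is relative to the reported mapping, not necessarily to the start of the file; turning it into a symbol would also need that mapping's file offset (for example from `/proc/<pid>/maps`) and debug symbols for the library.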
@dkalpakchi What does the kernel report for the i915 GPU driver (is it loaded successfully)?
sudo dmesg | grep i915
@eero-t Thanks for a swift reply! Here is the output of the command you suggested:
[ 3.904252] i915 0000:00:02.0: [drm] GT count: 1, enabled: 1
[ 3.904654] i915 0000:00:02.0: [drm] VT-d active for gfx access
[ 3.904659] fb0: switching to i915 from EFI VGA
[ 3.904761] i915 0000:00:02.0: vgaarb: deactivate vga console
[ 3.904820] i915 0000:00:02.0: [drm] Using Transparent Hugepages
[ 3.905288] i915 0000:00:02.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[ 3.906901] mei_hdcp 0000:00:16.0-b638ab7e-94e2-4ea2-a552-d1c54b627f04: bound 0000:00:02.0 (ops i915_hdcp_component_ops [i915])
[ 3.907225] i915 0000:00:02.0: [drm] Finished loading DMC firmware i915/tgl_dmc_ver2_12.bin (v2.12)
[ 3.913383] mei_pxp 0000:00:16.0-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1: bound 0000:00:02.0 (ops i915_pxp_tee_component_ops [i915])
[ 3.913466] i915 0000:00:02.0: [drm] Protected Xe Path (PXP) protected content support initialized
[ 3.941214] [drm] Initialized i915 1.6.0 20201103 for 0000:00:02.0 on minor 0
[ 3.975035] snd_hda_intel 0000:00:1f.3: bound 0000:00:02.0 (ops i915_audio_component_bind_ops [i915])
[ 3.975601] i915 0000:03:00.0: enabling device (0000 -> 0002)
[ 3.975755] i915 0000:03:00.0: [drm] GT count: 1, enabled: 1
[ 3.977079] i915 0000:03:00.0: [drm] VT-d active for gfx access
[ 3.977103] i915 0000:03:00.0: [drm] Using Transparent Hugepages
[ 3.977148] i915 0000:03:00.0: [drm] Local memory IO size: 0x00000003fa000000
[ 3.977149] i915 0000:03:00.0: [drm] Local memory available: 0x00000003fa000000
[ 4.003704] fbcon: i915drmfb (fb0) is primary device
[ 4.046327] i915 0000:03:00.0: [drm] Finished loading DMC firmware i915/dg2_dmc_ver2_07.bin (v2.7)
[ 4.052618] i915 0000:03:00.0: [drm] GuC firmware i915/dg2_guc_70.6.2.bin version 70.6.2
[ 4.052620] i915 0000:03:00.0: [drm] HuC firmware i915/dg2_huc_7.10.3_gsc.bin version 7.10.3
[ 4.062876] i915 0000:00:02.0: [drm] fb0: i915drmfb frame buffer device
[ 4.066770] i915 0000:03:00.0: [drm] GuC submission enabled
[ 4.066788] i915 0000:03:00.0: [drm] GuC SLPC enabled
[ 4.067109] i915 0000:03:00.0: [drm] GuC RC: enabled
[ 4.089113] [drm] Initialized i915 1.6.0 20201103 for 0000:03:00.0 on minor 1
[ 4.090379] snd_hda_intel 0000:04:00.0: bound 0000:03:00.0 (ops i915_audio_component_bind_ops [i915])
[ 4.090899] i915 0000:03:00.0: [drm] Cannot find any crtc or sizes
[ 4.090948] i915 0000:03:00.0: [drm] Cannot find any crtc or sizes
[ 4.090971] i915 0000:03:00.0: [drm] Cannot find any crtc or sizes
[ 4.101007] Creating 4 MTD partitions on "i915.spi.768":
[ 4.101017] 0x000000000000-0x000000001000 : "i915.spi.768.DESCRIPTOR"
[ 4.102032] 0x000000001000-0x0000005f0000 : "i915.spi.768.GSC"
[ 4.102799] 0x0000005f0000-0x0000007f0000 : "i915.spi.768.OptionROM"
[ 4.103605] 0x0000007f0000-0x000000800000 : "i915.spi.768.DAM"
[ 4.107650] mei_gsc i915.mei-gscfi.768: cl:host=01 me=33 fw disconnect request received
[ 4.107672] mei i915.mei-gscfi.768-e2c2afa2-3817-4d19-9d95-06b16b588a5d: Could not read FW version ret = -19
[ 4.107673] mei i915.mei-gscfi.768-e2c2afa2-3817-4d19-9d95-06b16b588a5d: FW version command failed -5
[ 4.514431] i915 0000:03:00.0: [drm] HuC authenticated
[ 4.514435] mei_pxp i915.mei-gsc.768-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1: bound 0000:03:00.0 (ops i915_pxp_tee_component_ops [i915])
I don't see any immediate signs that the driver has failed to load, but maybe `[ 4.107650] mei_gsc i915.mei-gscfi.768: cl:host=01 me=33 fw disconnect request received` is an indication that it didn't?
I think it's not a problem (at least related to this), and the traps could be something unrelated to GPU side.
What's your host CPU?
I'm using Intel Nuc 11 extreme kit with i7-11700B CPU
I'm using Intel Nuc 11 extreme kit with i7-11700B CPU
TGL with integrated GPU should be fine...
22.49.25018.23
I tried the following stack (which I built a few months ago):
With the latest `drm-tip` kernel (version `6.3.0-rc3`) on a TGL i7-11800H, and `clinfo` works fine.
=> The issue may be specific to your setup.
Things that you could try next...
Check what valgrind[1] reports:
$ sudo apt install valgrind
$ valgrind clinfo
For pytorch, and in case issue does not reproduce under valgrind, try GDB:
$ sudo apt install gdb
$ gdb clinfo
...
(gdb) run
...
(gdb) bt
[1] https://valgrind.org/docs/manual/manual-intro.html#manual-intro.overview
Thanks for your suggestions! I have now tried running valgrind and it returns `Integer divide by zero` in `intel-opencl/libigdrcl.so` (full logs: valgrind_2023_03_30.txt).
Currently I'm running the following setup:
All of the above are installed on Ubuntu 22.04 with 5.17.0-1020-oem kernel and the apt manager doesn't suggest that there are any updates to these packages.
Regarding 6.3.0-rc3 kernel, I haven't seen any mention of any newer kernel on the guide page (https://dgpu-docs.intel.com/installation-guides/ubuntu/ubuntu-jammy-arc.html), so should I compile that from scratch? Is it also Intel's oem kernel, or is it just a regular one?
I don't have SPIRV packages installed, because I don't know what they are for.
It might be that this issue is specific to the Level Zero version or to the version of the compute-runtime that I'm using, but as far as I understand, those versions seem to be recommended for Intel's extensions for PyTorch (https://dgpu-docs.intel.com/releases/stable_540_20221205.html)
Please let me know if you have any thoughts on how I could proceed further.
- intel-opencl-icd -- 22.49.25018.23+i554~22.04 (which I assume is the package where the compute-runtime comes from? but maybe I'm wrong?)
Yes, that's the OpenCL GPU backend. But compute-runtime project provides also level-zero-gpu backend.
Regarding 6.3.0-rc3 kernel, I haven't seen any mention of any newer kernel on the guide page (https://dgpu-docs.intel.com/installation-guides/ubuntu/ubuntu-jammy-arc.html), so should I compile that from scratch? Is it also Intel's oem kernel, or is it just a regular one?
DKMS GPU modules support (are API compatible with) only specific kernel versions.
Which ones, is documented here: https://github.com/intel-gpu/intel-gpu-i915-backports
(Latest kernel support comes through upstream.)
I don't have SPIRV packages installed, because I don't know what they are for.
It might be that this issue is specific to the Level Zero version or to the version of the compute-runtime that I'm using, but as far as I understand, those versions seem to be recommended for Intel's extensions for PyTorch (https://dgpu-docs.intel.com/releases/stable_540_20221205.html)
FYI: Strictly speaking, `clinfo` does not use Level Zero. While the OpenCL and Level Zero API backend implementations (coming from the IGC and compute-runtime projects) share a lot of code, they are separate stacks with their own frontend and backend libraries.
Thanks for your suggestions! I have now tried running valgrind and it returns `Integer divide by zero` in `intel-opencl/libigdrcl.so` (full logs: valgrind_2023_03_30.txt).
Dang, that backtrace does not list any symbols.
Please let me know if you have any thoughts on how I could proceed further.
Compute-runtime devs need to look into that now, I don't think I can do more for this (I'm not dev in this project)...
Floating point exception (core dumped) in OpenCL and Level Zero when using both iGPU and discrete ARC GPU.
How to reproduce? I have enabled both Intel integrated GPU (Intel UHD Graphics 770) and Intel ARC 750 GPU.
When running level zero example: https://github.com/oneapi-src/level-zero/blob/master/samples/zello_world/zello_world.cpp
The error from DMESG:
The same happens with OpenCL (e.g., running the typical `clinfo` program):
I am using:
Compute runtime: 22.43.24595.35
OS: Ubuntu 22.04.1 LTS
Kernel: 5.17.0-1019-oem