intel / compute-runtime

Intel® Graphics Compute Runtime for oneAPI Level Zero and OpenCL™ Driver
MIT License

NEO driver does not detect GPU when using kernel 6.8.x. #710

Closed ionutnechita-intel closed 5 months ago

ionutnechita-intel commented 7 months ago

The NEO driver does not detect the GPU when using kernel 6.8.x.

With kernels 6.5.x and 6.6.x, the GPU is present:

/opt/intel/oneapi/compiler/2024.0/bin/sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:cpu:1] Intel(R) OpenCL, 11th Gen Intel(R) Core(TM) i7-1185GRE @ 2.80GHz OpenCL 3.0 (Build 0) [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:acc:2] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:gpu:3] Intel(R) OpenCL Graphics, Intel(R) Iris(R) Xe Graphics OpenCL 3.0 NEO  [24.05.28454.6]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Iris(R) Xe Graphics 1.3 [1.3.28454]

On kernel 6.8.x, I get this (the GPU entries are missing):

/opt/intel/oneapi/compiler/2024.0/bin/sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:cpu:1] Intel(R) OpenCL, 11th Gen Intel(R) Core(TM) i7-1185GRE @ 2.80GHz OpenCL 3.0 (Build 0) [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:acc:2] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.12.0.12_195853.xmain-hotfix]

eero-t commented 7 months ago

I can reproduce this with the latest drm-tip 6.8.0-rc6 kernel, using either an earlier build (2024-02-09) of the compute-runtime master branch or earlier compute-runtime releases => neither clinfo nor zello_sysman recognizes the GPU. The vainfo / vpl-inspect media tools still recognize the GPU though, so it's a compute-stack-specific issue.

I do not see any difference in strace output (between old and new kernels) before compute-runtime decides to give up, so it's a bit of a mystery why it decides not to recognize the GPU.

ionutnechita-intel commented 7 months ago

Thank you for reproducing this.

On 6.7.x, the GPU is recognized. Only on 6.8.x is it not recognized.

eero-t commented 7 months ago

Yes, it works with 6.7 (drm-tip) kernel also for me, just not with 6.8 (i915 KMD).

EDIT: that was with the public Xe KMD repo, not drm-tip. With drm-tip, the issue already appears with an earlier kernel version (see below).

ionutnechita-intel commented 7 months ago

I tested with 6.8.0-rc1 (6.8.0-060800rc1-generic) and the issue reproduces there.

The issue may have appeared between 6.7 and 6.8.0-rc1.

I noticed several commits in 6.8.0-rc1 that add the new Intel Xe driver and fix eDP/DisplayPort.

I do not have time to bisect to find which commit(s) cause this behaviour.

eero-t commented 7 months ago

Dang. I was comparing "drm-tip" on TGL against the "xe-drm-next" kernel on DG1, but their i915 KMD code seems to progress at different rates, so I had to do a quick bisection using the already existing nightly "drm-tip" builds...

While things still work with the 6.7 version of the "xe-drm-next" kernel repo, with the "drm-tip" repo kernel, clinfo & zello_sysman actually broke earlier, somewhere between a couple of "drm-tip" repo upstream 6.6-rc7 kernel integration changes:

(Commits named like those, or the original commits are not any more in "drm-tip" repo, as it gets constantly rebased to upstream, so I cannot provide list of commits between them any more.)

JablonskiMateusz commented 7 months ago

Hi folks, we also observe the issue with the 6.8 kernel: i915 reports a different I915_CONTEXT_PARAM_GTT_SIZE. As a workaround, could you try running the application with these additional environment variables: NEOReadDebugKeys=1 OverrideGpuAddressSpace=48?

eero-t commented 7 months ago

we also observe the issue with the 6.8 kernel: i915 reports a different I915_CONTEXT_PARAM_GTT_SIZE.

Media and 3D drivers seem to work fine with that change, so why is it a problem for the L0/compute stack?

(I'm wondering whether this change should be reported to upstream as kernel stable ABI breakage...)

Looking at the compute-runtime code, it seems to affect SVM capability & address space size: https://github.com/intel/compute-runtime/blob/master/shared/source/os_interface/linux/product_helper_drm.cpp#L128

Compare with the Mesa code: https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/intel/vulkan/anv_device.c#L2300

eero-t commented 7 months ago

As a workaround, could you try running the application with these additional environment variables: NEOReadDebugKeys=1 OverrideGpuAddressSpace=48?

Yes, with those both clinfo & zello_sysman work just fine (on TGL-H iGPU).
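For reference, the workaround can also be applied to a single process instead of the whole shell session. A minimal Python sketch, assuming the program to launch is on PATH; the helper names `neo_workaround_env` and `run_with_workaround` are made up for illustration, and only the two variable names and values come from this thread:

```python
import os
import subprocess

# The two debug keys suggested in this thread; environment values must be strings.
NEO_WORKAROUND = {
    "NEOReadDebugKeys": "1",
    "OverrideGpuAddressSpace": "48",
}


def neo_workaround_env(base_env=None):
    """Return a copy of the environment with the workaround keys added."""
    env = dict(os.environ if base_env is None else base_env)
    env.update(NEO_WORKAROUND)
    return env


def run_with_workaround(cmd):
    """Run cmd (a list of arguments) with the workaround set for that process only."""
    return subprocess.run(cmd, env=neo_workaround_env(), check=False)
```

Calling `run_with_workaround(["clinfo"])` would then be roughly equivalent to prefixing the command with the two variables on the shell command line, without changing the parent environment.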

ionutnechita-intel commented 7 months ago

Hi @eero-t,

Using the latest drm-tip version with the variables set in the environment, the GPU appears.

# /opt/intel/oneapi/2024.0/bin/sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:cpu:1] Intel(R) OpenCL, 11th Gen Intel(R) Core(TM) i7-1185GRE @ 2.80GHz OpenCL 3.0 (Build 0) [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:acc:2] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.12.0.12_195853.xmain-hotfix]
# NEOReadDebugKeys=1 OverrideGpuAddressSpace=48 /opt/intel/oneapi/2024.0/bin/sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:gpu:1] Intel(R) OpenCL HD Graphics, Intel(R) Iris(R) Xe Graphics OpenCL 3.0 NEO  [23.13.026032]
[opencl:cpu:2] Intel(R) OpenCL, 11th Gen Intel(R) Core(TM) i7-1185GRE @ 2.80GHz OpenCL 3.0 (Build 0) [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:acc:3] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.12.0.12_195853.xmain-hotfix]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Iris(R) Xe Graphics 1.3 [1.3.26032]
# uname -a
Linux 6.8.0-rc6-lowlatency1 #1 SMP PREEMPT_DYNAMIC Fri Mar  1 09:38:45 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
# lscpu | grep "Model name"
Model name:                         11th Gen Intel(R) Core(TM) i7-1185GRE @ 2.80GHz

ionutnechita-intel commented 7 months ago

In this case, is the issue in the kernel or in the NEO driver/OpenCL?

eero-t commented 7 months ago

Well, it depends on whether the GTT size value returned by the KMD is considered part of the stable ABI, but I do not see how it could be, as there can be different reasons for those values to differ. I would think that NEO should accept / adapt to sensible GTT size values, potentially with a warning when the value differs from the expected one, instead of barfing out when it does not exactly match its expectations.
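To make the suggestion concrete, here is a minimal Python sketch of such a tolerant check. It is not compute-runtime code: the function name, the list of supported widths, and the policy of picking the smallest supported width that can hold the reported value are all assumptions chosen for illustration.

```python
import warnings

# Hypothetical set of address-space widths (in bits) a runtime might support.
SUPPORTED_BITS = (32, 47, 48)


def effective_address_bits(reported_max_address):
    """Map a reported highest GPU address to a supported width in bits.

    Instead of requiring the value to equal (1 << n) - 1 exactly, pick the
    smallest supported width that can hold it, and warn when the value is
    not the expected all-ones mask for that width.
    """
    for bits in SUPPORTED_BITS:
        full_mask = (1 << bits) - 1
        if reported_max_address <= full_mask:
            if reported_max_address != full_mask:
                warnings.warn(
                    f"address space {reported_max_address:#x} is not a full "
                    f"{bits}-bit mask; treating it as {bits}-bit"
                )
            return bits
    raise ValueError("reported address space is larger than any supported width")
```

With this policy, a value slightly below the full 48-bit mask is still accepted as a 48-bit address space, with a warning, rather than being rejected.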

eero-t commented 7 months ago

Tested 6.8.0-rc3 based Xe KMD, and compute/Sysman driver worked with that, so this issue seems to be i915 KMD specific (as expected).

obj-obj commented 7 months ago

I can reproduce this on Arch

Disty0 commented 6 months ago

I can reproduce this on Arch with Linux 6.8 release (6.8.1-arch1-1) using i915. Haven't tried xe yet.

Exporting these works fine:

export NEOReadDebugKeys=1
export OverrideGpuAddressSpace=48

ionutnechita-intel commented 6 months ago

In this case, will the NEO compute driver be adapted to work with the new behaviour?

DX37 commented 6 months ago

Encountered this issue also.

Mstrodl commented 6 months ago

On 6.8:

gpuAddressSpace = 281474976706559
= 111111111111111111111111111111111110111111111111

On 6.7:

gpuAddressSpace = 281474976710655
 = 111111111111111111111111111111111111111111111111

The issue seems to lie here: https://github.com/intel/compute-runtime/blob/03078541d7bcfdf2b669a07410e5a7bacf436c63/shared/source/memory_manager/gfx_partition.cpp#L250-L253
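The two values can be compared directly. A short Python sketch using only the numbers quoted above; the names are illustrative, and reading the 4096-byte gap as one reserved 4 KiB page is an inference from the arithmetic, not something stated in the linked code:

```python
# Values of gpuAddressSpace quoted above.
ADDR_SPACE_6_8 = 281474976706559
ADDR_SPACE_6_7 = 281474976710655


def max_n_bit_value(n):
    """Largest value representable in n bits (all n low bits set)."""
    return (1 << n) - 1


# On 6.7 the value is exactly the 48-bit all-ones mask.
print(ADDR_SPACE_6_7 == max_n_bit_value(48))   # True
# On 6.8 it is not, although it still needs 48 bits to represent.
print(ADDR_SPACE_6_8 == max_n_bit_value(48))   # False
print(ADDR_SPACE_6_8.bit_length())             # 48
# The two differ by 4096, i.e. a single cleared bit (bit 12).
diff = ADDR_SPACE_6_7 - ADDR_SPACE_6_8
print(diff, hex(ADDR_SPACE_6_7 ^ ADDR_SPACE_6_8))  # 4096 0x1000
```

So any check that only accepts exact all-ones masks would reject the 6.8 value even though it is only 4096 below the 48-bit maximum.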

eero-t commented 6 months ago

In this case, will the NEO compute driver be adapted to work with the new behaviour?

It seems that the change in the value reported by the GTT size ioctl() may be reverted in the i915 kernel driver: https://patchwork.freedesktop.org/series/131095/

(I.e. KMD would only internally use the "usable" GTT size value, and report full address space to user space, including the reserved parts, and distros using 6.8.0 kernel need to patch their kernels until upstream releases updated kernel.)

@JablonskiMateusz Maybe compute-runtime could do some BAT tests also with latest drm-tip kernel, to catch such changes before they are sent to upstream kernel? This change was in drm-tip repo i915 KMD already in 6.7...

nyanmisaka commented 6 months ago

Note that the upcoming Ubuntu 24.04 LTS uses the non-LTS 6.8 kernel. Hopefully it can be fixed before it's released next month. Otherwise OpenCL will not be available on many distros based on it.

ionutnechita-intel commented 6 months ago

Thanks

obj-obj commented 6 months ago

rusticl-mesa actually still works fine in my testing, even though intel-compute-runtime doesn't work at all

nyanmisaka commented 6 months ago

rusticl-mesa actually still works fine in my testing, even though intel-compute-runtime doesn't work at all

rusticl is still an experimental implementation and according to Mesa it is currently broken on Arc GPUs. My use case is video processing and only NEO supports zero-copy interop between VA-API and OpenCL through cl_intel_va_api_media_sharing.

TimoVerbrugghe commented 6 months ago

Just adding that I'm also experiencing this issue on NixOS when running the latest kernel (6.8.1). The GPU (Intel N100, Alder Lake) does not show up in clinfo.

However, on an N5105 machine (Jasper Lake), the GPU did get detected by clinfo on the latest kernel.

Downgrading to 6.7.10 on the N100 machine immediately resolved the issue.

JablonskiMateusz commented 6 months ago

Good news folks, we are going to adjust the logic on the UMD side so that we can accept the new GTT size reported by i915 ;)

ionutnechita-intel commented 6 months ago

This is good news.

JablonskiMateusz commented 6 months ago

could you retry with neo built with this commit https://github.com/intel/compute-runtime/commit/420e1391b228586efa8546db343e8e6eb50e398b?

joanbm commented 6 months ago

could you retry with neo built with this commit 420e139?

I applied this commit on top of the version currently shipped by Arch Linux (23.48.27912.11) and it fixed the problem with my i5-7200U iGPU, now clinfo is able to detect it and I could successfully run some admittedly simple OpenCL programs on Linux 6.8.2 (without any extra environment variables).

eero-t commented 6 months ago

I applied this commit on top of the version currently shipped by Arch Linux (23.48.27912.11) and it fixed the problem with my i5-7200U iGPU, now clinfo is able to detect it and I could successfully run some admittedly simple OpenCL programs on Linux 6.8.2 (without any extra environment variables).

FYI: @tjaalton Ubuntu 24.04 LTS also has a 6.8+ kernel, so its compute-runtime packages need this too.

ionutnechita-intel commented 5 months ago

Release: https://github.com/intel/compute-runtime/releases/tag/24.09.28717.12

Tested with: Ubuntu 24.04 Alpha. Linux Kernel 6.8.4-lowlatency. TGL: 11th Gen Intel(R) Core(TM) i7-1185GRE @ 2.80GHz

Disty0 commented 5 months ago

New 6.8.5, 6.8.6 and 6.6.27 LTS kernels are unable to run using the GPU. It detects and tries to run on the GPU but gets stuck with 100% single CPU core usage. Happens on any OpenCL or SYCL app. (Kernel 6.8 is using the workaround provided in this thread.)

You can downgrade to Linux 6.8.4 for Arch Linux with these packages:

linux 6.8.4: https://archive.archlinux.org/packages/l/linux/linux-6.8.4.arch1-1-x86_64.pkg.tar.zst

linux-headers 6.8.4: https://archive.archlinux.org/packages/l/linux-headers/linux-headers-6.8.4.arch1-1-x86_64.pkg.tar.zst

eero-t commented 5 months ago

New 6.8.5, 6.8.6 and 6.6.27 LTS kernels are unable to run using the GPU.

@Disty0 If the issue also happens with the 6.6 kernel, I do not think it is related to this issue => please file a separate one, and also report the compute-runtime version and where perf reports the CPU usage to happen (run as root):

# perf record -a
<wait a min or two>
^C
# perf report -n

eero-t commented 5 months ago

Release: https://github.com/intel/compute-runtime/releases/tag/24.09.28717.12

Um, its release notes mention it still needing the env var workaround?

A slightly newer tag includes the actual fix: https://github.com/intel/compute-runtime/compare/24.09.28717.12...24.09.28717.14

chao-camect commented 5 months ago

Release: https://github.com/intel/compute-runtime/releases/tag/24.09.28717.12

Um, its release notes mention it still needing the env var workaround?

Slightly new tag includes actual fix: 24.09.28717.12...24.09.28717.14

Right. I was trying to see why 24.09.28717.12 still didn't work for me and read your reply. Thanks. This saved me time.

tjaalton commented 5 months ago

I applied this commit on top of the version currently shipped by Arch Linux (23.48.27912.11) and it fixed the problem with my i5-7200U iGPU, now clinfo is able to detect it and I could successfully run some admittedly simple OpenCL programs on Linux 6.8.2 (without any extra environment variables).

FYI: @tjaalton Ubuntu 24.04 LTS also has a 6.8+ kernel, so its compute-runtime packages need this too.

uploaded the fix to noble, thanks for the ping

Disty0 commented 5 months ago

This issue seems to be fixed with aur/intel-compute-runtime-bin 24.13.29138.7-1 on my end. (Arch Linux 6.8.4)

JablonskiMateusz commented 5 months ago

Since the issue seems to be fixed, can we close it now?

ionutnechita-intel commented 5 months ago

Hello @JablonskiMateusz ,

I think this issue is fixed now.

It should be fine to close this ticket.

simonlui commented 5 months ago

@ionutnechita-intel Sorry, but this doesn't work inside an OCI container with podman for whatever reason. Not sure if it is also an issue with Docker but I would presume it would be a problem as well. You have to export the two environment variables NEOReadDebugKeys=1 and OverrideGpuAddressSpace=48 for the GPU to be seen inside the container but not on the host machine. I don't know if you want to consider it the same bug but if not, I can open a new bug report for this.

joanbm commented 5 months ago

@simonlui Are you sure that the version of the Intel Compute Runtime installed inside the container contains the fix? I can imagine your situation happening if this were not the case. For reference, my iGPU appears to be correctly detected by clinfo inside an Arch Linux-based container.

simonlui commented 5 months ago

@joanbm Yeah that was it. I was confused why I was hitting this in the oneapi-basekit Docker image but it was last updated a month ago at the time of writing this so it makes sense why it still had the issue without the updated version of the runtime inside the container.
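A quick way to rule this out is to compare the runtime version inside the container against the first tag reported in this thread to include the actual fix (24.09.28717.14). A minimal Python sketch, assuming version strings can be ordered as dotted integer tuples; the helper names are hypothetical, and a distro package that backported the commit onto an older version will be reported as unfixed by this check:

```python
def parse_version(text):
    """Split a dotted version such as '24.09.28717.14' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


# First tag reported in this thread to contain the actual fix.
FIRST_FIXED = parse_version("24.09.28717.14")


def has_gtt_fix(installed_version):
    """True if installed_version is at or after the first fixed tag."""
    return parse_version(installed_version) >= FIRST_FIXED


print(has_gtt_fix("24.09.28717.12"))  # False
print(has_gtt_fix("24.13.29138.7"))   # True
```

The version string itself has to come from the package manager or from tool output such as the bracketed version that sycl-ls prints for the GPU device.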

mattcurf commented 5 months ago

@JablonskiMateusz When will this fix be posted to the apt repo at https://repositories.intel.com/gpu/ubuntu?

ionutnechita-intel commented 4 months ago

Hi @simonlui,

I understand what you are saying, but it must be checked more thoroughly, with several OS variants as containers.

I tested it on Ubuntu 24.04, directly on the physical machine, with the latest update, and I didn't see the problem anymore.

simonlui commented 4 months ago

@ionutnechita-intel The problem was fixed, it was an outdated compute runtime package inside the oneapi-basekit Docker image which didn't have the updated runtime installed by default. Updating the package manually fixed the issue.

ionutnechita-intel commented 4 months ago

Hi @simonlui,

Thank you for the feedback.

Have a good day.

sumseq commented 4 weeks ago

I am having the same issue with Rocky Linux. Since I upgraded from 9.2 to 9.4, I can no longer see the Arc GPU in clinfo. I see my Arc 750 in "lspci" but not in clinfo, and I cannot run code on it. My username is part of the "render" group and I have the Red Hat 9.3 driver installed (the latest one I could find) along with the oneAPI HPC toolkit 2024.2.

If I use the two environment variables above, it works! (this is the first fix I have found).

Will this be fixed in the next driver release that supports RHEL 9.4?