intel / compute-runtime

Intel® Graphics Compute Runtime for oneAPI Level Zero and OpenCL™ Driver
MIT License

Arc GPU not showing up in clinfo nor sycl-ls on Linux #714

Closed: mkottman closed this issue 3 months ago

mkottman commented 3 months ago

Recently (I can't pinpoint exactly when in the past month) my "Intel(R) Arc(TM) A750 Graphics" card stopped showing up in the output of both clinfo and sycl-ls on my Arch Linux install. With the default intel-compute-runtime package, I get the following output:

$ clinfo
Number of platforms                               0

I installed the latest available intel-oneapi-basekit. Sourcing its environment script adds more platforms, but the GPU is still missing:

$ . /opt/intel/oneapi/2024.0/oneapi-vars.sh
$ clinfo -l
Platform #0: Intel(R) FPGA Emulation Platform for OpenCL(TM)
 `-- Device #0: Intel(R) FPGA Emulation Device
Platform #1: Intel(R) OpenCL
 `-- Device #0: Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
$ sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.10.0.17_160000]
[opencl:cpu:1] Intel(R) OpenCL, Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz OpenCL 3.0 (Build 0) [2023.16.10.0.17_160000]

When running strace clinfo I can see the card being evaluated (strace output), but it does not show up in the clinfo output.

Kernel I'm running:

$ uname -a
Linux hostname 6.8.1-arch1-1 #1 SMP PREEMPT_DYNAMIC Sat, 16 Mar 2024 17:15:35 +0000 x86_64 GNU/Linux

I ended up fetching this runtime, compiling it with debug info, and tracing through the OpenCL loader to see where the card is being "dropped". It turns out that it is dropped here: https://github.com/intel/compute-runtime/blob/e44ac2a0017434b2af6fdf5601d98975640e781e/shared/source/os_interface/linux/drm_memory_manager.cpp#L102. The value of gpuAddressSpace is 281474976645119 (0xfffffffeffff), which is 111111111111111111111111111111101111111111111111 in binary (notice the single 0, at bit 16?), and it does not match any of the branches in https://github.com/intel/compute-runtime/blob/master/shared/source/memory_manager/gfx_partition.cpp
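
The arithmetic behind that observation can be checked with a short Python sketch. It only uses the number quoted above. The helper names `cleared_bits` and `is_all_ones` are hypothetical and are not part of compute-runtime, and the idea that the driver expects a contiguous all-ones mask is an assumption based on the hardcoded 0xffffffffffff workaround, not on a reading of gfx_partition.cpp.

```python
# Value reported for gpuAddressSpace in this issue.
observed = 281474976645119

# A full 48-bit address space mask: 2**48 - 1, i.e. 0xffffffffffff.
full_48 = (1 << 48) - 1


def cleared_bits(value, width):
    """Return positions (0 = least significant) of zero bits within `width` bits."""
    return [bit for bit in range(width) if not (value >> bit) & 1]


def is_all_ones(value):
    """True if value has the form 2**n - 1 (a contiguous run of ones from bit 0)."""
    return value > 0 and (value & (value + 1)) == 0


print(hex(observed))               # 0xfffffffeffff
print(cleared_bits(observed, 48))  # [16]
print(full_48 - observed)          # 65536, i.e. 2**16
print(is_all_ones(observed), is_all_ones(full_48))  # False True
```

So the reported value differs from the full 48-bit mask by exactly one bit (bit 16), which is consistent with it failing an exact comparison against an all-ones mask.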

When I hardcode the value 0xffffffffffff as the address space (all ones) and use the modified libigdrcl.so as a new vendor in /etc/OpenCL/vendors, I can now see the card in clinfo and sycl-ls output and can successfully run OpenCL programs, like memtestCL:

$ clinfo -l
Platform #0: Intel(R) FPGA Emulation Platform for OpenCL(TM)
 `-- Device #0: Intel(R) FPGA Emulation Device
Platform #1: Intel(R) OpenCL
 `-- Device #0: Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
Platform #2: Intel(R) OpenCL Graphics
 `-- Device #0: Intel(R) Arc(TM) A750 Graphics
$ sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.10.0.17_160000]
[opencl:cpu:1] Intel(R) OpenCL, Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz OpenCL 3.0 (Build 0) [2023.16.10.0.17_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A750 Graphics OpenCL 3.0 NEO  [24.13.0]
$ memtestCL 7700 10
10 iterations over 7700 MiB of memory on device Intel(R) Arc(TM) A750 Graphics
...
Final error count: 0 errors

Why would gpuAddressSpace end up with an "incorrect" value? Could it mean a hardware failure? I don't see any issues with the card, e.g. when running memtest or playing games.

JablonskiMateusz commented 3 months ago

Hi @mkottman, this looks like a duplicate of #710.

mkottman commented 3 months ago

Indeed, it looks like this is the case, and the workaround from https://github.com/intel/compute-runtime/issues/710#issuecomment-2002646557 works for me!