intel / compute-runtime

Intel® Graphics Compute Runtime for oneAPI Level Zero and OpenCL™ Driver
MIT License

GPU does not show up as OpenCL device when logged in over SSH, unless you login locally #701

Open ProjectPhysX opened 8 months ago

ProjectPhysX commented 8 months ago

On a fresh Ubuntu Server 23.04 installation (kernel 6.5), after installing NEO and rebooting, the GPU (Arc A770) does not show up as an OpenCL device when I access the machine remotely over SSH. Only once I log in locally at the PC does the GPU show up as an OpenCL device, immediately, both locally and in the remote terminal.

JablonskiMateusz commented 8 months ago

Hi @ProjectPhysX, could you run the command strace -o strace.log clinfo and share the resulting strace.log file?
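
Once such a log exists, the lines of interest are failed opens of DRM device nodes. A minimal sketch of a filter, assuming the log was produced by the strace command above; dri_open_failures is a made-up helper name, not part of NEO or strace:

```shell
# Print every failed open()/openat() of a /dev/dri node found in an strace log.
# dri_open_failures is a hypothetical helper name used for illustration only.
dri_open_failures() {
    # strace prints failed syscalls as "... = -1 ERRNO (description)".
    grep -E 'open(at)?\(.*"/dev/dri/[^"]*".*= -1 E' "${1:-strace.log}"
}

# Usage: dri_open_failures strace.log
```

If the function prints nothing, no open of a /dev/dri node failed in that run, and the cause probably lies elsewhere.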

ProjectPhysX commented 8 months ago

Hi @JablonskiMateusz,

here is strace-before-local-login.log, and visible devices are:

| Device ID    0 | NVIDIA TITAN Xp                                            |
| Device ID    1 | 13th Gen Intel(R) Core(TM) i7-13700K                       |
| Device ID    2 | Intel(R) FPGA Emulation Device                             |

After logging in locally on the PC, here is strace-after-local-login.log, and visible devices are:

| Device ID    0 | Intel(R) Arc(TM) A770 Graphics                             |
| Device ID    1 | Intel(R) UHD Graphics 770                                  |
| Device ID    2 | NVIDIA TITAN Xp                                            |
| Device ID    3 | 13th Gen Intel(R) Core(TM) i7-13700K                       |
| Device ID    4 | Intel(R) FPGA Emulation Device                             |

Kind regards, Moritz

JablonskiMateusz commented 8 months ago

@ProjectPhysX judging from the logs, in the first one you don't have permission to open the GPU device file:

openat(AT_FDCWD, "/dev/dri/by-path/pci-0000:00:02.0-render", O_RDWR|O_CLOEXEC) = -1 EACCES (Permission denied)

Please ensure that the user you are using is a member of the render group.
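
The same permission problem can also be checked directly, without strace. A small sketch, assuming the render nodes live under /dev/dri (the directory seen in the log line above); check_render_access is a hypothetical helper name, and the optional directory argument exists only so it can be tried on a scratch directory:

```shell
# Report, for each render node, whether the current user can open it read-write.
# check_render_access is a hypothetical helper name used for illustration only.
check_render_access() {
    dir="${1:-/dev/dri}"
    found=0
    for node in "$dir"/renderD*; do
        [ -e "$node" ] || continue      # glob did not match anything
        found=1
        if [ -r "$node" ] && [ -w "$node" ]; then
            echo "$node: OK"
        else
            echo "$node: no read/write access for $(id -un)"
        fi
    done
    [ "$found" -eq 1 ] || echo "no render nodes found in $dir"
}

# Usage: check_render_access
# Owning group and mode of the nodes: ls -l /dev/dri/renderD*
```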

ProjectPhysX commented 8 months ago

Hi @JablonskiMateusz,

thanks a lot for the help! An additional sudo usermod -a -G render $(whoami) fixes the issue. Please make the installation fix the file permissions or automatically put the user in the render group, and/or include this line in the installation instructions.

Kind regards, Moritz
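
A sketch for confirming that the group change took effect. Note that usermod only updates the group database; sessions that are already running keep their old group list, so a fresh login is needed before the new group applies there. user_in_group is a hypothetical helper name:

```shell
# Succeed if the given user is listed as a member of the given group
# in the group database (primary or supplementary group).
# user_in_group is a hypothetical helper name used for illustration only.
user_in_group() {
    id -nG "$1" | tr ' ' '\n' | grep -qx -- "$2"
}

# Usage:
#   sudo usermod -a -G render "$(whoami)"
#   user_in_group "$(whoami)" render && echo "in render group"
#   id -nG    # groups of the *current session*; changes only after re-login
```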

bashbaug commented 8 months ago

@JablonskiMateusz, out of curiosity why does logging in locally "fix" this issue?

JablonskiMateusz commented 8 months ago

@ProjectPhysX

In our readme we have the following line:

To allow NEO access to GPU device make sure user has permissions to files /dev/dri/renderD*.

btw.

out of curiosity why does logging in locally "fix" this issue?

@ProjectPhysX when you logged in locally, was it the same user as when you logged in over SSH?

ProjectPhysX commented 8 months ago

@JablonskiMateusz yes, same user. The local login alone triggers the GPU to become visible as an OpenCL device. Why can't the installation set the user access rights? If you miss this detail, the devices silently fail to show up, with no error message; that's not user-friendly.

eero-t commented 7 months ago

thanks a lot for the help! An additional sudo usermod -a -G render $(whoami) fixes the issue. Please make the installation fix the file permissions or automatically put the user in the render group,

It's (definitely) not the responsibility of the driver (package) to do things like that.

and/or include this line in the installation instructions.

Yes, that's a good idea. In which documents do you think this should be mentioned?

@JablonskiMateusz yes, same user. The local login alone triggers the GPU to become visible as OpenCL device.

As to what happens when you do a graphical login locally... Your GUI session manager grants the authenticated user (temporary) access to the display device. Otherwise the user's GUI would not work that well (it would fall back to CPU rendering, or even fail).
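
On many systemd-based distributions this kind of temporary access takes the form of a POSIX ACL entry added to the device node for the user of the active local session; that mechanism is an assumption here, not something confirmed in this thread. A sketch for spotting such an entry, assuming getfacl (from the acl package) is installed and using /dev/dri/renderD128 purely as an example path; has_user_acl is a hypothetical helper:

```shell
# Read `getfacl` output on stdin; succeed if it contains a named-user ACL
# entry ("user:NAME:perms") for the given user. The owner entry ("user::perms")
# has an empty name field and is therefore not counted.
# has_user_acl is a hypothetical helper name used for illustration only.
has_user_acl() {
    awk -F: -v u="$1" '$1 == "user" && $2 == u { found = 1 } END { exit !found }'
}

# Usage (device path is an example; pick one from /dev/dri/renderD*):
#   getfacl /dev/dri/renderD128 2>/dev/null | has_user_acl "$(id -un)" \
#       && echo "per-user ACL present" || echo "no per-user ACL"
```

Running this over SSH before and after a local login would show whether an ACL is what makes the difference on a given system.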

ProjectPhysX commented 7 months ago

Yes, that's a good idea. In which documents do you think this should be mentioned?

Here in the Readme and in the "Installation procedure" in release notes would be good. Thanks!

eero-t commented 7 months ago

An additional sudo usermod -a -G render $(whoami) fixes the issue.

Older distro versions (e.g. of Ubuntu) do not have a render group, so it's better to use the Intel device's group ID directly.

In case the host also has non-Intel DRM devices (with different group IDs), the Intel GPU device file names can be obtained with the following: grep -l 0x8086 /sys/class/drm/renderD*/device/vendor | cut -d/ -f 5

And the group ID of the first one with: stat --format %g /dev/dri/$(grep -l 0x8086 /sys/class/drm/renderD*/device/vendor | cut -d/ -f 5 | head -1)
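
The two commands above can be combined into a loop that prints every Intel render node together with its owning group ID. A sketch only; intel_render_nodes is a hypothetical name, and the optional sysfs root argument is there just to make the function easy to try against a fake directory tree:

```shell
# List render nodes whose PCI vendor ID is Intel (0x8086).
# intel_render_nodes is a hypothetical helper name used for illustration only.
intel_render_nodes() {
    sysroot="${1:-/sys/class/drm}"
    for vendor in "$sysroot"/renderD*/device/vendor; do
        [ -r "$vendor" ] || continue
        if grep -q 0x8086 "$vendor"; then
            # .../renderD128/device/vendor -> renderD128
            basename "$(dirname "$(dirname "$vendor")")"
        fi
    done
}

# Usage: print "node gid" for every Intel render node
#   for n in $(intel_render_nodes); do
#       echo "$n $(stat --format %g "/dev/dri/$n")"
#   done
```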

Yes, that's a good idea. In which documents do you think this should be mentioned?

Here in the Readme and in the "Installation procedure" in release notes would be good. Thanks!

Thanks! @JablonskiMateusz ?

sumseq commented 4 weeks ago

I am having a similar issue after upgrading from Rocky 9.2 to Rocky 9.4. I see my Arc 750 in "lspci" but not in clinfo, and I cannot run code on it. My username is part of the "render" group and I have the Red Hat 9.3 driver installed along with the oneAPI HPC Toolkit 2024.2. Any ideas?

eero-t commented 4 weeks ago

@sumseq I'm not familiar with Rocky, but maybe your kernel and user-space driver do not match anymore after the update? See https://github.com/intel/compute-runtime/issues/710.

sumseq commented 4 weeks ago

@sumseq I'm not familiar with Rocky, but maybe your kernel and user-space driver do not match anymore after the update? See #710.

Thanks for the reference! The environment variables they say to set in that post make it work! For reference:

export NEOReadDebugKeys=1
export OverrideGpuAddressSpace=48
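
To avoid changing the whole shell session, the same two variables can be set for a single command only. A sketch; with_neo_workaround is a hypothetical wrapper name, while the variable names and values are exactly the ones quoted above:

```shell
# Run one command with the two NEO debug variables from this thread set,
# leaving the environment of the calling shell untouched.
# with_neo_workaround is a hypothetical helper name used for illustration only.
with_neo_workaround() {
    env NEOReadDebugKeys=1 OverrideGpuAddressSpace=48 "$@"
}

# Usage: with_neo_workaround clinfo
```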