intel / intel-device-plugins-for-kubernetes

Collection of Intel device plugins for Kubernetes
Apache License 2.0

Pod cannot detect Arc A380 GPU #1798

Closed SeanOMik closed 1 month ago

SeanOMik commented 1 month ago

Describe the support request I'm using k3s on an Ubuntu host. I installed the Intel device plugins through Helm on the cluster. The install looks good; all pods are up:

> k -n intel-gpu get po
NAME                                                     READY   STATUS    RESTARTS       AGE
inteldeviceplugins-controller-manager-64dff9d644-kc5vw   2/2     Running   1 (136m ago)   9h
intel-gpu-plugin-intel-gpu-plugin-n2lm5                  1/1     Running   0              123m

The issue I'm running into is that when I try to give a GPU to a pod and use it, the pod can't detect it. I tried following the verification steps in the docs, which didn't work. This is the output of the intelgpu-demo pod (the one that runs clinfo):

> k logs -f intelgpu-demo-job-jnhdz
Number of platforms                               0

I also tried to give the GPU to jellyfin, but jellyfin fails to start ffmpeg for encoding. I exec'd into the jellyfin pod and was able to see the card (screenshot attached). I was also able to see the same result in the demo pod by changing its command to sleep indefinitely and exec'ing into it.

On the node I'm able to run clinfo and it outputs a bunch of stuff. I can use intel_gpu_top to see the card's usage without any issue. I was able to give the card to a Docker container on this same node in the past and it worked great; not sure why it's not working in Kubernetes. Any help is appreciated!
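For reference, the demo job requests the GPU via the plugin's extended resource, roughly like this (a minimal sketch; the names and image are placeholders, gpu.intel.com/i915 is the resource the GPU plugin advertises):

apiVersion: batch/v1
kind: Job
metadata:
  name: intelgpu-demo-job          # placeholder name
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: clinfo
        image: <image-with-clinfo> # placeholder; the docs provide a demo image
        command: ["clinfo"]
        resources:
          limits:
            gpu.intel.com/i915: 1  # GPU resource exposed by the Intel GPU plugin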


tkatila commented 1 month ago

Hi @SeanOMik, weird. I've seen issues with access rights where the render device cannot be accessed. This can be fixed with a securityContext addition:

      containers:
      - name: test
        securityContext:
          runAsGroup: 109 # <-- this should match the host's render (or video) group gid

But render accessibility shouldn't cause issues with the demo Pod. You could add strace to the container and see which access fails: strace -e openat -f clinfo
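For example (a rough sketch; assuming the demo image is Debian/Ubuntu based, the pod name is a placeholder, and the container has network access for apt):

# from the cluster (pod name is a placeholder):
kubectl exec -it intelgpu-demo-job-xxxxx -- bash
# inside the container:
apt-get update && apt-get install -y strace
strace -e openat -f clinfo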

SeanOMik commented 1 month ago

Hi @SeanOMik, weird. I've seen issues with access rights where the render device cannot be accessed. This can be fixed with a securityContext addition:

      containers:
      - name: test
        securityContext:
          runAsGroup: 109 # <-- this should match the host's render (or video) group gid

@tkatila I tried that fix, but it didn't work. My host is Ubuntu, so the render group is named render; its gid on my host is 993. That didn't work though. When I exec'd into the pod I got the warning groups: cannot find name for group ID 993; I tried to run clinfo anyway, but it didn't show the GPU. I tried gid 109 like in the example above, which got the same result, just with a different gid in the warning. I also found a group on the host called video, with a gid of 44. I tried that and didn't get the warning, but clinfo still didn't show the GPU.
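(For anyone checking the same thing, the host-side gids can be confirmed with something like the commands below; /dev/dri/renderD128 is an assumption, adjust to your render node.)

getent group render video
stat -c '%G %g' /dev/dri/renderD128
ls -ln /dev/dri/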

But render accessibility shouldn't cause issues with the demo Pod. You could add strace to the container and see which access fails: strace -e openat -f clinfo

Here's the result of running that in the container:

root@intelgpu-demo-job-bsrkh:/# strace -e openat -f clinfo
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libOpenCL.so.1", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libdl.so.2", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/etc/OpenCL/vendors", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3
openat(AT_FDCWD, "/etc/OpenCL/vendors/intel.icd", O_RDONLY) = 4
openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/intel-opencl/libigdrcl.so", O_RDONLY|O_CLOEXEC) = 4
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 4
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libigdgmm.so.12", O_RDONLY|O_CLOEXEC) = 4
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libstdc++.so.6", O_RDONLY|O_CLOEXEC) = 4
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libm.so.6", O_RDONLY|O_CLOEXEC) = 4
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libgcc_s.so.1", O_RDONLY|O_CLOEXEC) = 4
openat(AT_FDCWD, "igdrcl.config", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "neo.config", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/dev/dri/by-path", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 4
openat(AT_FDCWD, "/dev/dri/by-path/pci-0000:06:10.0-render", O_RDWR|O_CLOEXEC) = 4
openat(AT_FDCWD, "/sys/bus/pci/devices/0000:06:10.0/drm", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 5
openat(AT_FDCWD, "/sys/bus/pci/devices/0000:06:10.0/drm/card0/prelim_uapi_version", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/sys/bus/pci/devices/0000:06:10.0/drm", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 5
openat(AT_FDCWD, "/sys/bus/pci/devices/0000:06:10.0/drm/card0/gt_max_freq_mhz", O_RDONLY) = 5
Number of platforms                               0
+++ exited with 0 +++
SeanOMik commented 1 month ago

Something I forgot to mention is that the k3s host, which is Ubuntu, is running as a VM in Proxmox. The GPU is passed through fine; the Docker container was able to use it. I, of course, blacklisted the drivers on the VM host to pass it through. Just to make sure, I don't need to blacklist the drivers on the Kubernetes node as well, right?
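(For reference, the driver binding inside the guest can be checked with something like the commands below; only the Proxmox host blacklists the driver, the guest still needs i915 loaded for the plugin and workloads.)

lspci -nnk | grep -A3 -i 'vga\|display'
lsmod | grep i915
ls /dev/dri/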

tkatila commented 1 month ago

@tkatila I tried that fix, but it didn't work. My host is Ubuntu, so the render group is named render; its gid on my host is 993. That didn't work though. When I exec'd into the pod I got the warning groups: cannot find name for group ID 993; I tried to run clinfo anyway, but it didn't show the GPU. I tried gid 109 like in the example above, which got the same result, just with a different gid in the warning. I also found a group on the host called video, with a gid of 44. I tried that and didn't get the warning, but clinfo still didn't show the GPU.

The warning is fine afaik. As long as the gid matches the one on the host, it should be ok.

Here's the result of running that in the container:

Thanks. There's nothing wrong in that trace.

You mentioned in the original post that you can access and use the GPU in Docker. What container is used in that case?

What you could also do is run strace -f clinfo on the host and within the Kubernetes Pod, and see where the execution diverges. I think this is an issue with the libraries in the container not supporting the kernel and/or card.
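For example (a sketch; the file names are arbitrary):

# on the host
strace -f -o host-strace-clinfo.txt clinfo
# inside the pod
strace -f -o pod-strace-clinfo.txt clinfo
# then compare the two traces
diff host-strace-clinfo.txt pod-strace-clinfo.txt | less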

tkatila commented 1 month ago

I tested a similar scenario but with a different GPU (integrated Tiger Lake): a VM host with 24.04 + 6.8.0-39-generic and k3s. With that, the OpenCL demo Pod works ok: clinfo provides device details.

SeanOMik commented 1 month ago

So I actually noticed that clinfo -l on the host wasn't seeing the GPU, only rusticl and Clover. I fixed it by following these Intel docs for installing the drivers (and some other things), and now it does show up:

> sudo clinfo -l
Platform #0: Intel(R) OpenCL Graphics
 `-- Device #0: Intel(R) Arc(TM) A380 Graphics
Platform #1: rusticl
Platform #2: Clover

However, even after a system reboot, the pod does not see the card:

> k exec -it intelgpu-demo-job-xrhsh -- clinfo
Number of platforms                               0

You mentioned in the original post that you can access and use the GPU in Docker. What container is used in that case?

It was the official plex container: plexinc/pms-docker.

What you could also do is run strace -f clinfo on the host and within the Kubernetes Pod, and see where the execution diverges. I think this is an issue with the libraries in the container not supporting the kernel and/or card.

I compared the output of strace from the pod and the host and I don't see any errors in the pod one, but I also don't really know what I'm looking for. The host trace is a lot longer and has some errors about missing /dev/dri/renderD### files, but I don't think that means anything. pod-strace-clinfo.txt host-strace-clinfo.txt

I tested a similar scenario but with a different GPU (integrated Tiger Lake): a VM host with 24.04 + 6.8.0-39-generic and k3s. With that, the OpenCL demo Pod works ok: clinfo provides device details.

Hmm... I've talked with some people on Discord who have gotten this working with integrated GPUs as well, but my setup differs slightly in that I have a dedicated GPU, an Arc A380.

tkatila commented 1 month ago

It was the official plex container: plexinc/pms-docker.

Thanks. Sadly the container is built from a prebuilt binary, so it's hard to see which components it includes.

I compared the output of strace from the pod and the host and I don't see any errors in the pod one, but I also don't really know what I'm looking for. The host trace is a lot longer and has some errors about missing /dev/dri/renderD### files, but I don't think that means anything.

The communication with the GPU stops after some ioctls, whereas on the host it continues. My hunch is that something in the user-space libraries doesn't like the A380 hardware and won't use it. I don't have access to an A380 at the moment.
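One way to narrow that down is to compare the user-space driver stacks, e.g. list the relevant packages on the host and in the container and compare versions (a sketch, assuming both are Debian/Ubuntu based):

# run on the host and inside the pod, then compare the versions
dpkg -l | grep -E 'intel-opencl|libigdgmm|libigc|level-zero'
uname -r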

SeanOMik commented 1 month ago

The communication with the GPU stops after some ioctls, whereas on the host it continues. My hunch is that something in the user-space libraries doesn't like the A380 hardware and won't use it. I don't have access to an A380 at the moment.

Hm, okay... Well, I'll just go back to using the Docker container for Plex. I don't know enough about GPUs and drivers to help much, sorry about that. Thanks for your time though!

SeanOMik commented 1 month ago

I tried running intel_gpu_top in the pod and got this output:

root@intelgpu-demo-job-bkdck:/# intel_gpu_top
Failed to initialize PMU! (Permission denied)

Not sure if that gives any more information. This is when I run the pod as the render group (993 on my Ubuntu host).
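(For context, running as the render group here means a securityContext roughly like the one below; 993 is the render gid on my host, and pod-level supplementalGroups is the alternative to runAsGroup.)

    spec:
      securityContext:
        supplementalGroups: [993]   # add the host's render gid as a supplementary group
      containers:
      - name: clinfo
        securityContext:
          runAsGroup: 993           # render gid on my Ubuntu host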

tkatila commented 1 month ago

I'll try to reproduce the scenario with a 750 card I have access to. But it might take some time so don't hold your breath.

eero-t commented 1 month ago

I tried running intel_gpu_top in the pod and got this output:

root@intelgpu-demo-job-bkdck:/# intel_gpu_top
Failed to initialize PMU! (Permission denied)

I think the kernel requires the root user and the PERFMON capability for processes accessing PMU (perf) metrics.

Those are not required for normal GPU (write) access though, just a user or group matching the GPU device file.

SeanOMik commented 1 month ago

I think the kernel requires the root user and the PERFMON capability for processes accessing PMU (perf) metrics.

Those are not required for normal GPU (write) access though, just a user or group matching the GPU device file.

You're right about that. I added privileged: true to the securityContext of the pod, and intel_gpu_top works! However, clinfo still does not work. I would also like to avoid running the pod as privileged.

Here's a strace of clinfo on the privileged pod: pod-privileged-strace-clinfo.txt

eero-t commented 1 month ago

You're right about that. I added privileged: true to the securityContext of the pod, and intel_gpu_top works!

You don't need privileged mode. Access to the GPU device file (the i915 resource), root (0) user and the PERFMON capability should be enough for i-g-t, i.e. all other capabilities can be dropped.

(If you're using an ancient Docker version, you may need to use the SYS_ADMIN capability instead of PERFMON.)
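A sketch of a container securityContext along those lines (capability and user as described above; everything else dropped):

        securityContext:
          runAsUser: 0              # root, needed for the PMU metrics
          capabilities:
            drop: ["ALL"]
            add: ["PERFMON"]        # SYS_ADMIN instead on very old runtimes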

However, clinfo still does not work. I would also like to avoid running the pod as privileged.

You do not need privileged mode or any capabilities for normal GPU usage. Elevated privileges are needed only for some of the metrics (power & perf) used by i-g-t.

Here's a strace of clinfo on the privileged pod: pod-privileged-strace-clinfo.txt

Try the same driver version in your pod as you have on the host. I think the driver version in the pod is not compatible with your kernel version, possibly due to: https://github.com/intel/compute-runtime/issues/710

SeanOMik commented 1 month ago

Try the same driver version in your pod as you have on the host. I think the driver version in the pod is not compatible with your kernel version, possibly due to: intel/compute-runtime#710

It seems it was the kernel version I was on! After reading that issue you sent, I noticed I was on the same kernel version, 6.8.0. I updated my system (apt-get upgrade), which upgraded the kernel from 6.8.0-38-generic to 6.8.0-41-generic. It upgraded a lot of other packages too, which may have included the Intel drivers.

I upgraded the demo pod image to match the Ubuntu version I'm on, 24.04. At first the demo pod still wasn't recognizing any devices, but the output was a tiny bit longer, so I went through the steps listed in the docs for installing the drivers in the pod. After I followed those steps, clinfo listed the GPU!
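For anyone hitting the same thing, the in-container driver install boils down to something like the snippet below (a rough sketch based on the Intel client GPU docs for Ubuntu 24.04; the repository URL and the "noble client" suite are assumptions, so verify them against the current docs):

apt-get update && apt-get install -y wget gpg
wget -qO- https://repositories.intel.com/gpu/intel-graphics.key | \
  gpg --dearmor -o /usr/share/keyrings/intel-graphics.gpg
# suite name below is an assumption; check the Intel docs for your Ubuntu release
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/gpu/ubuntu noble client" \
  > /etc/apt/sources.list.d/intel-gpu.list
apt-get update && apt-get install -y intel-opencl-icd clinfo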

I attached the GPU to the jellyfin pod, enabled hardware transcoding, and it worked! Thanks for the help!!

tkatila commented 1 month ago

Great to hear that, @SeanOMik! Thanks @eero-t for pointing to the issue!