intel / intel-device-plugins-for-kubernetes

Collection of Intel device plugins for Kubernetes
Apache License 2.0

Requesting Intel ARC GPU Resource #1907

Closed DizzieNight closed 1 week ago

DizzieNight commented 1 week ago

Describe the bug: I am trying to add my Arc GPU to my Jellyfin pod. I have NFD installed and it is correctly labelling my node with the Intel Arc A310, with the following labels:

nfd.node.kubernetes.io/feature-labels=gpu.intel.com/device-id.0300-56a6.count gpu.intel.com/device-id.0300-56a6.present gpu.intel.com/device-id.0380-1912.count gpu.intel.com/device-id.0380-1912.present gpu.intel.com/family,intel.feature.node.kubernetes.io/gpu

I put the following into my Jellyfin deployment:

resources:
  requests:
    gpu.intel.com/i915: "1"
  limits:
    gpu.intel.com/i915: "1"

but it still won't find a node with a GPU. It keeps coming up with the following error:

0/8 nodes are available: 1 node(s) were unschedulable, 3 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }, 4 Insufficient gpu.intel.com/i915. preemption: 0/8 nodes are available: 4 No preemption victims found for incoming pod, 4 Preemption is not helpful for scheduling.

To Reproduce: Install the Arc GPU, install NFD, and request i915.

Expected behavior: The Jellyfin pod should be scheduled on worker 4, which has the Arc A310 GPU.

tkatila commented 1 week ago

Hi @DizzieNight, did you also deploy the Intel GPU device plugin to the cluster? NFD alone is not sufficient.

DizzieNight commented 1 week ago

I just checked and it doesn't seem so, actually. I get this error when trying to install it using the Helm chart:

Helm install failed for release node-feature-discovery/intel-gpu-plugin with chart intel-device-plugins-gpu@0.31.1: unable to build kubernetes objects from release manifest: resource mapping not found for name: "gpudeviceplugin" namespace: "" from "": no matches for kind "GpuDevicePlugin" in version "deviceplugin.intel.com/v1" ensure CRDs are installed first

I couldn't find where to install the CRDs though. Any thoughts?

tkatila commented 1 week ago

Helm install builds on the operator. Please see the steps here: https://github.com/intel/intel-device-plugins-for-kubernetes/blob/main/INSTALL.md#install-with-helm-charts

You can also install gpu-plugin via kubectl: https://github.com/intel/intel-device-plugins-for-kubernetes/blob/main/cmd/gpu_plugin/README.md#install-with-nfd
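
For readers hitting the same "ensure CRDs are installed first" error, here is a rough sketch of the install order the Helm route implies: the operator chart goes in first (it is what provides the GpuDevicePlugin kind), then the GPU plugin chart. The chart name intel-device-plugins-gpu comes from the error above; the repo URL, the operator chart name, the release names and the default namespace are assumptions from memory, and the linked INSTALL.md remains the authoritative list of steps and prerequisites.

```shell
# Sketch only: install the operator chart before the GPU plugin chart.
# Assumptions (not taken from this thread): the "intel" repo URL, the
# operator chart name, the release names "dp-operator" and "gpu", and the
# default namespace. Check INSTALL.md for prerequisites before running.
install_intel_gpu_plugin() {
  ns="${1:-inteldeviceplugins-system}"  # target namespace (assumed default)

  helm repo add intel https://intel.github.io/helm-charts/ &&
  helm repo update &&
  # Operator first: it carries the deviceplugin.intel.com CRDs.
  helm install dp-operator intel/intel-device-plugins-operator \
    --namespace "$ns" --create-namespace &&
  # GPU plugin second: its manifest contains a GpuDevicePlugin object.
  helm install gpu intel/intel-device-plugins-gpu \
    --namespace "$ns" --create-namespace
}
```

Calling install_intel_gpu_plugin with no argument uses the assumed default namespace; pass a namespace name as the first argument to override it.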

DizzieNight commented 1 week ago

I have installed everything and my node is getting the labels, but NFD shows a warning when installing:

W1115 12:36:57.853014 1148310 warnings.go:70] would violate PodSecurity "restricted:latest": restricted volume types (volumes "host-boot", "host-os-release", "host-sys", "host-usr-lib", "host-lib", "host-proc-swaps", "source-d", "features-d" use restricted volume type "hostPath")

I am not sure how to fix it.

DizzieNight commented 1 week ago

Never mind, I set the namespace to privileged using the following commands:

kubectl label namespace node-feature-discovery pod-security.kubernetes.io/enforce=privileged
kubectl label namespace node-feature-discovery pod-security.kubernetes.io/audit=privileged
kubectl label namespace node-feature-discovery pod-security.kubernetes.io/warn=privileged

Jellyfin still won't pick a node, though. The node I want Jellyfin to run on has these labels, but I don't see i915 anywhere among them. Does Arc use i915 as well, or do I have to set it to something else?

image

tkatila commented 1 week ago

NFD seems to be working fine.

Can you check a few things:
1) Is the GPU device plugin running on the node? Check the pods for that specific node.
2) Describe the target node and see if it has the "gpu.intel.com/i915" resource.
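
A minimal sketch of those two checks with kubectl, wrapped in small helper functions. The function names are made up for illustration, and the node name you pass in is your own (for example, whatever your "worker 4" node is actually called).

```shell
# Hypothetical helpers; the function names are illustrative only.

# List every pod scheduled on the given node, so you can see whether a
# GPU device plugin pod is running there.
pods_on_node() {
  kubectl get pods --all-namespaces -o wide \
    --field-selector "spec.nodeName=$1"
}

# Show the lines of the node description that mention the i915 resource.
# No output means the node does not advertise gpu.intel.com/i915.
i915_on_node() {
  kubectl describe node "$1" | grep -F 'gpu.intel.com/i915'
}
```

Run pods_on_node <node-name> first. If no GPU plugin pod shows up, i915_on_node <node-name> will most likely print nothing either, which matches the "Insufficient gpu.intel.com/i915" scheduling error.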

DizzieNight commented 1 week ago

I did just notice the plugin isn't being created and this is what I get: image

Not really sure how to fix it though

mythi commented 1 week ago

> Not really sure how to fix it though

This looks to be the same pod security admission error you saw with NFD. The fix should be to label the operator/plugin namespace with the privileged PSA settings.
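
As a sketch, the same three labels used for the NFD namespace earlier in the thread can be applied to whichever namespace holds the operator/plugin. The helper name is made up, and the namespace is deliberately left as an argument because the thread does not say which one was used.

```shell
# Hypothetical helper: set all three Pod Security Admission modes to
# "privileged" on the namespace passed as the first argument.
label_namespace_privileged() {
  for mode in enforce audit warn; do
    # --overwrite replaces an existing value instead of failing.
    kubectl label --overwrite namespace "$1" \
      "pod-security.kubernetes.io/$mode=privileged" || return 1
  done
}
```

Usage would be label_namespace_privileged <plugin-namespace>. Keep in mind that this relaxes pod security for everything in that namespace, so it is best kept to a namespace dedicated to the plugin.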

DizzieNight commented 1 week ago

Yep that worked, it's attaching now. Thank you for your help

mythi commented 1 week ago

> Yep that worked, it's attaching now. Thank you for your help

I created #1909 to make the experience a bit smoother.