intel / intel-device-plugins-for-kubernetes

Collection of Intel device plugins for Kubernetes

GPU: i915 resources are not being added to nodes on Talos #1826

Closed djryanj closed 2 months ago

djryanj commented 2 months ago

Describe the bug: On a Talos node running on an Intel NUC12WSKv5 with an Intel i5-1250P processor, the gpu.intel.com/i915 resources are not added to the node. An identical node running k3s shows the resources.

As a result, pods requesting the gpu.intel.com/i915 resource cannot be scheduled unless those requests are removed, and the operator therefore doesn't automatically add the volume mounts for /dev/dri to the pod.
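For illustration, this is roughly the kind of request that stays unschedulable (pod name and image are placeholders, not my actual workload):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-test                 # placeholder name
spec:
  containers:
    - name: app
      image: busybox             # placeholder image
      command: ["sh", "-c", "ls -l /dev/dri && sleep 3600"]
      resources:
        limits:
          gpu.intel.com/i915: 1  # stays Pending because the Talos node advertises no i915 capacity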

Note that adding the volume mounts manually does give correct access to the GPU.
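Roughly what I mean by adding the mounts manually (again a sketch with placeholder names; depending on the setup a permissive securityContext may also be needed):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-test-manual          # placeholder name
spec:
  containers:
    - name: app
      image: busybox             # placeholder image
      command: ["sh", "-c", "ls -l /dev/dri && sleep 3600"]
      volumeMounts:
        - name: dri
          mountPath: /dev/dri    # mount the host's DRI devices directly
  volumes:
    - name: dri
      hostPath:
        path: /dev/dri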

The gpu.intel.com/device-id labels are present and identical on both nodes, but the resources are not added.

To Reproduce

Expected behavior: Resources should be added to the node correctly.

Screenshots

From Talos:

kubectl describe node talos-nuc-2                                           
Name:               talos-nuc-2
Roles:              worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    gpu.intel.com/device-id.0300-46a6.count=1
                    gpu.intel.com/device-id.0300-46a6.present=true
                    intel.feature.node.kubernetes.io/gpu=true
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=talos-nuc-2
                    kubernetes.io/os=linux
                    kubernetes.io/role=worker
                    node-type=worker
Annotations:        csi.volume.kubernetes.io/nodeid: {"driver.longhorn.io":"talos-nuc-2"}
                    nfd.node.kubernetes.io/feature-labels:
                      gpu.intel.com/device-id.0300-46a6.count,gpu.intel.com/device-id.0300-46a6.present,intel.feature.node.kubernetes.io/gpu
                    node.alpha.kubernetes.io/ttl: 0
                    talos.dev/owned-labels: ["intel.feature.node.kubernetes.io/gpu"]
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Fri, 28 Jun 2024 11:47:28 -0600
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  talos-nuc-2
  AcquireTime:     <unset>
  RenewTime:       Mon, 02 Sep 2024 08:28:49 -0600
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message    
  ----                 ------  -----------------                 ------------------                ------                       -------    
  NetworkUnavailable   False   Fri, 28 Jun 2024 11:47:47 -0600   Fri, 28 Jun 2024 11:47:47 -0600   CiliumIsUp                   Cilium is running on this node
  MemoryPressure       False   Mon, 02 Sep 2024 08:25:39 -0600   Fri, 26 Jul 2024 16:42:03 -0600   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Mon, 02 Sep 2024 08:25:39 -0600   Fri, 26 Jul 2024 16:42:03 -0600   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Mon, 02 Sep 2024 08:25:39 -0600   Fri, 26 Jul 2024 16:42:03 -0600   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Mon, 02 Sep 2024 08:25:39 -0600   Fri, 26 Jul 2024 16:42:03 -0600   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:  <snip>
  Hostname:    talos-nuc-2
Capacity:
  cpu:                16
  ephemeral-storage:  486916820Ki
  hugepages-2Mi:      0
  memory:             15914900Ki
  pods:               110
Allocatable:
  cpu:                15950m
  ephemeral-storage:  448474105114
  hugepages-2Mi:      0
  memory:             15615892Ki
  pods:               110
System Info:
  Machine ID:                     <snip>
  System UUID:                    <snip>
  Boot ID:                        <snip>
  Kernel Version:                 6.6.32-talos
  OS Image:                       Talos (v1.7.4)
  Operating System:               linux
  Architecture:                   amd64
  Container Runtime Version:      containerd://1.7.16
  Kubelet Version:                v1.30.0
  Kube-Proxy Version:             v1.30.0
PodCIDR:                          10.244.1.0/24
PodCIDRs:                         10.244.1.0/24
Non-terminated Pods:              (42 in total)
  Namespace                       Name                                                    CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                       ----                                                    ------------  ----------  ---------------  -------------  ---
<snip>
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests     Limits
  --------           --------     ------
  cpu                2994m (18%)  550m (3%)
  memory             804Mi (5%)   1458Mi (9%)
  ephemeral-storage  0 (0%)       0 (0%)
  hugepages-2Mi      0 (0%)       0 (0%)
Events:              <none>

From k3s:

kubectl describe node k3s-nuc-1
Name:               k3s-nuc-1
Roles:              worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=k3s
                    beta.kubernetes.io/os=linux
                    gpu.intel.com/device-id.0300-46a6.count=1
                    gpu.intel.com/device-id.0300-46a6.present=true
                    intel.feature.node.kubernetes.io/gpu=true
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=k3s-nuc-1
                    kubernetes.io/os=linux
                    kubernetes.io/role=worker
                    node-type=worker
                    node.kubernetes.io/instance-type=k3s
Annotations:        alpha.kubernetes.io/provided-node-ip: 192.168.10.24
                    csi.volume.kubernetes.io/nodeid: {"driver.longhorn.io":"k3s-nuc-1"}
                    flannel.alpha.coreos.com/backend-data: {"VNI":1,"VtepMAC":"e2:ab:8e:c8:a7:cd"}
                    flannel.alpha.coreos.com/backend-type: vxlan
                    flannel.alpha.coreos.com/kube-subnet-manager: true
                    flannel.alpha.coreos.com/public-ip: 192.168.10.24
                    k3s.io/hostname: k3s-nuc-1
                    k3s.io/internal-ip: 192.168.10.24
                    k3s.io/node-args: ["agent"]
                    k3s.io/node-config-hash: G56ZUNSCSWM2KTHE3DXXTXMKUXSYZOCXIXSUFXFJBASZOPDWFEVA====
                    k3s.io/node-env:
                      {"K3S_DATA_DIR":"/var/lib/rancher/k3s/data/28f7e87eba734b7f7731dc900e2c84e0e98ce869f3dcf57f65dc7bbb80e12e56","K3S_TOKEN":"********","K3S_U...
                    nfd.node.kubernetes.io/feature-labels:
                      gpu.intel.com/device-id.0300-46a6.count,gpu.intel.com/device-id.0300-46a6.present,intel.feature.node.kubernetes.io/gpu
                    nfd.node.kubernetes.io/master.version: v0.14.1
                    nfd.node.kubernetes.io/worker.version: v0.14.1
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Thu, 30 Mar 2023 15:37:01 -0600
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  k3s-nuc-1
  AcquireTime:     <unset>
  RenewTime:       Mon, 02 Sep 2024 08:41:10 -0600
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Mon, 02 Sep 2024 08:39:26 -0600   Fri, 17 May 2024 17:04:11 -0600   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Mon, 02 Sep 2024 08:39:26 -0600   Fri, 17 May 2024 17:04:11 -0600   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Mon, 02 Sep 2024 08:39:26 -0600   Fri, 17 May 2024 17:04:11 -0600   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Mon, 02 Sep 2024 08:39:26 -0600   Fri, 17 May 2024 17:04:11 -0600   KubeletReady                 kubelet is posting ready status. AppArmor enabled
Addresses:
  InternalIP:  192.168.10.24
  Hostname:    k3s-nuc-1
Capacity:
  cpu:                            16
  ephemeral-storage:              476445624Ki
  gpu.intel.com/i915:             10
  gpu.intel.com/i915_monitoring:  1
  hugepages-1Gi:                  0
  hugepages-2Mi:                  0
  memory:                         15920832Ki
  pods:                           110
Allocatable:
  cpu:                            16
  ephemeral-storage:              463486302664
  gpu.intel.com/i915:             10
  gpu.intel.com/i915_monitoring:  1
  hugepages-1Gi:                  0
  hugepages-2Mi:                  0
  memory:                         15920832Ki
  pods:                           110
System Info:
  Machine ID:                 <snip>
  System UUID:                <snip>
  Boot ID:                    <snip>
  Kernel Version:             5.15.0-107-generic
  OS Image:                   Ubuntu 22.04.4 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.7.11-k3s2
  Kubelet Version:            v1.28.5+k3s1
  Kube-Proxy Version:         v1.28.5+k3s1
PodCIDR:                      10.42.3.0/24
PodCIDRs:                     10.42.3.0/24
ProviderID:                   k3s://k3s-nuc-1
Non-terminated Pods:          (44 in total)
  Namespace                   Name                                                    CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                                    ------------  ----------  ---------------  -------------  ---
<snip>
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                       Requests      Limits
  --------                       --------      ------
  cpu                            8510m (53%)   5400m (33%)
  memory                         2684Mi (17%)  9928Mi (63%)
  ephemeral-storage              0 (0%)        0 (0%)
  hugepages-1Gi                  0 (0%)        0 (0%)
  hugepages-2Mi                  0 (0%)        0 (0%)
  gpu.intel.com/i915             1             1
  gpu.intel.com/i915_monitoring  0             0
Events:                          <none>

System (please complete the following information):

Additional context: I realize that directly comparing the two clusters may not be entirely fair, but regardless, the resource information does not show up correctly on Talos.

djryanj commented 2 months ago

Turns out this is not an issue. I neglected to add the pod-security.kubernetes.io/enforce: privileged label to the inteldeviceplugins-system namespace, which is required for the device plugin pods to work correctly.
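For anyone else who runs into this, the fix was just the missing Pod Security Admission label on the plugin namespace, e.g. (sketch):

apiVersion: v1
kind: Namespace
metadata:
  name: inteldeviceplugins-system
  labels:
    pod-security.kubernetes.io/enforce: privileged   # lets the device plugin pods run

The same label can also be applied to an existing namespace with kubectl label.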

Closing.

mythi commented 2 months ago

I neglected to add pod-security.kubernetes.io/enforce: privileged to the inteldeviceplugins-system namespace as required for device plugin pods to work correctly

We have an issue open to document this properly so that users don't run into this type of issue. Perhaps the default deployment could add this automatically...