replicatedhq / troubleshoot

Preflight Checks and Support Bundles Framework for Kubernetes Applications
https://troubleshoot.sh
Apache License 2.0

Feature: add GPU capabilities to `nodeResources` analyzer #1162

Open adamancini opened 1 year ago

adamancini commented 1 year ago

Describe the rationale for the suggested feature.

It would be good to support preflights that need to check for GPU scheduling capability. Off-hand, I don't know whether this is visible in node metadata, but it could perhaps be detected from the containerd configuration. This might require a new collector, or modifications to the nodeResources collector, to detect whether a node is capable of scheduling GPUs and to report capacity/allocation similar to CPU, Memory, and Disk.
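
For reference, once a vendor device plugin is installed, GPUs already surface in node status as extended resources alongside CPU and memory, which is roughly the shape a gpuCapacity field could be derived from. A minimal sketch of that node status, assuming the NVIDIA device plugin and its nvidia.com/gpu resource name (the resource name is vendor-specific):

```yaml
# Excerpt of a node's status when a GPU device plugin has registered.
# The extended resource name (nvidia.com/gpu here) varies by vendor and
# only appears after the plugin is installed on the node.
status:
  capacity:
    cpu: "8"
    memory: 16231060Ki
    nvidia.com/gpu: "1"
  allocatable:
    cpu: "8"
    memory: 16128660Ki
    nvidia.com/gpu: "1"
```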

Describe the feature

I'm not sure exactly which fields would be required, or whether Allocatable makes sense, but at a minimum something like:

gpuCapacity - the number of GPUs available on a node

so you can write expressions like

- nodeResources:
        checkName: Total GPU Cores in the cluster is 4 or greater
        outcomes:
          - fail:
              when: "sum(gpuCapacity) < 4"
              message: The cluster must contain at least 4 GPUs
          - pass:
              message: There are at least 4 GPUs
adamancini commented 1 year ago

Thinking through this a little bit, there are a few places we could try to detect GPU support:

  1. containerd configuration
  2. nvidia-smi output
  3. node metadata
  4. run a no-op pod requesting GPUs and wait for successful exit

2: This can at least tell us whether a GPU is installed, but not whether Kubernetes is configured to schedule it.
3: I don't know whether the information we need is exposed in node metadata; this requires research.
1, 4: I think these are the best options, since they are the closest to a functional test confirming that GPU workloads can be scheduled.
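
As a rough illustration of option 4, here is a minimal sketch of a no-op pod that only schedules if the cluster can allocate a GPU and then exits immediately. The extended resource name (nvidia.com/gpu) is an assumption; it is vendor-specific, so the user or vendor would likely need to supply it:

```yaml
# Hypothetical preflight pod: requests one GPU, runs a no-op command, exits.
# If it reaches Succeeded, GPU scheduling works; if it stays Pending, it doesn't.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-preflight-check
spec:
  restartPolicy: Never
  containers:
    - name: noop
      image: busybox
      command: ["true"]
      resources:
        limits:
          nvidia.com/gpu: 1
```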

diamonwiggins commented 1 year ago

Adding some thoughts from a discussion in Slack: on the node metadata angle, we may be able to determine from containerRuntimeVersion when the nvidia-container-runtime for containerd is being used. Not sure if that will be robust enough, though; I imagine it could work for most cases.

from my local env:

    nodeInfo:
      architecture: amd64
      bootID: 81e20091-22da-4866-bfe4-a980057a1adf
      containerRuntimeVersion: containerd://1.5.9-k3s1
      kernelVersion: 5.15.49-linuxkit
      .....
chris-sanders commented 1 year ago

Just chiming in on the number-of-GPUs question. I think this is going to be implementation-specific, and I don't know if we can measure it. I know the Intel GPU plugin can be configured to allow sharing GPUs or not, so the question isn't just how many GPUs are present but whether they are all fully scheduled.

I think we're going to have to be specific about the GPU drivers and providers to make any real attempt at this. Creating a pod seems like the most universal method, but it's going to require the user to define that pod. Again, using the Intel GPU driver there is no containerd configuration to review, and the tracking of the resources is via a resource request that requires the GPU driver be listed explicitly.

Here's an example:

  resources:
    limits:
      gpu.intel.com/i915: 1

Example of a node with the Intel GPU plugin. This node has both a Coral TPU and an Intel GPU available. It's not configured to allow GPU sharing, so I'm not sure whether the allocatable number would change if that were enabled. You'll notice containerd has no special configuration. The Coral TPU doesn't show up as a resource; it's just identified via a label from node-feature-discovery. It is a USB device, but I don't think that changes if it's an integrated device.

Name:               todoroki
Roles:              control-plane
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    feature.node.kubernetes.io/coral-tpu=true
                    feature.node.kubernetes.io/intel-gpu=true
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=todoroki
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/control-plane=true
                    node.k0sproject.io/role=control-plane
Annotations:        csi.volume.kubernetes.io/nodeid: {"smb.csi.k8s.io":"todoroki"}
                    nfd.node.kubernetes.io/extended-resources:
                    nfd.node.kubernetes.io/feature-labels: coral-tpu,intel-gpu
                    nfd.node.kubernetes.io/master.version: v0.13.0
                    nfd.node.kubernetes.io/worker.version: v0.13.0
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
...
Addresses:
  InternalIP: 
  Hostname:    todoroki
Capacity:
  cpu:                 8
  ephemeral-storage:   489580536Ki
  gpu.intel.com/i915:  1
  hugepages-1Gi:       0
  hugepages-2Mi:       0
  memory:              16231060Ki
  pods:                110
Allocatable:
  cpu:                 8
  ephemeral-storage:   451197421231
  gpu.intel.com/i915:  1
  hugepages-1Gi:       0
  hugepages-2Mi:       0
  memory:              16128660Ki
  pods:                110
System Info:
  Machine ID:                
  System UUID:                
  Boot ID:                 
  Kernel Version:             5.4.0-137-generic
  OS Image:                   Ubuntu 20.04.5 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.6.18
  Kubelet Version:            v1.26.2+k0s
  Kube-Proxy Version:         v1.26.2+k0s
...
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource            Requests     Limits
  --------            --------     ------
  cpu                 1630m (20%)  2 (25%)
  memory              1072Mi (6%)  1936Mi (12%)
  ephemeral-storage   0 (0%)       0 (0%)
  hugepages-1Gi       0 (0%)       0 (0%)
  hugepages-2Mi       0 (0%)       0 (0%)
  gpu.intel.com/i915  1            1
DexterYan commented 1 year ago

If those nodes are running in the cloud, we can use instance metadata to get GPU information. AWS, for example, has:

elastic-gpus/associations/elastic-gpu-id

However, for on-premise clusters, I think we may need to introduce a kURL add-on to install the different GPU device plugins. It would have to be pre-defined in the kURL installer.

chris-sanders commented 1 year ago

> However, for on-premise clusters, I think we may need to introduce a kURL add-on to install the different GPU device plugins. It would have to be pre-defined in the kURL installer.

I'm not sure what this part is referring to. This is about troubleshoot detecting the presence of GPUs, not about kURL installing drivers; that's out of scope for troubleshoot. How the drivers or GPU get set up is only relevant here as it pertains to detection. As long as troubleshoot has a way to detect a GPU, we don't particularly need to care how it got installed.

diamonwiggins commented 1 year ago

After digging into this more and talking with some customers, I think @chris-sanders has landed on what will be the best approach here. We'd essentially have one or more collectors that can do feature discovery similar to the projects below, and then let an analyzer analyze the configuration collected. See:

https://github.com/kubernetes-sigs/node-feature-discovery
https://github.com/NVIDIA/gpu-feature-discovery

Edit: with that being said, I'm not sure whether we should start capturing this in a separate issue, since I'm not sure what I'm describing makes sense in the nodeResources analyzer 🤔
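
For the node-feature-discovery angle specifically, here is a rough sketch of what might already be expressible today, assuming the nodeResources analyzer's documented selector filter and count() expression apply to NFD labels the same way they do to other node labels:

```yaml
# Sketch only: counts nodes carrying the NFD intel-gpu label from the node
# example above; the label key/value are taken from that node's metadata.
- nodeResources:
    checkName: At least one node advertises an Intel GPU via NFD labels
    filters:
      selector:
        matchLabel:
          feature.node.kubernetes.io/intel-gpu: "true"
    outcomes:
      - fail:
          when: "count() < 1"
          message: No nodes with the intel-gpu feature label were found
      - pass:
          message: Found at least one node labeled with an Intel GPU
```

This only proves the label is present, not that the device-plugin resource is actually schedulable, so it would complement rather than replace a functional no-op pod check.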

xavpaice commented 2 months ago

https://app.shortcut.com/replicated/story/106618/in-cluster-collector-gpu-inventory