kubernetes-sigs / karpenter

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
Apache License 2.0

Support Nvidia GPU Feature Discovery #1219

Open p53 opened 1 month ago

p53 commented 1 month ago

Description

Original Title: Ignore node selector labels for provisioning

What problem are you trying to solve?

We run the NVIDIA GPU Operator, which installs the NVIDIA container runtime and related components on Karpenter nodes after they are provisioned. The operator then runs GPU feature discovery and applies the appropriate nvidia.com labels to the node, and we need to place pods onto these nodes based on those labels. The problem is that when I put nvidia labels in a pod's nodeSelector that are not declared in the NodePool (because they are only applied at node runtime by the NVIDIA operator), Karpenter fails to provision a node. A possible solution would be an annotation on the pod, e.g. karpenter.sh/ignore-label=somelabel, telling Karpenter to ignore that label during provisioning.
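To illustrate, here is a minimal sketch of the failing case and the proposed annotation. The karpenter.sh/ignore-label annotation is the hypothetical mechanism suggested above, not an existing Karpenter API, and the label values are illustrative:

apiVersion: v1
kind: Pod
metadata:
  name: cuda-workload
  annotations:
    # Hypothetical: tell Karpenter to skip this nodeSelector key during provisioning,
    # because GPU feature discovery only applies it after the node is running.
    karpenter.sh/ignore-label: nvidia.com/gpu.product
spec:
  nodeSelector:
    # Applied at runtime by the NVIDIA operator, not declared in any NodePool,
    # so Karpenter currently fails to provision a node for this pod.
    nvidia.com/gpu.product: NVIDIA-A10G
  containers:
  - name: app
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    resources:
      limits:
        nvidia.com/gpu: "1"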

How important is this feature to you?

jonathan-innis commented 1 month ago

operator runs feature discovery and applies appropriate nvidia labels

What kind of feature discovery are you talking about here? Is it stuff related to the properties of the instance type that we are launching?

Bryce-Soghigian commented 1 month ago

https://github.com/NVIDIA/gpu-feature-discovery?tab=readme-ov-file#deploy-nvidia-gpu-feature-discovery-gfd

gfd adds labels after the nodes have already been created.

$ kubectl get nodes -o yaml
apiVersion: v1
items:
- apiVersion: v1
  kind: Node
  metadata:
    ...

    labels:
      nvidia.com/cuda.driver.major: "455"
      nvidia.com/cuda.driver.minor: "06"
      nvidia.com/cuda.driver.rev: ""
      nvidia.com/cuda.runtime.major: "11"
      nvidia.com/cuda.runtime.minor: "1"
      nvidia.com/gpu.compute.major: "8"
      nvidia.com/gpu.compute.minor: "0"
      nvidia.com/gfd.timestamp: "1594644571"
      nvidia.com/gpu.count: "1"
      nvidia.com/gpu.family: ampere
      nvidia.com/gpu.machine: NVIDIA DGX-2H
      nvidia.com/gpu.memory: "39538"
      nvidia.com/gpu.product: A100-SXM4-40GB
      ...
...
Bryce-Soghigian commented 1 month ago

Basically, you are requesting that when a workload requires a node with those labels, we create a node with those labels; but the NodePool is not aware of these labels, so Karpenter won't be aware of them either. They aren't added until GFD adds them, i.e. after the GPU nodes are provisioned?

How can Karpenter know these traits? This seems related to per-instance-type overrides: if you know particular instance types will have particular traits, we could use a ConfigMap override to say that these instance types have these values.

Do these values differ from node to node? The CUDA runtime seems to depend on the GPU drivers installed on the node, so we can't just cache the values directly.

p53 commented 1 month ago

They aren't added until GFD adds them, i.e. after the GPU nodes are provisioned?

Yes, that's right.

If you know particular instance types will have particular traits, we could use a ConfigMap override.

I don't know precisely how Karpenter works internally. It is probably possible to know these labels, or at least some of them, ahead of time and configure them statically, but ideally we would not need to define them statically in config.

Do these values differ from node to node?

We have, for example, all AWS g5 instances in one NodePool, so the values will certainly differ per instance type, depending on each instance type's GPU. Having a separate NodePool per instance type would be quite impractical (see the sketch below).
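For context, the static workaround being described would look roughly like this: one NodePool per instance type, with the GFD labels hard-coded. This is a minimal sketch, assuming the Karpenter v1 NodePool API and the AWS EC2NodeClass; the label values are illustrative copies of what GFD would otherwise discover at runtime:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: g5-xlarge
spec:
  template:
    metadata:
      labels:
        # Hard-coded per instance type; must be kept in sync with the actual GPU/driver.
        nvidia.com/gpu.product: NVIDIA-A10G
        nvidia.com/gpu.count: "1"
        nvidia.com/gpu.memory: "24576"
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
      - key: node.kubernetes.io/instance-type
        operator: In
        values: ["g5.xlarge"]

Multiplying this across every GPU instance size and driver combination is the maintenance burden referred to above.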

p53 commented 1 month ago

DRA (https://github.com/kubernetes-sigs/karpenter/issues/1231) would probably solve the "knowing in advance" part, since third-party drivers would publish NodeResourceSlices once running on the cluster. I'm not sure about its flexibility, though: we are still assuming something is present ahead of time, and it is constrained to resources only.

p53 commented 1 month ago

Also, Node Feature Discovery adds labels to nodes in the same way, e.g. for CPU capabilities.
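For example, NFD publishes node labels of roughly this shape (illustrative values; exact feature names depend on the NFD version and the hardware):

labels:
  feature.node.kubernetes.io/cpu-cpuid.AVX2: "true"
  feature.node.kubernetes.io/cpu-cpuid.AVX512F: "true"
  feature.node.kubernetes.io/kernel-version.major: "5"
  feature.node.kubernetes.io/pci-0300_10de.present: "true"  # PCI class 0300 (display), vendor 10de (NVIDIA)

Like the GFD labels, these only appear after the node is up, so the same provisioning problem applies.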

jonathan-innis commented 1 month ago

best would be if we would not need to define them in config statically

I think the ideal state here is defining what the different configurations for the GPU feature discovery operator can be, and then seeing if we can surface first-class support for them in Karpenter directly.

Like you mentioned, having to statically configure all of these values would be a huge pain; ideally, Karpenter could auto-discover them by matching its logic to what NVIDIA tells us should be present on these instance types.


I'm wondering if it makes sense to retitle this issue to be more specific to the use-case. Something like: "Support Nvidia GPU Feature Discovery". @p53 What do you think?

jonathan-innis commented 1 month ago

/triage accepted

p53 commented 1 month ago

@jonathan-innis renamed