nerc-project / operations

Issues related to the operation of the NERC OpenShift environment

Prevent general workloads from scheduling on the GPU nodes? #495

Open larsks opened 3 months ago

larsks commented 3 months ago

Motivation

We don't currently have any restrictions in place to prevent non-GPU workloads from getting scheduled on the GPU nodes. Do we want to introduce such a restriction? That is, should the GPU nodes be for only GPU workloads?

Completion Criteria

Non-GPU workloads will never be scheduled onto GPU node resources; these resources are reserved for GPU workloads.

Description

Completion dates

Desired - 2024-04-17
Required - 2024-05-08

joachimweyl commented 3 months ago

@jtriley, @Milstein, and @aabaris what are your thoughts on this?

joachimweyl commented 3 months ago

@joachimweyl update description

joachimweyl commented 2 months ago

Decision made: yes, we need to prevent workloads that are not requesting a GPU from scheduling on the GPU nodes.

joachimweyl commented 2 months ago

@dystewart what is the status of this issue?

larsks commented 1 month ago

In talking with @naved001 about this today, it looks like we probably want to make use of taints and tolerations.

In the "Example use cases" section of the taints and tolerations documentation they specifically call out creating dedicated GPU nodes.

They suggest enabling the ExtendedResourceToleration admission controller, which is disabled by default. I think that to enable this in OpenShift we will need to edit the cluster kubeapiserver resource to add the appropriate command-line option.
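As a rough sketch (the exact override layout is an assumption and would need to be verified, and unsupportedConfigOverrides is, as the name says, unsupported), something along these lines on the cluster kubeapiserver resource might do it:

# Hypothetical: enable the ExtendedResourceToleration admission plugin via an
# unsupported override on the kubeapiserver operator resource. The field
# layout under unsupportedConfigOverrides is an assumption, not verified.
apiVersion: operator.openshift.io/v1
kind: KubeAPIServer
metadata:
  name: cluster
spec:
  unsupportedConfigOverrides:
    apiServerArguments:
      enable-admission-plugins:
      - ExtendedResourceToleration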

dystewart commented 1 month ago

I worked a bit with admission controllers for OPE this winter; I'll look at implementing and testing the above.

dystewart commented 3 weeks ago

Cordoned and drained the wrk-10 node on nerc-ocp-test with:

oc adm cordon wrk-10
oc adm drain wrk-10 --ignore-daemonsets --delete-emptydir-data

Looking at making wrk-10 a dedicated GPU node as suggested in: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/#example-use-cases

dystewart commented 3 weeks ago

Tainted the wrk-10 node:

kubectl taint nodes wrk-10 nvidia.com/gpu.product=NVIDIA-A100-SXM4-40GB-MIG-1g.5gb:NoSchedule

Uncordoned the wrk-10 node:

oc adm uncordon wrk-10

Attempt to schedule gpu workload onto wrk-10 without toleration:

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvidia/samples:vectoradd-cuda11.2.1"
    resources:
      limits:
        nvidia.com/gpu: 1
  nodeSelector:
    nvidia.com/gpu.product: A100-SXM4-40GB-MIG-1g.5gb

Fails as expected:

  Warning  FailedScheduling  26s   default-scheduler  0/14 nodes are available: 1 node(s) had untolerated taint {nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB-MIG-1g.5gb}, 10 Insufficient nvidia.com/mig-1g.5gb, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/14 nodes are available: 10 No preemption victims found for incoming pod, 4 Preemption is not helpful for scheduling.

Attempting to deploy same workload with taint toleration:

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvidia/samples:vectoradd-cuda11.2.1"
    resources:
      limits:
        nvidia.com/gpu: 1
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB-MIG-1g.5gb
  tolerations:
  - key: "nvidia.com/gpu.product"
    operator: "Equal"
    value: "NVIDIA-A100-SXM4-40GB-MIG-1g.5gb"
    effect: "NoSchedule"

And the deployment succeeds!

So this looks to be the way we should taint the nodes:

kubectl taint nodes wrk-10 nvidia.com/gpu.product=NVIDIA-A100-SXM4-40GB-MIG-1g.5gb:NoSchedule

To be clear, the only way to schedule workloads on a node tainted in this way is to explicitly add the toleration to the workload. Workloads without the appropriate toleration will be scheduled elsewhere.
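For reference, the taint can be verified with standard commands, e.g.:

# Show the Taints: line from the node description
oc describe node wrk-10 | grep Taints

# Or print the taint objects directly
oc get node wrk-10 -o jsonpath='{.spec.taints}'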

joachimweyl commented 2 weeks ago

How do we manage the toleration? Is there a simple process by which we can automatically give it to all containers that list a GPU in their resources?

dystewart commented 2 weeks ago

@joachimweyl that is where the admission controller comes in. When a user requests a GPU, their resource is intercepted before creation and patched by the mutating webhook with the toleration. This is nice because it all happens behind the scenes.
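For illustration only: if the built-in ExtendedResourceToleration plugin discussed earlier ends up doing the patching, the upstream docs describe it keying the injected toleration on the extended resource name (e.g. nvidia.com/gpu) rather than on a label like gpu.product, so the patched pod would carry something roughly like:

# Sketch of the toleration ExtendedResourceToleration would inject for a pod
# requesting nvidia.com/gpu; not output from our cluster.
tolerations:
- key: "nvidia.com/gpu"
  operator: "Exists"
  effect: "NoSchedule"

If we instead keep the nvidia.com/gpu.product taint key from the test above, whatever webhook we use would need to inject that specific toleration.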

Enabling and playing with the webhook today

joachimweyl commented 2 weeks ago

What are the next steps to getting this implemented?

joachimweyl commented 2 days ago

@dystewart do we have a PR to link this to?