larsks opened 3 months ago
@jtriley, @Milstein, and @aabaris what are your thoughts on this?
@joachimweyl update description
Decision made that yes we need to prevent workloads from scheduling on GPU nodes if they are not requesting GPU.
@dystewart what is the status of this issue?
In talking with @naved001 about this today, it looks like we probably want to make use of taints and tolerations.
In the "Example use cases" section of the taints and tolerations documentation they specifically call out creating dedicated GPU nodes.
They suggest enabling the `ExtendedResourceTolerations` admission controller, which is disabled by default. I think that to enable this in OpenShift we will need to edit the cluster `kubeapiserver` resource to add the appropriate command line option.
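One possible shape for that edit (a sketch only, untested; whether `unsupportedConfigOverrides` accepts `apiServerArguments` in exactly this form is an assumption that should be verified on nerc-ocp-test first):

```yaml
# Sketch: enable the ExtendedResourceTolerations admission plugin by
# overriding kube-apiserver arguments on the operator's cluster resource.
# The unsupportedConfigOverrides shape below is an assumption, not a
# verified configuration.
apiVersion: operator.openshift.io/v1
kind: KubeAPIServer
metadata:
  name: cluster
spec:
  unsupportedConfigOverrides:
    apiServerArguments:
      enable-admission-plugins:
        - "ExtendedResourceTolerations"
```

As the name suggests, `unsupportedConfigOverrides` is not a supported interface, so this would need sign-off before going anywhere near production.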
I worked a bit with admission controllers for ope this winter, I'll look at implementing and testing ^
Cordoned and drained the wrk-10 node on nerc-ocp-test with:

```
oc adm cordon wrk-10
oc adm drain wrk-10 --ignore-daemonsets --delete-emptydir-data
```
Looking at making wrk-10 a dedicated GPU node as suggested in: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/#example-use-cases
Tainted the wrk-10 node:

```
kubectl taint nodes wrk-10 nvidia.com/gpu.product=NVIDIA-A100-SXM4-40GB-MIG-1g.5gb:NoSchedule
```

Uncordoned the wrk-10 node:

```
oc adm uncordon wrk-10
```
Attempted to schedule a GPU workload onto wrk-10 without a toleration (spec fragment):

```yaml
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vectoradd
      image: "nvidia/samples:vectoradd-cuda11.2.1"
      resources:
        limits:
          nvidia.com/gpu: 1
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB-MIG-1g.5gb
```
Fails as expected:

```
Warning  FailedScheduling  26s  default-scheduler  0/14 nodes are available: 1 node(s) had untolerated taint {nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB-MIG-1g.5gb}, 10 Insufficient nvidia.com/mig-1g.5gb, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/14 nodes are available: 10 No preemption victims found for incoming pod, 4 Preemption is not helpful for scheduling.
```
Attempting to deploy the same workload with the taint toleration:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vectoradd
      image: "nvidia/samples:vectoradd-cuda11.2.1"
      resources:
        limits:
          nvidia.com/gpu: 1
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB-MIG-1g.5gb
  tolerations:
    - key: "nvidia.com/gpu.product"
      operator: "Equal"
      value: "NVIDIA-A100-SXM4-40GB-MIG-1g.5gb"
      effect: "NoSchedule"
```
And the pod schedules successfully!
So this looks to be the way we should taint the nodes:

```
kubectl taint nodes wrk-10 nvidia.com/gpu.product=NVIDIA-A100-SXM4-40GB-MIG-1g.5gb:NoSchedule
```
To be clear: the only way to schedule a workload on a node tainted in this way is to explicitly add the toleration to the workload. Workloads without the appropriate toleration will be scheduled elsewhere.
How do we manage the toleration? Is there a simple process we can use to automatically add it to all containers that list a GPU in their resources?
@joachimweyl that is where the admission controller comes in. When a user requests a GPU, the resource is intercepted before creation and patched by the mutating webhook with the toleration. This is nice because it all happens behind the scenes.
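For illustration (based on the upstream Kubernetes docs linked above, not on anything tested here): the `ExtendedResourceTolerations` plugin keys the injected toleration on the name of the requested extended resource itself, so the result for a pod requesting `nvidia.com/gpu` would look roughly like:

```yaml
# Illustration only: the toleration ExtendedResourceTolerations would
# inject for a pod whose containers request the nvidia.com/gpu resource.
tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"
```

One thing worth double-checking: since the plugin tolerates the extended resource *name*, the node taint may need to use that same key (e.g. `nvidia.com/gpu`) rather than the `nvidia.com/gpu.product` label key we tested with above.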
Enabling and playing with the webhook today
What are the next steps to getting this implemented?
@dystewart do we have a PR to link this to?
Motivation
We don't currently have any restrictions in place to prevent non-GPU workloads from getting scheduled on the GPU nodes. Do we want to introduce such a restriction? That is, should the GPU nodes be for only GPU workloads?
Completion Criteria
Non-GPU requests will never access GPU node resources. These need to be reserved for GPU requests.
Description
Completion dates
Desired - 2024-04-17
Required - 2024-05-08