Closed: schwesig closed this issue 7 months ago
CC: @hpdempsey
@hpdempsey raised a question this morning about how to ensure that GPU workloads run on nodes that have a GPU.
The production cluster runs the Node Feature Discovery (NFD) Operator, which applies labels to nodes based on available hardware features. We are also using the NVIDIA GPU Feature Discovery plugin to generate labels based on GPU resources.
Taking node wrk-89 as an example, this gives us the following labels:
{
  "beta.kubernetes.io/arch": "amd64",
  "beta.kubernetes.io/os": "linux",
  "cluster.ocs.openshift.io/openshift-storage": "",
  "feature.node.kubernetes.io/cpu-cpuid.ADX": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AESNI": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX2": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX512BW": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX512CD": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX512DQ": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX512F": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX512VL": "true",
  "feature.node.kubernetes.io/cpu-cpuid.FMA3": "true",
  "feature.node.kubernetes.io/cpu-cpuid.HLE": "true",
  "feature.node.kubernetes.io/cpu-cpuid.IBPB": "true",
  "feature.node.kubernetes.io/cpu-cpuid.MPX": "true",
  "feature.node.kubernetes.io/cpu-cpuid.RTM": "true",
  "feature.node.kubernetes.io/cpu-cpuid.SSE4": "true",
  "feature.node.kubernetes.io/cpu-cpuid.SSE42": "true",
  "feature.node.kubernetes.io/cpu-cpuid.STIBP": "true",
  "feature.node.kubernetes.io/cpu-cpuid.VMX": "true",
  "feature.node.kubernetes.io/cpu-cstate.enabled": "true",
  "feature.node.kubernetes.io/cpu-hardware_multithreading": "true",
  "feature.node.kubernetes.io/cpu-pstate.status": "passive",
  "feature.node.kubernetes.io/cpu-pstate.turbo": "true",
  "feature.node.kubernetes.io/cpu-rdt.RDTCMT": "true",
  "feature.node.kubernetes.io/cpu-rdt.RDTL3CA": "true",
  "feature.node.kubernetes.io/cpu-rdt.RDTMBA": "true",
  "feature.node.kubernetes.io/cpu-rdt.RDTMBM": "true",
  "feature.node.kubernetes.io/cpu-rdt.RDTMON": "true",
  "feature.node.kubernetes.io/kernel-config.DMI": "true",
  "feature.node.kubernetes.io/kernel-config.NO_HZ": "true",
  "feature.node.kubernetes.io/kernel-config.X86": "true",
  "feature.node.kubernetes.io/kernel-selinux.enabled": "true",
  "feature.node.kubernetes.io/kernel-version.full": "5.14.0-284.32.1.el9_2.x86_64",
  "feature.node.kubernetes.io/kernel-version.major": "5",
  "feature.node.kubernetes.io/kernel-version.minor": "14",
  "feature.node.kubernetes.io/kernel-version.revision": "0",
  "feature.node.kubernetes.io/memory-numa": "true",
  "feature.node.kubernetes.io/pci-0200_14e4.present": "true",
  "feature.node.kubernetes.io/pci-0300_102b.present": "true",
  "feature.node.kubernetes.io/pci-0302_10de.present": "true",
  "feature.node.kubernetes.io/storage-nonrotationaldisk": "true",
  "feature.node.kubernetes.io/system-os_release.ID": "rhcos",
  "feature.node.kubernetes.io/system-os_release.OPENSHIFT_VERSION": "4.13",
  "feature.node.kubernetes.io/system-os_release.OSTREE_VERSION": "413.92.202309112228-0",
  "feature.node.kubernetes.io/system-os_release.RHEL_VERSION": "9.2",
  "feature.node.kubernetes.io/system-os_release.VERSION_ID": "4.13",
  "feature.node.kubernetes.io/system-os_release.VERSION_ID.major": "4",
  "feature.node.kubernetes.io/system-os_release.VERSION_ID.minor": "13",
  "kubernetes.io/arch": "amd64",
  "kubernetes.io/hostname": "wrk-89",
  "kubernetes.io/os": "linux",
  "node-role.kubernetes.io/worker": "",
  "node.openshift.io/os_id": "rhcos",
  "nvidia.com/cuda.driver.major": "535",
  "nvidia.com/cuda.driver.minor": "129",
  "nvidia.com/cuda.driver.rev": "03",
  "nvidia.com/cuda.runtime.major": "12",
  "nvidia.com/cuda.runtime.minor": "2",
  "nvidia.com/gfd.timestamp": "1708636251",
  "nvidia.com/gpu-driver-upgrade-state": "upgrade-done",
  "nvidia.com/gpu.compute.major": "7",
  "nvidia.com/gpu.compute.minor": "0",
  "nvidia.com/gpu.count": "1",
  "nvidia.com/gpu.deploy.container-toolkit": "true",
  "nvidia.com/gpu.deploy.dcgm": "true",
  "nvidia.com/gpu.deploy.dcgm-exporter": "true",
  "nvidia.com/gpu.deploy.device-plugin": "true",
  "nvidia.com/gpu.deploy.driver": "true",
  "nvidia.com/gpu.deploy.gpu-feature-discovery": "true",
  "nvidia.com/gpu.deploy.node-status-exporter": "true",
  "nvidia.com/gpu.deploy.nvsm": "",
  "nvidia.com/gpu.deploy.operator-validator": "true",
  "nvidia.com/gpu.family": "volta",
  "nvidia.com/gpu.machine": "PowerEdge-R740xd",
  "nvidia.com/gpu.memory": "32768",
  "nvidia.com/gpu.present": "true",
  "nvidia.com/gpu.product": "Tesla-V100-PCIE-32GB",
  "nvidia.com/gpu.replicas": "1",
  "nvidia.com/mig.capable": "false",
  "nvidia.com/mig.strategy": "single"
}
We can use these labels to control the placement of pods through node affinity rules. For example, to ensure that Pods in a Deployment run only on nodes that have a GPU (of any sort), we might write:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: affinity-example
spec:
  replicas: 1
  # spec.selector is required for apps/v1 Deployments and must match the pod template labels
  selector:
    matchLabels:
      app: affinity-example
  template:
    metadata:
      labels:
        app: affinity-example
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: nvidia.com/gpu.present
                    operator: In
                    values:
                      - "true"
      containers:
        - image: docker.io/traefik/whoami:latest
          imagePullPolicy: Always
          name: whoami
          env:
            - name: WHOAMI_PORT_NUMBER
              value: "8080"
          ports:
            - name: http
              protocol: TCP
              containerPort: 8080
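As an aside, when all we need is a single exact label match like this, a plain nodeSelector is an equivalent and more compact alternative to the affinity block (a sketch; nodeSelector only supports exact matches, while affinity also supports operators such as In, NotIn, and Exists):

```yaml
# Equivalent placement constraint expressed as a nodeSelector
# on the pod template instead of a nodeAffinity rule.
spec:
  template:
    spec:
      nodeSelector:
        nvidia.com/gpu.present: "true"
```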
To limit node selection to nodes with a particular GPU model, we could modify our affinity configuration to use more specific selectors, such as nvidia.com/gpu.family or nvidia.com/gpu.product:
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: nvidia.com/gpu.product
              operator: In
              values:
                - "Tesla-V100-PCIE-32GB"
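Note that affinity rules only control placement; for a container to actually be allocated a GPU by the NVIDIA device plugin, it must also request the nvidia.com/gpu extended resource. A minimal container-level sketch (the image tag is an assumption, substitute whatever CUDA base image the workload actually uses):

```yaml
# Requesting the GPU itself, in addition to (or instead of) affinity rules.
# The scheduler will only place this pod on a node with an unallocated GPU,
# and the device plugin will expose the device inside the container.
containers:
  - name: cuda-workload
    image: nvcr.io/nvidia/cuda:12.2.0-base-ubi9  # tag is an assumption
    resources:
      limits:
        nvidia.com/gpu: 1
```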
Question on quantity tests:
PR on quantity test: https://github.com/OCP-on-NERC/ope-tests/pull/3
Update: currently a GPU cannot be shared among notebooks on the prod cluster; each notebook claims one full GPU. We need to figure out how to support Multi-Instance GPU (MIG): https://www.redhat.com/en/blog/multi-instance-gpu-support-with-the-gpu-operator-v1.7.0 (@larsks pointed out that we can test it on the test cluster)
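As a rough sketch of what MIG support involves with the GPU operator: MIG is only available on Ampere-class and newer GPUs (the A100 nodes; the V100 in wrk-89 reports nvidia.com/mig.capable: "false" above), and it is typically enabled by choosing a MIG strategy in the operator's ClusterPolicy plus a per-node nvidia.com/mig.config label. The profile name below follows the NVIDIA GPU operator documentation and should be verified against our operator version:

```yaml
# Hypothetical sketch: split each A100 into 1g.10gb MIG slices.
# ClusterPolicy excerpt; with strategy "single", the slices are still
# advertised to Kubernetes as the nvidia.com/gpu resource.
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  mig:
    strategy: single
---
# The per-node MIG layout is then selected with a label, e.g.:
#   oc label node wrk-90 nvidia.com/mig.config=all-1g.10gb --overwrite
```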
@DanNiESh
- Which image should we use to create the notebook? A CUDA image?
IMHO yes, but I am not sure whether it can also be used for @computate's quality tests. @larsks, @dystewart: can we use the base OPE image for attaching GPUs as well? https://developer.nvidia.com/cuda-zone
- Do we need to test that GPUs are working properly on notebooks after they are launched?
For the basic quantity test, no; this should be covered by the quality test by Chris. BUT: if there is an already known and easy way to put some load/calculation on the GPU when spinning up the image, so that we get some metrics, that would be a nice-to-have.
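If we do want a cheap GPU smoke test, one well-known option is NVIDIA's CUDA vectorAdd sample run as a one-shot pod (the image reference below is the one commonly used in NVIDIA GPU operator validation examples; treat the exact tag as an assumption):

```yaml
# One-shot pod that runs a small CUDA kernel and exits; "Test PASSED"
# in the pod logs indicates the GPU is usable from inside a container.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda-vectoradd
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04  # tag is an assumption
      resources:
        limits:
          nvidia.com/gpu: 1
```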
@DanNiESh
PR on quantity test: OCP-on-NERC/ope-tests#3 ... We need to figure out how to support multi-instance GPU...
@schwesig I will use the TensorFlow image with GPUs, if possible, for my tests.
current status 2024-03-13:
Just noting that I added an additional 4xA100 GPU node to the prod cluster (wrk-90).
current status 2024-03-19:
Last bullet point done: Documentation and approval (here in this issue and in connected issues/PRs).
Tracking
Follow ups:
Motivation
To enhance the capabilities of the NERC OpenShift cluster, we plan to add NVIDIA A100 SXM4 GPUs to the infrastructure.
Update 2024-03-12: Completion criteria: 12 GPU nodes (4 GPUs each) added to NERC OpenShift. Currently 9 GPU nodes are available: 1 on the test cluster, 8 on the prod cluster.
At first on the Test Cluster. This upgrade is scheduled to occur on Tuesday, March 12, 2024, between 9 AM and 12 PM, strategically during the BU spring break (March 9-17) to minimize disruptions. A preliminary installation of one GPU on the Test Cluster is set for this week (March 7th) as a preparatory step. Timeline needs update.
Objective
The primary goal of this initiative is to ensure that the newly installed GPUs are working as expected. To validate the GPUs' performance and readiness for future class workloads, particularly for Jupyter notebooks running on RHOAI, we will conduct both quantity and quality tests.
Requirements
Completion Criteria
Team Members and Responsibilities
Connected or Related Issues