nerc-project / operations

Issues related to the operation of the NERC OpenShift environment

Testing and Observing NVIDIA A100 GPUs in NERC OpenShift Cluster #466

Closed: schwesig closed this issue 7 months ago

schwesig commented 8 months ago

Tracking

Follow-ups:

Motivation

To enhance the capabilities of the NERC OpenShift cluster, we plan to add NVIDIA A100 SXM4 GPUs to the infrastructure.

Update 2024-03-12: The completion criterion is 12 GPU nodes (4 GPUs each) added to NERC OpenShift. Currently 9 GPU nodes are available: 1 on the test cluster and 8 on the production cluster.

This work starts on the Test Cluster. The upgrade is scheduled for Tuesday, March 12, 2024, between 9 AM and 12 PM, deliberately during the BU spring break (March 9-17) to minimize disruptions. As a preparatory step, a preliminary installation of one GPU on the Test Cluster is set for this week (March 7th).

Timeline (needs update)

gantt
    title NERC OpenShift Cluster GPU Integration Timeline
    dateFormat  YYYY-MM-DD
    axisFormat  %a %b %d  %Y

    section Events
    Start of BU Spring Break : milestone, 2024-03-09, 1d
    Installation of 9 nodes NVIDIA A100 GPUs : milestone, 2024-03-12, 1d
    End of BU Spring Break : milestone, 2024-03-17, 1d

    section Preparation & Research
    Script for SpinUp : 2024-03-07, 2d
    Project to create workload : 2024-03-07, 2d
    Research Metric possibility : 2024-03-07, 2d

    section Tests
    Spin up multi notebooks with GPUs : 2024-03-12, 4d
    Workload on GPUs : 2024-03-12, 4d
    Monitor GPUs during tests : 2024-03-12, 4d

    section Business As Usual
    Monitor GPUs during classes : 2024-03-17, 5d

Objective

The primary goal of this initiative is to ensure that the newly installed GPUs are fully functional and ready for use.

To validate the GPUs' performance and readiness for future class workloads, particularly for Jupyter notebooks running on RHOAI, we will conduct both quantity and quality tests.
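
The quantity test itself will spin up notebooks through RHOAI, but as a minimal sketch of what claiming many GPUs looks like at the Kubernetes level, a throwaway Deployment whose replicas each request one GPU exercises the same scheduling path. The name, replica count, and image below are placeholders, not part of the actual test plan:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-claim-test            # placeholder name
spec:
  replicas: 4                     # set to the total number of GPUs we expect to claim
  selector:
    matchLabels:
      app: gpu-claim-test
  template:
    metadata:
      labels:
        app: gpu-claim-test
    spec:
      containers:
        - name: sleeper
          image: registry.access.redhat.com/ubi9/ubi-minimal:latest
          command: ["sleep", "infinity"]
          resources:
            limits:
              nvidia.com/gpu: 1   # each replica claims exactly one GPU

If all replicas reach Running, the GPUs are at least allocatable; replicas stuck in Pending point to scheduling or capacity problems.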

Requirements

Completion Criteria

  1. Confirmation that the GPUs can be utilized by notebooks without issues.
  2. Successful mass claiming of GPUs without creating system issues.
  3. Assurance that heavy GPU workloads do not lead to performance or stability issues.
  4. Capability to log, monitor, and measure GPU usage and workload within our OpenShift environment, determining the need for any special configurations or treatments (see the monitoring sketch below).
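
As a rough sketch of what criterion 4 could look like in practice: the GPU Operator already deploys dcgm-exporter on GPU nodes (see the nvidia.com/gpu.deploy.dcgm-exporter label further down in this issue), and dcgm-exporter exposes metrics such as DCGM_FI_DEV_GPU_UTIL. Assuming those metrics are scraped by the cluster monitoring stack, a simple alerting rule might look like the following; the namespace, rule name, and threshold are placeholders:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-utilization-example    # placeholder
  namespace: nvidia-gpu-operator   # assumption: namespace where the GPU Operator runs
spec:
  groups:
    - name: gpu-utilization
      rules:
        - alert: GPUSustainedHighUtilization
          # DCGM_FI_DEV_GPU_UTIL is the per-GPU utilization (percent) reported by dcgm-exporter.
          expr: avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL) > 90
          for: 30m
          labels:
            severity: info
          annotations:
            summary: "Sustained high GPU utilization on {{ $labels.Hostname }}"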

Team Members and Responsibilities

Connected or Related Issues

schwesig commented 8 months ago

CC: @hpdempsey

larsks commented 8 months ago

@hpdempsey raised a question this morning about how to ensure that GPU workloads run on nodes that have a GPU.

The production cluster runs the Node Feature Discovery (NFD) operator. This applies labels to nodes based on available hardware features. We're also using the NVIDIA GPU feature discovery plugin to generate labels based on GPU resources.

Taking node wrk-89 as an example, this gives us the following labels:

{
  "beta.kubernetes.io/arch": "amd64",
  "beta.kubernetes.io/os": "linux",
  "cluster.ocs.openshift.io/openshift-storage": "",
  "feature.node.kubernetes.io/cpu-cpuid.ADX": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AESNI": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX2": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX512BW": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX512CD": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX512DQ": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX512F": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX512VL": "true",
  "feature.node.kubernetes.io/cpu-cpuid.FMA3": "true",
  "feature.node.kubernetes.io/cpu-cpuid.HLE": "true",
  "feature.node.kubernetes.io/cpu-cpuid.IBPB": "true",
  "feature.node.kubernetes.io/cpu-cpuid.MPX": "true",
  "feature.node.kubernetes.io/cpu-cpuid.RTM": "true",
  "feature.node.kubernetes.io/cpu-cpuid.SSE4": "true",
  "feature.node.kubernetes.io/cpu-cpuid.SSE42": "true",
  "feature.node.kubernetes.io/cpu-cpuid.STIBP": "true",
  "feature.node.kubernetes.io/cpu-cpuid.VMX": "true",
  "feature.node.kubernetes.io/cpu-cstate.enabled": "true",
  "feature.node.kubernetes.io/cpu-hardware_multithreading": "true",
  "feature.node.kubernetes.io/cpu-pstate.status": "passive",
  "feature.node.kubernetes.io/cpu-pstate.turbo": "true",
  "feature.node.kubernetes.io/cpu-rdt.RDTCMT": "true",
  "feature.node.kubernetes.io/cpu-rdt.RDTL3CA": "true",
  "feature.node.kubernetes.io/cpu-rdt.RDTMBA": "true",
  "feature.node.kubernetes.io/cpu-rdt.RDTMBM": "true",
  "feature.node.kubernetes.io/cpu-rdt.RDTMON": "true",
  "feature.node.kubernetes.io/kernel-config.DMI": "true",
  "feature.node.kubernetes.io/kernel-config.NO_HZ": "true",
  "feature.node.kubernetes.io/kernel-config.X86": "true",
  "feature.node.kubernetes.io/kernel-selinux.enabled": "true",
  "feature.node.kubernetes.io/kernel-version.full": "5.14.0-284.32.1.el9_2.x86_64",
  "feature.node.kubernetes.io/kernel-version.major": "5",
  "feature.node.kubernetes.io/kernel-version.minor": "14",
  "feature.node.kubernetes.io/kernel-version.revision": "0",
  "feature.node.kubernetes.io/memory-numa": "true",
  "feature.node.kubernetes.io/pci-0200_14e4.present": "true",
  "feature.node.kubernetes.io/pci-0300_102b.present": "true",
  "feature.node.kubernetes.io/pci-0302_10de.present": "true",
  "feature.node.kubernetes.io/storage-nonrotationaldisk": "true",
  "feature.node.kubernetes.io/system-os_release.ID": "rhcos",
  "feature.node.kubernetes.io/system-os_release.OPENSHIFT_VERSION": "4.13",
  "feature.node.kubernetes.io/system-os_release.OSTREE_VERSION": "413.92.202309112228-0",
  "feature.node.kubernetes.io/system-os_release.RHEL_VERSION": "9.2",
  "feature.node.kubernetes.io/system-os_release.VERSION_ID": "4.13",
  "feature.node.kubernetes.io/system-os_release.VERSION_ID.major": "4",
  "feature.node.kubernetes.io/system-os_release.VERSION_ID.minor": "13",
  "kubernetes.io/arch": "amd64",
  "kubernetes.io/hostname": "wrk-89",
  "kubernetes.io/os": "linux",
  "node-role.kubernetes.io/worker": "",
  "node.openshift.io/os_id": "rhcos",
  "nvidia.com/cuda.driver.major": "535",
  "nvidia.com/cuda.driver.minor": "129",
  "nvidia.com/cuda.driver.rev": "03",
  "nvidia.com/cuda.runtime.major": "12",
  "nvidia.com/cuda.runtime.minor": "2",
  "nvidia.com/gfd.timestamp": "1708636251",
  "nvidia.com/gpu-driver-upgrade-state": "upgrade-done",
  "nvidia.com/gpu.compute.major": "7",
  "nvidia.com/gpu.compute.minor": "0",
  "nvidia.com/gpu.count": "1",
  "nvidia.com/gpu.deploy.container-toolkit": "true",
  "nvidia.com/gpu.deploy.dcgm": "true",
  "nvidia.com/gpu.deploy.dcgm-exporter": "true",
  "nvidia.com/gpu.deploy.device-plugin": "true",
  "nvidia.com/gpu.deploy.driver": "true",
  "nvidia.com/gpu.deploy.gpu-feature-discovery": "true",
  "nvidia.com/gpu.deploy.node-status-exporter": "true",
  "nvidia.com/gpu.deploy.nvsm": "",
  "nvidia.com/gpu.deploy.operator-validator": "true",
  "nvidia.com/gpu.family": "volta",
  "nvidia.com/gpu.machine": "PowerEdge-R740xd",
  "nvidia.com/gpu.memory": "32768",
  "nvidia.com/gpu.present": "true",
  "nvidia.com/gpu.product": "Tesla-V100-PCIE-32GB",
  "nvidia.com/gpu.replicas": "1",
  "nvidia.com/mig.capable": "false",
  "nvidia.com/mig.strategy": "single"
}

We can use these labels to control the placement of pods through node affinity rules. For example, to ensure that Pods in a Deployment run only on nodes that have a GPU (of any sort), we might write:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: affinity-example
spec:
  replicas: 1
  selector:                 # required by apps/v1; must match the pod template labels
    matchLabels:
      app: affinity-example
  template:
    metadata:
      labels:
        app: affinity-example
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: nvidia.com/gpu.present
                operator: In
                values:
                - "true"
      containers:
        - image: docker.io/traefik/whoami:latest
          imagePullPolicy: Always
          name: whoami
          env:
            - name: WHOAMI_PORT_NUMBER
              value: "8080"
          ports:
            - name: http
              protocol: TCP
              containerPort: 8080

To limit node selections to nodes with a particular GPU model, we could modify our affinity configuration to use more specific selectors, such as gpu.family or gpu.product:

      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: nvidia.com/gpu.product
                operator: In
                values:
                - "Tesla-V100-PCIE-32GB"
DanNiESh commented 8 months ago

Question on quantity tests:

  1. Which image should we use to create the notebook? A CUDA image?
  2. Do we need to test that GPUs are running properly on notebooks after they are launched?

DanNiESh commented 8 months ago

PR on quantity test: https://github.com/OCP-on-NERC/ope-tests/pull/3

Update: currently a GPU cannot be shared among notebooks on the prod cluster; each notebook claims one full GPU. We need to figure out how to support multi-instance GPU (MIG): https://www.redhat.com/en/blog/multi-instance-gpu-support-with-the-gpu-operator-v1.7.0 (@larsks pointed out that we can test it on the test cluster).
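
Not a worked-out answer, but a sketch of the mechanism from that blog post, assuming we go through the GPU Operator's MIG manager: the MIG strategy is set in the ClusterPolicy (only the relevant excerpt is shown; a real ClusterPolicy has many more fields), and the partition layout is then chosen per node via the nvidia.com/mig.config label. The all-1g.5gb profile mentioned below is only an example; available profiles depend on the exact A100 model and would need to be checked.

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  mig:
    strategy: single     # expose MIG slices as ordinary nvidia.com/gpu resources
  migManager:
    enabled: true        # let the MIG manager reconfigure GPUs based on node labels

The per-node layout would then be selected by labeling the node (for example nvidia.com/mig.config=all-1g.5gb), which the MIG manager picks up and applies.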

schwesig commented 8 months ago

@DanNiESh

  1. Which image should we use to create the notebook? A CUDA image?

IMHO yes, but I am not sure whether they can also be used for @computate's quality tests. @larsks, @dystewart: can we use the Base OPE Image also for attaching GPUs? https://developer.nvidia.com/cuda-zone


  2. Do we need to test that GPUs are running properly on notebooks after they are launched?

For the basic quantity test, no; that should be covered by Chris's quality test. But if there is an already known and easy way to put some load/calculation on the GPU when spinning up the image, so that we get some metrics out of it, that would be nice to have.
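
One candidate for that, as a sketch we have not run yet: NVIDIA's CUDA vector-add sample image, launched as a one-off pod. It claims a GPU via resources.limits, runs a short computation, and exits, which should be enough to show up in the DCGM metrics. The image tag is an assumption and would need to be checked against what we can actually pull:

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd-smoke-test   # placeholder name
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vectoradd
      # Sample workload image from NVIDIA's registry; the tag is an assumption.
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
      resources:
        limits:
          nvidia.com/gpu: 1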

schwesig commented 8 months ago

@DanNiESh

PR on quantity test: OCP-on-NERC/ope-tests#3 ... We need to figure out how to support multi-instance GPU...

computate commented 8 months ago

@schwesig I will use the TensorFlow image with GPUs, if possible, for my tests.

schwesig commented 8 months ago

current status 2024-03-13:

jtriley commented 7 months ago

Just noting that I added an additional 4xA100 GPU node to the prod cluster (wrk-90).

schwesig commented 7 months ago

current status 2024-03-19:

schwesig commented 7 months ago

Last bullet point done: documentation and approval (here in this issue and connected issues/PRs).