nerc-project / operations

Issues related to the operation of the NERC OpenShift environment

Testing and Observing NVIDIA A100 GPUs in NERC OpenShift Cluster #466

Closed: schwesig closed this issue 7 months ago

schwesig commented 8 months ago

Tracking

Follow-ups:

Motivation

To enhance the capabilities of the NERC OpenShift cluster, we plan to add NVIDIA A100 SXM4 GPUs to the infrastructure.

Update 2024-03-12: The completion criterion is 12 GPU nodes (4 GPUs each) added to NERC OpenShift. Currently 9 GPU nodes are available: 1 on the test cluster and 8 on the production cluster.

This work starts on the Test Cluster. The upgrade is scheduled for Tuesday, March 12, 2024, between 9 AM and 12 PM, deliberately during the BU spring break (March 9-17) to minimize disruptions. As a preparatory step, a preliminary installation of one GPU on the Test Cluster is set for this week (March 7th).

Timeline (needs update)

gantt
    title NERC OpenShift Cluster GPU Integration Timeline
    dateFormat  YYYY-MM-DD
    axisFormat  %a %b %d  %Y

    section Events
    Start of BU Spring Break : milestone, 2024-03-09, 1d
    Installation of 9 nodes NVIDIA A100 GPUs : milestone, 2024-03-12, 1d
    End of BU Spring Break : milestone, 2024-03-17, 1d

    section Preparation & Research
    Script for SpinUp : 2024-03-07, 2d
    Project to create workload : 2024-03-07, 2d
    Research Metric possibility : 2024-03-07, 2d

    section Tests
    Spin up multi notebooks with GPUs : 2024-03-12, 4d
    Workload on GPUs : 2024-03-12, 4d
    Monitor GPUs during tests : 2024-03-12, 4d

    section Business As Usual
    Monitor GPUs during classes : 2024-03-17, 5d

Objective

The primary goal of this initiative is to ensure that the newly installed GPUs are fully functional and ready for use.

To validate the GPUs' performance and readiness for future class workloads, particularly for Jupyter notebooks running on RHOAI, we will conduct both quantity and quality tests.
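
The quantity test itself will spin up notebooks through RHOAI, but as a minimal sketch of what claiming many GPUs looks like at the Kubernetes level, a throwaway Deployment whose replicas each request one GPU exercises the same scheduling path. The name, replica count, and image below are placeholders, not part of the actual test plan:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-claim-test            # placeholder name
spec:
  replicas: 4                     # set to the total number of GPUs we expect to claim
  selector:
    matchLabels:
      app: gpu-claim-test
  template:
    metadata:
      labels:
        app: gpu-claim-test
    spec:
      containers:
        - name: sleeper
          image: registry.access.redhat.com/ubi9/ubi-minimal:latest
          command: ["sleep", "infinity"]
          resources:
            limits:
              nvidia.com/gpu: 1   # each replica claims exactly one GPU

If all replicas reach Running, the GPUs are at least allocatable; replicas stuck in Pending point to scheduling or capacity problems.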

Requirements

Completion Criteria

  1. Confirmation that the GPUs can be utilized by notebooks without issues.
  2. Successful mass claiming of GPUs without creating system issues.
  3. Assurance that heavy GPU workloads do not lead to performance or stability issues.
  4. Capability to log, monitor, and measure GPU usage and workload within our OpenShift environment, determining the need for any special configurations or treatments (see the monitoring sketch below).
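
As a rough sketch of what criterion 4 could look like in practice: the GPU Operator already deploys dcgm-exporter on GPU nodes (see the nvidia.com/gpu.deploy.dcgm-exporter label further down in this issue), and dcgm-exporter exposes metrics such as DCGM_FI_DEV_GPU_UTIL. Assuming those metrics are scraped by the cluster monitoring stack, a simple alerting rule might look like the following; the namespace, rule name, and threshold are placeholders:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-utilization-example    # placeholder
  namespace: nvidia-gpu-operator   # assumption: namespace where the GPU Operator runs
spec:
  groups:
    - name: gpu-utilization
      rules:
        - alert: GPUSustainedHighUtilization
          # DCGM_FI_DEV_GPU_UTIL is the per-GPU utilization (percent) reported by dcgm-exporter.
          expr: avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL) > 90
          for: 30m
          labels:
            severity: info
          annotations:
            summary: "Sustained high GPU utilization on {{ $labels.Hostname }}"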

Team Members and Responsibilities

Connected or Related Issues

schwesig commented 8 months ago

CC: @hpdempsey

larsks commented 8 months ago

@hpdempsey raised a question this morning about how to ensure that GPU workloads run on nodes that have a GPU.

The production cluster runs the Node Feature Discovery (NFD) operator. This applies labels to nodes based on available hardware features. We're also using the NVIDIA GPU feature discovery plugin to generate labels based on GPU resources.

Taking node wrk-89 as an example, this gives us the following labels:

{
  "beta.kubernetes.io/arch": "amd64",
  "beta.kubernetes.io/os": "linux",
  "cluster.ocs.openshift.io/openshift-storage": "",
  "feature.node.kubernetes.io/cpu-cpuid.ADX": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AESNI": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX2": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX512BW": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX512CD": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX512DQ": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX512F": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX512VL": "true",
  "feature.node.kubernetes.io/cpu-cpuid.FMA3": "true",
  "feature.node.kubernetes.io/cpu-cpuid.HLE": "true",
  "feature.node.kubernetes.io/cpu-cpuid.IBPB": "true",
  "feature.node.kubernetes.io/cpu-cpuid.MPX": "true",
  "feature.node.kubernetes.io/cpu-cpuid.RTM": "true",
  "feature.node.kubernetes.io/cpu-cpuid.SSE4": "true",
  "feature.node.kubernetes.io/cpu-cpuid.SSE42": "true",
  "feature.node.kubernetes.io/cpu-cpuid.STIBP": "true",
  "feature.node.kubernetes.io/cpu-cpuid.VMX": "true",
  "feature.node.kubernetes.io/cpu-cstate.enabled": "true",
  "feature.node.kubernetes.io/cpu-hardware_multithreading": "true",
  "feature.node.kubernetes.io/cpu-pstate.status": "passive",
  "feature.node.kubernetes.io/cpu-pstate.turbo": "true",
  "feature.node.kubernetes.io/cpu-rdt.RDTCMT": "true",
  "feature.node.kubernetes.io/cpu-rdt.RDTL3CA": "true",
  "feature.node.kubernetes.io/cpu-rdt.RDTMBA": "true",
  "feature.node.kubernetes.io/cpu-rdt.RDTMBM": "true",
  "feature.node.kubernetes.io/cpu-rdt.RDTMON": "true",
  "feature.node.kubernetes.io/kernel-config.DMI": "true",
  "feature.node.kubernetes.io/kernel-config.NO_HZ": "true",
  "feature.node.kubernetes.io/kernel-config.X86": "true",
  "feature.node.kubernetes.io/kernel-selinux.enabled": "true",
  "feature.node.kubernetes.io/kernel-version.full": "5.14.0-284.32.1.el9_2.x86_64",
  "feature.node.kubernetes.io/kernel-version.major": "5",
  "feature.node.kubernetes.io/kernel-version.minor": "14",
  "feature.node.kubernetes.io/kernel-version.revision": "0",
  "feature.node.kubernetes.io/memory-numa": "true",
  "feature.node.kubernetes.io/pci-0200_14e4.present": "true",
  "feature.node.kubernetes.io/pci-0300_102b.present": "true",
  "feature.node.kubernetes.io/pci-0302_10de.present": "true",
  "feature.node.kubernetes.io/storage-nonrotationaldisk": "true",
  "feature.node.kubernetes.io/system-os_release.ID": "rhcos",
  "feature.node.kubernetes.io/system-os_release.OPENSHIFT_VERSION": "4.13",
  "feature.node.kubernetes.io/system-os_release.OSTREE_VERSION": "413.92.202309112228-0",
  "feature.node.kubernetes.io/system-os_release.RHEL_VERSION": "9.2",
  "feature.node.kubernetes.io/system-os_release.VERSION_ID": "4.13",
  "feature.node.kubernetes.io/system-os_release.VERSION_ID.major": "4",
  "feature.node.kubernetes.io/system-os_release.VERSION_ID.minor": "13",
  "kubernetes.io/arch": "amd64",
  "kubernetes.io/hostname": "wrk-89",
  "kubernetes.io/os": "linux",
  "node-role.kubernetes.io/worker": "",
  "node.openshift.io/os_id": "rhcos",
  "nvidia.com/cuda.driver.major": "535",
  "nvidia.com/cuda.driver.minor": "129",
  "nvidia.com/cuda.driver.rev": "03",
  "nvidia.com/cuda.runtime.major": "12",
  "nvidia.com/cuda.runtime.minor": "2",
  "nvidia.com/gfd.timestamp": "1708636251",
  "nvidia.com/gpu-driver-upgrade-state": "upgrade-done",
  "nvidia.com/gpu.compute.major": "7",
  "nvidia.com/gpu.compute.minor": "0",
  "nvidia.com/gpu.count": "1",
  "nvidia.com/gpu.deploy.container-toolkit": "true",
  "nvidia.com/gpu.deploy.dcgm": "true",
  "nvidia.com/gpu.deploy.dcgm-exporter": "true",
  "nvidia.com/gpu.deploy.device-plugin": "true",
  "nvidia.com/gpu.deploy.driver": "true",
  "nvidia.com/gpu.deploy.gpu-feature-discovery": "true",
  "nvidia.com/gpu.deploy.node-status-exporter": "true",
  "nvidia.com/gpu.deploy.nvsm": "",
  "nvidia.com/gpu.deploy.operator-validator": "true",
  "nvidia.com/gpu.family": "volta",
  "nvidia.com/gpu.machine": "PowerEdge-R740xd",
  "nvidia.com/gpu.memory": "32768",
  "nvidia.com/gpu.present": "true",
  "nvidia.com/gpu.product": "Tesla-V100-PCIE-32GB",
  "nvidia.com/gpu.replicas": "1",
  "nvidia.com/mig.capable": "false",
  "nvidia.com/mig.strategy": "single"
}

We can use these labels to control the placement of pods through node affinity rules. For example, to ensure that Pods in a Deployment run only on nodes that have a GPU (of any sort), we might write:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: affinity-example
spec:
  replicas: 1
  selector:                 # required by apps/v1; must match the pod template labels
    matchLabels:
      app: affinity-example
  template:
    metadata:
      labels:
        app: affinity-example
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: nvidia.com/gpu.present
                operator: In
                values:
                - "true"
      containers:
        - image: docker.io/traefik/whoami:latest
          imagePullPolicy: Always
          name: whoami
          env:
            - name: WHOAMI_PORT_NUMBER
              value: "8080"
          ports:
            - name: http
              protocol: TCP
              containerPort: 8080

To limit node selections to nodes with a particular GPU model, we could modify our affinity configuration to use more specific selectors, such as gpu.family or gpu.product:

      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: nvidia.com/gpu.product
                operator: In
                values:
                - "Tesla-V100-PCIE-32GB"
DanNiESh commented 8 months ago

Question on quantity tests:

  1. Which image should we use to create the notebook? A CUDA image?
  2. Do we need to test that GPUs are running properly on notebooks after they are launched?

DanNiESh commented 8 months ago

PR on quantity test: https://github.com/OCP-on-NERC/ope-tests/pull/3

Update: currently a GPU cannot be shared among notebooks on the prod cluster; each notebook claims one full GPU. We need to figure out how to support multi-instance GPU (MIG): https://www.redhat.com/en/blog/multi-instance-gpu-support-with-the-gpu-operator-v1.7.0 (@larsks pointed out that we can test it on the test cluster).
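
Not a worked-out answer, but a sketch of the mechanism from that blog post, assuming we go through the GPU Operator's MIG manager: the MIG strategy is set in the ClusterPolicy (only the relevant excerpt is shown; a real ClusterPolicy has many more fields), and the partition layout is then chosen per node via the nvidia.com/mig.config label. The all-1g.5gb profile mentioned below is only an example; available profiles depend on the exact A100 model and would need to be checked.

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  mig:
    strategy: single     # expose MIG slices as ordinary nvidia.com/gpu resources
  migManager:
    enabled: true        # let the MIG manager reconfigure GPUs based on node labels

The per-node layout would then be selected by labeling the node (for example nvidia.com/mig.config=all-1g.5gb), which the MIG manager picks up and applies.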

schwesig commented 8 months ago

@DanNiESh

  1. Which image should we use to create the notebook? A CUDA image?

IMHO yes, but I am not sure whether they can also be used for @computate's quality tests. @larsks, @dystewart: can we use the Base OPE Image also for attaching GPUs? https://developer.nvidia.com/cuda-zone


  2. Do we need to test that GPUs are running properly on notebooks after they are launched?

For the basic quantity test, no; that should be covered by Chris's quality test. But if there is an already known and easy way to put some load/calculation on the GPU when spinning up the image, so that we get some metrics out of it, that would be nice to have.
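
One candidate for that, as a sketch we have not run yet: NVIDIA's CUDA vector-add sample image, launched as a one-off pod. It claims a GPU via resources.limits, runs a short computation, and exits, which should be enough to show up in the DCGM metrics. The image tag is an assumption and would need to be checked against what we can actually pull:

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd-smoke-test   # placeholder name
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vectoradd
      # Sample workload image from NVIDIA's registry; the tag is an assumption.
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
      resources:
        limits:
          nvidia.com/gpu: 1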

schwesig commented 8 months ago

@DanNiESh

PR on quantity test: OCP-on-NERC/ope-tests#3 ... We need to figure out how to support multi-instance GPU...

computate commented 8 months ago

@schwesig I will use the TensorFlow image with GPUs, if possible, for my tests.

schwesig commented 8 months ago

current status 2024-03-13:

jtriley commented 7 months ago

Just noting that I added an additional 4xA100 GPU node to the prod cluster (wrk-90).

schwesig commented 7 months ago

current status 2024-03-19:

schwesig commented 7 months ago

Last bullet point done: documentation and approval (here in this issue and connected issues/PRs).