NVIDIA GPU Operator gpu-cluster-policy in OperandNotReady state in multiple clusters

computate commented 1 month ago

I was trying to figure out why the wrk-99 node in nerc-ocp-prod has 4 GPU devices, but the GPU utilization for wrk-99 on nerc-ocp-prod is not available.

I found that the NVIDIA GPU Operator gpu-cluster-policy on nerc-ocp-prod is in an OperandNotReady status, but the NVIDIA GPU Operator gpu-cluster-policy on nerc-ocp-test is in a Ready state.

I also noticed that the plugin-validation container of the nvidia-operator-validator-nnjjx pod in the nvidia-gpu-operator namespace, this container is not becoming ready and has a repeated erro in the log:

time="2024-10-11T15:35:31Z" level=info msg="pod nvidia-device-plugin-validator-6sb75 is curently in Failed phase"

I don't know the reason for this error with the gpu-cluster-policy in nerc-ocp-prod.

computate commented 1 month ago

The pod that is failing is on wrk-88 if that makes a difference, the node looks Ready.

computate commented 1 month ago

This is also happening in the nerc-ocp-test cluster now with our only GPU there. I'm pretty sure that means we can't use the GPU because the drivers are not installed on the node wrk-3 because of this.

$ oc -n nvidia-gpu-operator logs -l app=nvidia-device-plugin-validator -c plugin-validation
Failed to allocate device vector A (error code CUDA driver version is insufficient for CUDA runtime version)!
[Vector addition of 50000 elements]

computate commented 1 month ago

I created this Red Hat support case to address this.

StHeck commented 1 month ago

A google of the error code led me to this: https://stackoverflow.com/questions/3253257/cuda-driver-version-is-insufficient-for-cuda-runtime-version

Which has this comment: "-> CUDA driver version is insufficient for CUDA runtime version

But this error is misleading, by selecting back the NVIDIA(Performance mode) with nvidia-settings utility the problem disappears.

It is not a version problem."

Are the GPUs in power saving mode?

dystewart commented 1 month ago

@StHeck yeah you're right, just confirmed it's not a version error: https://docs.nvidia.com/deploy/cuda-compatibility/#cuda-11-and-later-defaults-to-minor-version-compatibility

According to nvidia-smi we have a valid config:

| NVIDIA-SMI 550.90.07 Driver Version: 550.90.07 CUDA Version: 12.4 |

dystewart commented 1 month ago

Looked at all workloads on wrk-3 node in the nerc-ocp-test cluster, to try and rule out race conditions preventing the GPU operator validator from starting properly. No workloads are competing for the GPU however.

Tried deleting and replacing the clusterPolicy with the following specs:

{
  "apiVersion": "nvidia.com/v1",
  "kind": "ClusterPolicy",
  "metadata": {
    "name": "gpu-cluster-policy"
  },
  "spec": {
    "operator": {
      "defaultRuntime": "crio",
      "use_ocp_driver_toolkit": true,
      "initContainer": {}
    },
    "sandboxWorkloads": {
      "enabled": false,
      "defaultWorkload": "container"
    },
    "driver": {
      "enabled": true,
      "useNvidiaDriverCRD": false,
      "useOpenKernelModules": false,
      "upgradePolicy": {
        "autoUpgrade": true,
        "drain": {
          "deleteEmptyDir": false,
          "enable": false,
          "force": false,
          "timeoutSeconds": 300
        },
        "maxParallelUpgrades": 1,
        "maxUnavailable": "25%",
        "podDeletion": {
          "deleteEmptyDir": false,
          "force": false,
          "timeoutSeconds": 300
        },
        "waitForCompletion": {
          "timeoutSeconds": 0
        }
      },
      "repoConfig": {
        "configMapName": ""
      },
      "certConfig": {
        "name": ""
      },
      "licensingConfig": {
        "nlsEnabled": true,
        "configMapName": ""
      },
      "virtualTopology": {
        "config": ""
      },
      "kernelModuleConfig": {
        "name": ""
      }
    },
    "dcgmExporter": {
      "enabled": true,
      "config": {
        "name": ""
      },
      "serviceMonitor": {
        "enabled": true
      }
    },
    "dcgm": {
      "enabled": true
    },
    "daemonsets": {
      "updateStrategy": "RollingUpdate",
      "rollingUpdate": {
        "maxUnavailable": "1"
      }
    },
    "devicePlugin": {
      "enabled": true,
      "config": {
        "name": "",
        "default": ""
      },
      "mps": {
        "root": "/run/nvidia/mps"
      }
    },
    "gfd": {
      "enabled": true
    },
    "migManager": {
      "enabled": true
    },
    "nodeStatusExporter": {
      "enabled": true
    },
    "mig": {
      "strategy": "single"
    },
    "toolkit": {
      "enabled": true
    },
    "validator": {
      "plugin": {
        "env": [
          {
            "name": "WITH_WORKLOAD",
            "value": "false"
          }
        ]
      }
    },
    "vgpuManager": {
      "enabled": false
    },
    "vgpuDeviceManager": {
      "enabled": true
    },
    "sandboxDevicePlugin": {
      "enabled": true
    },
    "vfioManager": {
      "enabled": true
    },
    "gds": {
      "enabled": false
    },
    "gdrcopy": {
      "enabled": false
    }
  }
}

No luck with the clusterPolicy. We have passed along the must gather to NVIDIA and are awaiting a response from them.

FYI I have disabled auto-sync within ArgoCD while we play around with these resources.

dystewart commented 1 month ago

Looks like this is also related: https://github.com/nerc-project/operations/issues/782

joachimweyl commented 1 month ago

@computate can we remove " in nerc-ocp-prod" from the title since now we see the same issue in nerc-ocp-test?

computate commented 1 month ago

Done changing the title @joachimweyl

schwesig commented 4 weeks ago

as of today 2024-10-24 12:55 ET nvidia operator version and last update call with @tssala23 @dystewart @schwesig

cluster	version	last update	error	machine config update	model failing	nodes
prod	24.6.2	Sep 25.	yes	no	A100 yes, V100 no	wrk-97
test-2 (kruize)	24.3.0	Oct 21.	yes	yes Oct 20.	A100 yes, V100 n/a	MOC-R8PAC23U39
beta	24.6.2		no	yes Oct 4.	A100 no, V100 n/a
test	24.6.2		yes	no	A100 yes, V100 yes	MOC-R8PAC23U27,

vague error message Failed to allocate device vector A (error code CUDA driver version is insufficient for CUDA runtime version)! [Vector addition of 50000 elements]

schwesig commented 4 weeks ago

idea for next step: @Taj Salawu removing and readding a GPU node on kruize (test-2) to try a fresh restart

joachimweyl commented 3 weeks ago

@schwesig do we know what nodes specifically are having these issues? Can you update the table above to include a column for node names?

schwesig commented 3 weeks ago

https://console-openshift-console.apps.nerc-ocp-test-2.nerc.mghpcc.org/k8s/cluster/nodes/wrk-5/yaml nerc-ocp-test-2 wrk-5

joachimweyl commented 3 weeks ago

@computate how many GPUs were out of order? am I correct that they were out of order from Oct 11th - 31st?

computate commented 2 weeks ago

@joachimweyl Correct about Oct 11th - 31st. I understand there were 2 GPU nodes broken on the test cluster (wrk-3, wrk-4), 2 GPU nodes broken on the prod cluster (wrk-97, wrk-99), and 1 GPU node broken with 4 GPU slices affected on cluster test-2 (wrk-5).

joachimweyl commented 2 weeks ago

@computate do we know which ones were V100 and which were A100? Actually what would be the most helpful is if we know the node name such as test was using MOC-R8PAC23U27 & test-2 was using MOC-R8PAC23U39, do we know what Prod was using?

schwesig commented 2 weeks ago

kruize
nerc-ocp-test-2.nerc.mghpcc.org applied the changes manually, and it works for them now again

computate commented 2 weeks ago

Running with these gpu-cluster-policy settings in prod is allowing the drivers to be installed on 9/11 nodes so far.

    manager:
      env:
      - name: ENABLE_GPU_POD_EVICTION
        value: "true"
      - name: ENABLE_AUTO_DRAIN
        value: "true"
      - name: DRAIN_USE_FORCE
        value: "true"
      - name: DRAIN_DELETE_EMPTYDIR_DATA
        value: "true"
    repoConfig:
      configMapName: ""
    upgradePolicy:
      autoUpgrade: false
      drain:
        deleteEmptyDir: true
        enable: true
        force: true
        timeoutSeconds: 300
      maxParallelUpgrades: 1
      maxUnavailable: 25%
      podDeletion:
        deleteEmptyDir: true
        force: true
        timeoutSeconds: 300

computate commented 2 weeks ago

Here is the list of nodes where the NVIDIA driver is successfully applied now.

$ oc -n nvidia-gpu-operator get pod -o wide | grep nvidia-operator-validator | grep Running
nvidia-operator-validator-697f5                       1/1     Running                    0                116m    10.131.23.245   wrk-99    <none>           <none>
nvidia-operator-validator-8nm2q                       1/1     Running                    0                117m    10.130.19.118   wrk-105   <none>           <none>
nvidia-operator-validator-dd6xw                       1/1     Running                    0                116m    10.129.19.48    wrk-102   <none>           <none>
nvidia-operator-validator-gvlgd                       1/1     Running                    0                115m    10.128.23.147   wrk-107   <none>           <none>
nvidia-operator-validator-kqdwr                       1/1     Running                    0                117m    10.130.24.128   wrk-108   <none>           <none>
nvidia-operator-validator-lpsvx                       1/1     Running                    0                116m    10.129.23.245   wrk-106   <none>           <none>
nvidia-operator-validator-n2pbh                       1/1     Running                    0                112m    10.130.12.56    wrk-88    <none>           <none>
nvidia-operator-validator-ndnkc                       1/1     Running                    0                116m    10.129.24.193   wrk-104   <none>           <none>
nvidia-operator-validator-z8zs8                       1/1     Running                    0                117m    10.128.24.144   wrk-103   <none>           <none>

Currently the NVIDIA driver is failing on wrk-97, but completed on wrk-89.

oc -n nvidia-gpu-operator get pod -o wide | grep nvidia-driver-daemonset | grep -v Running
nvidia-driver-daemonset-415.92.202407191425-0-l6mtg   0/2     Init:CrashLoopBackOff      4 (16s ago)      7m2s   10.129.21.35    wrk-97    <none>           <none>

The NVIDIA driver validation is failing on wrk-97 and wrk-89.

$ oc -n nvidia-gpu-operator get pod -o wide | grep nvidia-operator-validator | grep -v Running
nvidia-operator-validator-fsbfj                       0/1     Init:0/4                   0                71s     10.129.21.78    wrk-97    <none>           <none>
nvidia-operator-validator-hr5pg                       0/1     Init:3/4                   19 (4m21s ago)   115m    10.131.11.208   wrk-89    <none>           <none>

computate commented 2 weeks ago

Here are some useful commands I learned for viewing GPU utilization.

$ oc --as system:admin get node wrk-89 -o jsonpath={'.status.allocatable.nvidia\.com/gpu'}; echo
1
$ oc --as system:admin get node wrk-89 -o jsonpath={'.status.capacity.nvidia\.com/gpu'}; echo
1
$ oc --as system:admin get node wrk-97 -o jsonpath={'.status.allocatable.nvidia\.com/gpu'}; echo
0
$ oc --as system:admin get node wrk-97 -o jsonpath={'.status.capacity.nvidia\.com/gpu'}; echo
0
$ oc exec -ti $(oc get pod -o name -n nvidia-gpu-operator -l=app.kubernetes.io/component=nvidia-driver --field-selector spec.nodeName=wrk-89) -n nvidia-gpu-operator -- nvidia-smi
Thu Nov  7 18:48:25 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.05             Driver Version: 550.127.05     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla V100-PCIE-32GB           On  |   00000000:3B:00.0 Off |                    0 |
| N/A   35C    P0             25W /  250W |       4MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
$ 
$ 
$ oc exec -ti $(oc get pod -o name -n nvidia-gpu-operator -l=app.kubernetes.io/component=nvidia-driver --field-selector spec.nodeName=wrk-97) -n nvidia-gpu-operator -- nvidia-smi
error: unable to upgrade connection: container not found ("nvidia-driver-ctr")

computate commented 2 weeks ago

Also this command:

$ oc get pod -n nvidia-gpu-operator --field-selector spec.nodeName=wrk-97
NAME                                                  READY   STATUS                  RESTARTS         AGE
gpu-feature-discovery-r4czn                           0/1     Init:1/2                0                3m57s
nvidia-container-toolkit-daemonset-5zwpj              0/1     Init:0/1                0                3m57s
nvidia-dcgm-exporter-qcsmc                            1/1     Running                 0                3m57s
nvidia-dcgm-g4kgc                                     1/1     Running                 0                3m57s
nvidia-device-plugin-daemonset-7gf6n                  1/1     Running                 0                3m57s
nvidia-driver-daemonset-415.92.202407191425-0-m6hkg   0/2     Init:CrashLoopBackOff   60 (3m57s ago)   5h42m
nvidia-mig-manager-k59gx                              1/1     Running                 0                3m57s
nvidia-node-status-exporter-schzg                     1/1     Running                 0                22h
nvidia-operator-validator-5kbxt                       0/1     Init:0/4                0                3m57s

$ oc get pod -n nvidia-gpu-operator --field-selector spec.nodeName=wrk-89
NAME                                                  READY   STATUS                     RESTARTS         AGE
gpu-feature-discovery-p4pcz                           1/1     Running                    0                22h
nvidia-container-toolkit-daemonset-c49lj              1/1     Running                    0                22h
nvidia-cuda-validator-9vsjg                           0/1     Completed                  0                22h
nvidia-dcgm-exporter-7xkdt                            1/1     Running                    0                22h
nvidia-dcgm-twrj9                                     1/1     Running                    0                22h
nvidia-device-plugin-daemonset-gjm67                  1/1     Running                    0                22h
nvidia-device-plugin-validator-b25kp                  0/1     UnexpectedAdmissionError   0                6m48s
nvidia-driver-daemonset-415.92.202407191425-0-bs2xs   2/2     Running                    0                22h
nvidia-node-status-exporter-zzrdg                     1/1     Running                    0                22h
nvidia-operator-validator-hr5pg                       0/1     Init:CrashLoopBackOff      200 (107s ago)   22h

computate commented 2 weeks ago

$ oc get pod -n nvidia-gpu-operator --field-selector spec.nodeName=wrk-89
NAME                                                  READY   STATUS                     RESTARTS          AGE
gpu-feature-discovery-p4pcz                           1/1     Running                    0                 22h
nvidia-container-toolkit-daemonset-c49lj              1/1     Running                    0                 22h
nvidia-cuda-validator-9vsjg                           0/1     Completed                  0                 22h
nvidia-dcgm-exporter-7xkdt                            1/1     Running                    0                 22h
nvidia-dcgm-twrj9                                     1/1     Running                    0                 22h
nvidia-device-plugin-daemonset-gjm67                  1/1     Running                    0                 22h
nvidia-device-plugin-validator-frd7l                  0/1     UnexpectedAdmissionError   0                 3m27s
nvidia-driver-daemonset-415.92.202407191425-0-bs2xs   2/2     Running                    0                 22h
nvidia-node-status-exporter-zzrdg                     1/1     Running                    0                 22h
nvidia-operator-validator-hr5pg                       0/1     Init:3/4                   201 (6m16s ago)   22h
$ 
$ 
$ oc describe pod -n nvidia-gpu-operator nvidia-device-plugin-validator-frd7l
Name:             nvidia-device-plugin-validator-frd7l
Namespace:        nvidia-gpu-operator
Priority:         0
Service Account:  nvidia-operator-validator
Node:             wrk-89/
Start Time:       Thu, 07 Nov 2024 11:54:27 -0700
Labels:           app=nvidia-device-plugin-validator
Annotations:      openshift.io/scc: restricted-v2
                  seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status:           Failed
Reason:           UnexpectedAdmissionError
Message:          Pod was rejected: Allocate failed due to requested number of devices unavailable for nvidia.com/gpu. Requested: 1, Available: 0, which is unexpected
IP:               
IPs:              <none>
Controlled By:    ClusterPolicy/gpu-cluster-policy
Init Containers:
  plugin-validation:
    Image:      nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:70a0bd29259820d6257b04b0cdb6a175f9783d4dd19ccc4ec6599d407c359ba5
    Port:       <none>
    Host Port:  <none>
    Command:
      sh
      -c
    Args:
      vectorAdd
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1
    Environment:       <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8h2dn (ro)
Containers:
  nvidia-device-plugin-validator:
    Image:      nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:70a0bd29259820d6257b04b0cdb6a175f9783d4dd19ccc4ec6599d407c359ba5
    Port:       <none>
    Host Port:  <none>
    Command:
      sh
      -c
    Args:
      echo device-plugin workload validation is successful
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8h2dn (ro)
Volumes:
  kube-api-access-8h2dn:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason                    Age    From     Message
  ----     ------                    ----   ----     -------
  Warning  UnexpectedAdmissionError  3m40s  kubelet  Allocate failed due to requested number of devices unavailable for nvidia.com/gpu. Requested: 1, Available: 0, which is unexpected

joachimweyl commented 3 days ago

@computate what is the status of this fix?

msdisme commented 3 days ago

from RH case: - dir NVIDIA share an RCA yet? (context for question, sharing this with a Lenovo engineer.)

Steps taken:

Cluster policy and nvidia-gpu-operator deleted nvidia-gpu-operator reinstalled with below Cluster policy settings:

toolkit: env:

name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED value: "false"
name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS value: "true" devicePlugin: env:
- name: DEVICE_LIST_STRATEGY value: volume-mounts

The validator pods are now running.

$ oc get pods -n nvidia-gpu-operator NAME gpu-feature-discovery-5nxpm gpu-feature-discovery-vcqrx gpu-operator-5f584db4b9-9lzsz nvidia-container-toolkit-daemonset-cxnfz nvidia-container-toolkit-daemonset-vjwcr nvidia-cuda-validator-8zbmq nvidia-cuda-validator-pmm4b nvidia-dcgm-exporter-58l54 nvidia-dcgm-exporter-spmqf nvidia-dcgm-kqmbt nvidia-dcgm-kzvgx nvidia-device-plugin-daemonset-m2x68 nvidia-device-plugin-daemonset-sx6sq nvidia-device-plugin-validator-cqxtn nvidia-device-plugin-validator-q6rz7 nvidia-driver-daemonset-415.92. nvidia-driver-daemonset-415.92. nvidia-mig-manager-jttm9 nvidia-node-status-exporter-rtcjz nvidia-node-status-exporter-wbvcg nvidia-operator-validator-6fd9k nvidia-operator-validator-cpcwk READY STATUS RESTARTS AGE 1/1 Running 0 7m18s 1/1 Running 0 3m28s 1/1 Running 0 9m25s 1/1 Running 0 7m18s 1/1 Running 0 7m11s 0/1 Completed 0 3m18s 0/1 Completed 0 5m2s 1/1 Running 0 3m28s 1/1 Running 0 7m18s 1/1 Running 0 7m18s 1/1 Running 0 3m28s 1/1 Running 2 (5m16s ago) 7m18s 1/1 Running 0 3m28s 0/1 Completed 0 4m50s 0/1 Completed 0 2m57s 202408100433-0-47wt6 2/2 Running 0 7m59s 202408100433-0-7c586 2/2 Running 0 7m59s 1/1 Running 0 7m10s 1/1 Running 0 7m55s 1/1 Running 0 7m55s 1/1 Running 0 7m11s 1/1 Running 0 7m18s

Nvidia will share an RCA soon, I will keep you posted.

nerc-project / operations

NVIDIA GPU Operator gpu-cluster-policy in OperandNotReady state in multiple clusters #768