computate opened 1 month ago
The pod that is failing is on wrk-88, if that makes a difference; the node looks Ready.
This is also happening in the nerc-ocp-test cluster now with our only GPU there. I'm pretty sure that means we can't use the GPU, because the drivers are not installed on the node wrk-3 because of this.
$ oc -n nvidia-gpu-operator logs -l app=nvidia-device-plugin-validator -c plugin-validation
Failed to allocate device vector A (error code CUDA driver version is insufficient for CUDA runtime version)!
[Vector addition of 50000 elements]
I created this Red Hat support case to address this.
Googling the error code led me to this: https://stackoverflow.com/questions/3253257/cuda-driver-version-is-insufficient-for-cuda-runtime-version
which has this comment: "-> CUDA driver version is insufficient for CUDA runtime version
But this error is misleading, by selecting back the NVIDIA(Performance mode) with nvidia-settings utility the problem disappears.
It is not a version problem."
Are the GPUs in power saving mode?
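e.g. something like this from the driver pod could confirm it (a sketch; the selector is the one used elsewhere in this issue, and the node name is just an example):
$ oc exec -ti $(oc get pod -o name -n nvidia-gpu-operator -l=app.kubernetes.io/component=nvidia-driver --field-selector spec.nodeName=wrk-3) -n nvidia-gpu-operator -- nvidia-smi --query-gpu=name,pstate,power.draw,power.limit --format=csv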
@StHeck yeah you're right, just confirmed it's not a version error: https://docs.nvidia.com/deploy/cuda-compatibility/#cuda-11-and-later-defaults-to-minor-version-compatibility
According to nvidia-smi we have a valid config:
| NVIDIA-SMI 550.90.07 Driver Version: 550.90.07 CUDA Version: 12.4 |
Looked at all workloads on the wrk-3 node in the nerc-ocp-test cluster to try to rule out race conditions preventing the GPU operator validator from starting properly. No workloads are competing for the GPU, however.
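Something like this lists everything scheduled on that node (a sketch):
$ oc get pods -A -o wide --field-selector spec.nodeName=wrk-3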
Tried deleting and replacing the clusterPolicy with the following specs:
{
  "apiVersion": "nvidia.com/v1",
  "kind": "ClusterPolicy",
  "metadata": {
    "name": "gpu-cluster-policy"
  },
  "spec": {
    "operator": {
      "defaultRuntime": "crio",
      "use_ocp_driver_toolkit": true,
      "initContainer": {}
    },
    "sandboxWorkloads": {
      "enabled": false,
      "defaultWorkload": "container"
    },
    "driver": {
      "enabled": true,
      "useNvidiaDriverCRD": false,
      "useOpenKernelModules": false,
      "upgradePolicy": {
        "autoUpgrade": true,
        "drain": {
          "deleteEmptyDir": false,
          "enable": false,
          "force": false,
          "timeoutSeconds": 300
        },
        "maxParallelUpgrades": 1,
        "maxUnavailable": "25%",
        "podDeletion": {
          "deleteEmptyDir": false,
          "force": false,
          "timeoutSeconds": 300
        },
        "waitForCompletion": {
          "timeoutSeconds": 0
        }
      },
      "repoConfig": {
        "configMapName": ""
      },
      "certConfig": {
        "name": ""
      },
      "licensingConfig": {
        "nlsEnabled": true,
        "configMapName": ""
      },
      "virtualTopology": {
        "config": ""
      },
      "kernelModuleConfig": {
        "name": ""
      }
    },
    "dcgmExporter": {
      "enabled": true,
      "config": {
        "name": ""
      },
      "serviceMonitor": {
        "enabled": true
      }
    },
    "dcgm": {
      "enabled": true
    },
    "daemonsets": {
      "updateStrategy": "RollingUpdate",
      "rollingUpdate": {
        "maxUnavailable": "1"
      }
    },
    "devicePlugin": {
      "enabled": true,
      "config": {
        "name": "",
        "default": ""
      },
      "mps": {
        "root": "/run/nvidia/mps"
      }
    },
    "gfd": {
      "enabled": true
    },
    "migManager": {
      "enabled": true
    },
    "nodeStatusExporter": {
      "enabled": true
    },
    "mig": {
      "strategy": "single"
    },
    "toolkit": {
      "enabled": true
    },
    "validator": {
      "plugin": {
        "env": [
          {
            "name": "WITH_WORKLOAD",
            "value": "false"
          }
        ]
      }
    },
    "vgpuManager": {
      "enabled": false
    },
    "vgpuDeviceManager": {
      "enabled": true
    },
    "sandboxDevicePlugin": {
      "enabled": true
    },
    "vfioManager": {
      "enabled": true
    },
    "gds": {
      "enabled": false
    },
    "gdrcopy": {
      "enabled": false
    }
  }
}
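(For reference, swapping the policy out was roughly this, a sketch; the file name is just an example:)
$ oc delete clusterpolicy gpu-cluster-policy
$ oc apply -f gpu-cluster-policy.json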
No luck with the clusterPolicy. We have passed the must-gather along to NVIDIA and are awaiting a response from them.
FYI I have disabled auto-sync within ArgoCD while we play around with these resources.
Looks like this is also related: https://github.com/nerc-project/operations/issues/782
@computate can we remove " in nerc-ocp-prod" from the title since now we see the same issue in nerc-ocp-test?
Done changing the title @joachimweyl
As of today (2024-10-24, 12:55 ET), NVIDIA operator version and last update, from a call with @tssala23 @dystewart @schwesig:
| cluster | version | last update | error | machine config update | model failing | nodes |
|---|---|---|---|---|---|---|
| prod | 24.6.2 | Sep 25. | yes | no | A100 yes, V100 no | wrk-97 |
| test-2 (kruize) | 24.3.0 | Oct 21. | yes | yes Oct 20. | A100 yes, V100 n/a | MOC-R8PAC23U39 |
| beta | 24.6.2 | | no | yes Oct 4. | A100 no, V100 n/a | |
| test | 24.6.2 | | yes | no | A100 yes, V100 yes | MOC-R8PAC23U27, |
Vague error message:
Failed to allocate device vector A (error code CUDA driver version is insufficient for CUDA runtime version)! [Vector addition of 50000 elements]
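One way to re-trigger this check is to delete the validator pod so the operator recreates it (a sketch, same label as the logs command at the top of this issue):
$ oc -n nvidia-gpu-operator delete pod -l app=nvidia-device-plugin-validator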
Idea for next step: @Taj Salawu removing and re-adding a GPU node on kruize (test-2) to try a fresh restart.
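(The drain/remove half would look roughly like this, a sketch only; wrk-5 is the test-2 GPU node mentioned below, and re-adding it goes through the normal provisioning flow for that cluster:)
$ oc adm cordon wrk-5
$ oc adm drain wrk-5 --ignore-daemonsets --delete-emptydir-data
$ oc delete node wrk-5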
@schwesig do we know what nodes specifically are having these issues? Can you update the table above to include a column for node names?
@computate how many GPUs were out of order? Am I correct that they were out of order from Oct 11th to 31st?
@joachimweyl Correct about Oct 11th - 31st. I understand there were 2 GPU nodes broken on the test cluster (wrk-3, wrk-4), 2 GPU nodes broken on the prod cluster (wrk-97, wrk-99), and 1 GPU node broken with 4 GPU slices affected on cluster test-2 (wrk-5).
@computate do we know which ones were V100 and which were A100? Actually, what would be most helpful is knowing the node names; for example, test was using MOC-R8PAC23U27 and test-2 was using MOC-R8PAC23U39. Do we know what prod was using?
Running with these gpu-cluster-policy settings in prod is allowing the drivers to be installed on 9 of 11 nodes so far.
manager:
  env:
    - name: ENABLE_GPU_POD_EVICTION
      value: "true"
    - name: ENABLE_AUTO_DRAIN
      value: "true"
    - name: DRAIN_USE_FORCE
      value: "true"
    - name: DRAIN_DELETE_EMPTYDIR_DATA
      value: "true"
repoConfig:
  configMapName: ""
upgradePolicy:
  autoUpgrade: false
  drain:
    deleteEmptyDir: true
    enable: true
    force: true
    timeoutSeconds: 300
  maxParallelUpgrades: 1
  maxUnavailable: 25%
  podDeletion:
    deleteEmptyDir: true
    force: true
    timeoutSeconds: 300
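(repoConfig and upgradePolicy appear under spec.driver in the full policy pasted earlier, and I believe manager does too; a merge patch along these lines is one way to apply just the env part, a sketch:)
$ oc patch clusterpolicy gpu-cluster-policy --type merge -p '{"spec":{"driver":{"manager":{"env":[
    {"name":"ENABLE_GPU_POD_EVICTION","value":"true"},
    {"name":"ENABLE_AUTO_DRAIN","value":"true"},
    {"name":"DRAIN_USE_FORCE","value":"true"},
    {"name":"DRAIN_DELETE_EMPTYDIR_DATA","value":"true"}]}}}}'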
Here is the list of nodes where the NVIDIA driver is successfully applied now.
$ oc -n nvidia-gpu-operator get pod -o wide | grep nvidia-operator-validator | grep Running
nvidia-operator-validator-697f5 1/1 Running 0 116m 10.131.23.245 wrk-99 <none> <none>
nvidia-operator-validator-8nm2q 1/1 Running 0 117m 10.130.19.118 wrk-105 <none> <none>
nvidia-operator-validator-dd6xw 1/1 Running 0 116m 10.129.19.48 wrk-102 <none> <none>
nvidia-operator-validator-gvlgd 1/1 Running 0 115m 10.128.23.147 wrk-107 <none> <none>
nvidia-operator-validator-kqdwr 1/1 Running 0 117m 10.130.24.128 wrk-108 <none> <none>
nvidia-operator-validator-lpsvx 1/1 Running 0 116m 10.129.23.245 wrk-106 <none> <none>
nvidia-operator-validator-n2pbh 1/1 Running 0 112m 10.130.12.56 wrk-88 <none> <none>
nvidia-operator-validator-ndnkc 1/1 Running 0 116m 10.129.24.193 wrk-104 <none> <none>
nvidia-operator-validator-z8zs8 1/1 Running 0 117m 10.128.24.144 wrk-103 <none> <none>
Currently the NVIDIA driver is failing on wrk-97, but completed on wrk-89.
$ oc -n nvidia-gpu-operator get pod -o wide | grep nvidia-driver-daemonset | grep -v Running
nvidia-driver-daemonset-415.92.202407191425-0-l6mtg 0/2 Init:CrashLoopBackOff 4 (16s ago) 7m2s 10.129.21.35 wrk-97 <none> <none>
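To see which init container is failing and why, something like this should help (a sketch; pod name copied from above):
$ oc -n nvidia-gpu-operator get pod nvidia-driver-daemonset-415.92.202407191425-0-l6mtg -o jsonpath='{.spec.initContainers[*].name}'; echo
$ oc -n nvidia-gpu-operator describe pod nvidia-driver-daemonset-415.92.202407191425-0-l6mtg
followed by oc logs -c <init-container-name> on whichever one is crashing.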
The NVIDIA driver validation is failing on wrk-97 and wrk-89.
$ oc -n nvidia-gpu-operator get pod -o wide | grep nvidia-operator-validator | grep -v Running
nvidia-operator-validator-fsbfj 0/1 Init:0/4 0 71s 10.129.21.78 wrk-97 <none> <none>
nvidia-operator-validator-hr5pg 0/1 Init:3/4 19 (4m21s ago) 115m 10.131.11.208 wrk-89 <none> <none>
Here are some useful commands I learned for viewing GPU utilization.
$ oc --as system:admin get node wrk-89 -o jsonpath={'.status.allocatable.nvidia\.com/gpu'}; echo
1
$ oc --as system:admin get node wrk-89 -o jsonpath={'.status.capacity.nvidia\.com/gpu'}; echo
1
$ oc --as system:admin get node wrk-97 -o jsonpath={'.status.allocatable.nvidia\.com/gpu'}; echo
0
$ oc --as system:admin get node wrk-97 -o jsonpath={'.status.capacity.nvidia\.com/gpu'}; echo
0
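A variant to dump allocatable GPUs for every node at once (a sketch, same jsonpath escaping as above):
$ oc --as system:admin get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'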
$ oc exec -ti $(oc get pod -o name -n nvidia-gpu-operator -l=app.kubernetes.io/component=nvidia-driver --field-selector spec.nodeName=wrk-89) -n nvidia-gpu-operator -- nvidia-smi
Thu Nov 7 18:48:25 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.05 Driver Version: 550.127.05 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Tesla V100-PCIE-32GB On | 00000000:3B:00.0 Off | 0 |
| N/A 35C P0 25W / 250W | 4MiB / 32768MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
$
$
$ oc exec -ti $(oc get pod -o name -n nvidia-gpu-operator -l=app.kubernetes.io/component=nvidia-driver --field-selector spec.nodeName=wrk-97) -n nvidia-gpu-operator -- nvidia-smi
error: unable to upgrade connection: container not found ("nvidia-driver-ctr")
Also this command:
$ oc get pod -n nvidia-gpu-operator --field-selector spec.nodeName=wrk-97
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-r4czn 0/1 Init:1/2 0 3m57s
nvidia-container-toolkit-daemonset-5zwpj 0/1 Init:0/1 0 3m57s
nvidia-dcgm-exporter-qcsmc 1/1 Running 0 3m57s
nvidia-dcgm-g4kgc 1/1 Running 0 3m57s
nvidia-device-plugin-daemonset-7gf6n 1/1 Running 0 3m57s
nvidia-driver-daemonset-415.92.202407191425-0-m6hkg 0/2 Init:CrashLoopBackOff 60 (3m57s ago) 5h42m
nvidia-mig-manager-k59gx 1/1 Running 0 3m57s
nvidia-node-status-exporter-schzg 1/1 Running 0 22h
nvidia-operator-validator-5kbxt 0/1 Init:0/4 0 3m57s
$ oc get pod -n nvidia-gpu-operator --field-selector spec.nodeName=wrk-89
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-p4pcz 1/1 Running 0 22h
nvidia-container-toolkit-daemonset-c49lj 1/1 Running 0 22h
nvidia-cuda-validator-9vsjg 0/1 Completed 0 22h
nvidia-dcgm-exporter-7xkdt 1/1 Running 0 22h
nvidia-dcgm-twrj9 1/1 Running 0 22h
nvidia-device-plugin-daemonset-gjm67 1/1 Running 0 22h
nvidia-device-plugin-validator-b25kp 0/1 UnexpectedAdmissionError 0 6m48s
nvidia-driver-daemonset-415.92.202407191425-0-bs2xs 2/2 Running 0 22h
nvidia-node-status-exporter-zzrdg 1/1 Running 0 22h
nvidia-operator-validator-hr5pg 0/1 Init:CrashLoopBackOff 200 (107s ago) 22h
$ oc get pod -n nvidia-gpu-operator --field-selector spec.nodeName=wrk-89
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-p4pcz 1/1 Running 0 22h
nvidia-container-toolkit-daemonset-c49lj 1/1 Running 0 22h
nvidia-cuda-validator-9vsjg 0/1 Completed 0 22h
nvidia-dcgm-exporter-7xkdt 1/1 Running 0 22h
nvidia-dcgm-twrj9 1/1 Running 0 22h
nvidia-device-plugin-daemonset-gjm67 1/1 Running 0 22h
nvidia-device-plugin-validator-frd7l 0/1 UnexpectedAdmissionError 0 3m27s
nvidia-driver-daemonset-415.92.202407191425-0-bs2xs 2/2 Running 0 22h
nvidia-node-status-exporter-zzrdg 1/1 Running 0 22h
nvidia-operator-validator-hr5pg 0/1 Init:3/4 201 (6m16s ago) 22h
$
$
$ oc describe pod -n nvidia-gpu-operator nvidia-device-plugin-validator-frd7l
Name: nvidia-device-plugin-validator-frd7l
Namespace: nvidia-gpu-operator
Priority: 0
Service Account: nvidia-operator-validator
Node: wrk-89/
Start Time: Thu, 07 Nov 2024 11:54:27 -0700
Labels: app=nvidia-device-plugin-validator
Annotations: openshift.io/scc: restricted-v2
seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status: Failed
Reason: UnexpectedAdmissionError
Message: Pod was rejected: Allocate failed due to requested number of devices unavailable for nvidia.com/gpu. Requested: 1, Available: 0, which is unexpected
IP:
IPs: <none>
Controlled By: ClusterPolicy/gpu-cluster-policy
Init Containers:
plugin-validation:
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:70a0bd29259820d6257b04b0cdb6a175f9783d4dd19ccc4ec6599d407c359ba5
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
vectorAdd
Limits:
nvidia.com/gpu: 1
Requests:
nvidia.com/gpu: 1
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8h2dn (ro)
Containers:
nvidia-device-plugin-validator:
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:70a0bd29259820d6257b04b0cdb6a175f9783d4dd19ccc4ec6599d407c359ba5
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
echo device-plugin workload validation is successful
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8h2dn (ro)
Volumes:
kube-api-access-8h2dn:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
ConfigMapName: openshift-service-ca.crt
ConfigMapOptional: <nil>
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
nvidia.com/gpu:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning UnexpectedAdmissionError 3m40s kubelet Allocate failed due to requested number of devices unavailable for nvidia.com/gpu. Requested: 1, Available: 0, which is unexpected
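(One thing worth trying for the "Requested: 1, Available: 0" state above is bouncing the device plugin pod on wrk-89 so it re-registers its devices with the kubelet, then re-checking the allocatable count; a sketch using the pod name from the listing above:)
$ oc -n nvidia-gpu-operator delete pod nvidia-device-plugin-daemonset-gjm67
$ oc --as system:admin get node wrk-89 -o jsonpath={'.status.allocatable.nvidia\.com/gpu'}; echo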
@computate what is the status of this fix?
From the RH case: did NVIDIA share an RCA yet? (Context for the question: sharing this with a Lenovo engineer.)
Steps taken:
Cluster policy and nvidia-gpu-operator deleted; nvidia-gpu-operator reinstalled with the Cluster policy settings below:
toolkit:
  env:
The validator pods are now running.
$ oc get pods -n nvidia-gpu-operator
NAME                                                  READY   STATUS      RESTARTS        AGE
gpu-feature-discovery-5nxpm                           1/1     Running     0               7m18s
gpu-feature-discovery-vcqrx                           1/1     Running     0               3m28s
gpu-operator-5f584db4b9-9lzsz                         1/1     Running     0               9m25s
nvidia-container-toolkit-daemonset-cxnfz              1/1     Running     0               7m18s
nvidia-container-toolkit-daemonset-vjwcr              1/1     Running     0               7m11s
nvidia-cuda-validator-8zbmq                           0/1     Completed   0               3m18s
nvidia-cuda-validator-pmm4b                           0/1     Completed   0               5m2s
nvidia-dcgm-exporter-58l54                            1/1     Running     0               3m28s
nvidia-dcgm-exporter-spmqf                            1/1     Running     0               7m18s
nvidia-dcgm-kqmbt                                     1/1     Running     0               7m18s
nvidia-dcgm-kzvgx                                     1/1     Running     0               3m28s
nvidia-device-plugin-daemonset-m2x68                  1/1     Running     2 (5m16s ago)   7m18s
nvidia-device-plugin-daemonset-sx6sq                  1/1     Running     0               3m28s
nvidia-device-plugin-validator-cqxtn                  0/1     Completed   0               4m50s
nvidia-device-plugin-validator-q6rz7                  0/1     Completed   0               2m57s
nvidia-driver-daemonset-415.92.202408100433-0-47wt6   2/2     Running     0               7m59s
nvidia-driver-daemonset-415.92.202408100433-0-7c586   2/2     Running     0               7m59s
nvidia-mig-manager-jttm9                              1/1     Running     0               7m10s
nvidia-node-status-exporter-rtcjz                     1/1     Running     0               7m55s
nvidia-node-status-exporter-wbvcg                     1/1     Running     0               7m55s
nvidia-operator-validator-6fd9k                       1/1     Running     0               7m11s
nvidia-operator-validator-cpcwk                       1/1     Running     0               7m18s
NVIDIA will share an RCA soon; I will keep you posted.
I was trying to figure out why the wrk-99 node in nerc-ocp-prod has 4 GPU devices but the GPU utilization for wrk-99 on nerc-ocp-prod is not available.
I found that the NVIDIA GPU Operator gpu-cluster-policy on nerc-ocp-prod is in an OperandNotReady status, but the NVIDIA GPU Operator gpu-cluster-policy on nerc-ocp-test is in a Ready state. I also noticed that the plugin-validation container of the nvidia-operator-validator-nnjjx pod in the nvidia-gpu-operator namespace is not becoming ready and has a repeated error in the log:
time="2024-10-11T15:35:31Z" level=info msg="pod nvidia-device-plugin-validator-6sb75 is curently in Failed phase"
I don't know the reason for this error with the gpu-cluster-policy in nerc-ocp-prod.
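(To watch the ClusterPolicy state directly from the CLI, something like this should work, a sketch; I believe the operator reports it in .status.state:)
$ oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.status.state}'; echo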