@schwesig
The test cluster already has a GPU host: `wrk-10`. That said, it appears to have been removed at some point, although I don't recall doing this work. While attempting to investigate and get it back into the cluster, I discovered the OBM is down, so it might not have survived the annual power maintenance. We'll most likely need to pull in Tech Square or Lenovo. cc @hakasapl
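For whoever picks this up, a quick way to confirm the OBM state might look like the following; the OBM address and credentials are placeholders:

```
# basic reachability check against wrk-10's OBM interface
$ ping -c 3 <wrk-10-obm-address>

# if it responds, query the chassis power state over IPMI
$ ipmitool -I lanplus -H <wrk-10-obm-address> -U <user> -P <pass> chassis power status
```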
Actually, looking back in Slack, the `wrk-10` host was removed back on April 4th to be transferred to ESI. That would explain why I can't reach the OBM 🙃. @hakasapl is that host still being used by ESI currently?
If it's still in use and can't be moved back (or used under ESI with NERC networks) to add it back to `ocp-test`, then yes, we'll need to pull a node from production.
Turns out it wasn't currently in use by ESI. @hakasapl switched the OBM port back to NERC networks and I was able to add the host back to the `nerc-ocp-test` cluster:
```
$ oc get node wrk-10
NAME     STATUS   ROLES    AGE     VERSION
wrk-10   Ready    worker   2m32s   v1.28.9+2f7b992
```
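For future reference, when a node rejoins like this its kubelet CSRs usually need approving before it goes Ready. A sketch of that step, using the pending-CSR filter from the OpenShift docs:

```
# list CSRs; pending ones have an empty status
$ oc get csr

# approve every pending CSR (client and serving certs for the rejoining kubelet)
$ oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' \
    | xargs --no-run-if-empty oc adm certificate approve
```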
That said, it appears the Node Feature Discovery operator isn't quite working as expected, as I don't see any pods in the `openshift-nfd` namespace (there should be one `nfd-worker` pod per node there). This means the host won't get picked up by the NVIDIA GPU operator, which in turn won't launch its pods for this host in the `nvidia-gpu-operator` namespace:
```
$ oc get pods -n openshift-nfd
NAME                                     READY   STATUS    RESTARTS   AGE
nfd-controller-manager-76d565bcc-rgnhb   2/2     Running   0          161m
```
```
$ oc get pods -n nvidia-gpu-operator
NAME                            READY   STATUS    RESTARTS   AGE
gpu-operator-567c8fdc87-997v7   1/1     Running   1          20d
```
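A couple of quick checks that can narrow down where NFD is stuck (using `wrk-10` as the example node):

```
# the nfd-worker DaemonSet should exist once the operator has a CR to reconcile
$ oc get daemonset -n openshift-nfd

# a healthy NFD install labels nodes under the feature.node.kubernetes.io/ prefix
$ oc describe node wrk-10 | grep feature.node.kubernetes.io
```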
Need to look into this further, but I suspect there's some issue after upgrading to OpenShift 4.15.
> That said, it appears the Node Feature Discovery operator isn't quite working as expected, as I don't see any pods in the `openshift-nfd` namespace (there should be one `nfd-worker` pod per node there).
I was able to resolve that issue by applying the following manifests:
The cluster was missing the NodeFeatureDiscovery CR, which is why `nfd-worker` wasn't running on any of the hosts in the cluster.
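For reference, a minimal NodeFeatureDiscovery CR is roughly the following shape; this is a sketch rather than the exact manifest applied here, and the operand image tag would need to match the cluster's OpenShift version:

```
$ oc apply -f - <<'EOF'
apiVersion: nfd.openshift.io/v1
kind: NodeFeatureDiscovery
metadata:
  name: nfd-instance
  namespace: openshift-nfd
spec:
  operand:
    # image tag assumed to track the cluster version
    image: registry.redhat.io/openshift4/ose-node-feature-discovery:v4.15
    imagePullPolicy: IfNotPresent
EOF
```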
Looks like there is an Argo CD app for that overlay:
I suspect Argo CD is having some issue applying manifests for the test cluster.
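One way to check that from the cluster side; the app name below is a placeholder, not the actual Argo CD app name:

```
# list Argo CD Applications with their sync and health status
$ oc get applications.argoproj.io -A

# or inspect the specific app with the argocd CLI
$ argocd app get <app-name>
```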
The `wrk-10` host should be fully ready for GPU workloads now:
```
$ oc rsh -n nvidia-gpu-operator nvidia-driver-daemonset-415.92.202404251009-0-vtfvl
sh-4.4# nvidia-smi
Fri May 31 14:59:07 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-40GB          On  |   00000000:31:00.0 Off |                    0 |
| N/A   23C    P0             50W /  400W |       0MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-40GB          On  |   00000000:4B:00.0 Off |                    0 |
| N/A   23C    P0             51W /  400W |       0MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A100-SXM4-40GB          On  |   00000000:CA:00.0 Off |                    0 |
| N/A   23C    P0             48W /  400W |       0MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A100-SXM4-40GB          On  |   00000000:E3:00.0 Off |                    0 |
| N/A   23C    P0             49W /  400W |       0MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
```
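As a final smoke test, we could schedule a workload against the advertised `nvidia.com/gpu` resource. A sketch using the CUDA vectorAdd sample image from NVIDIA's verification docs (the pod name is arbitrary and the image tag may need updating):

```
$ oc apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda-vectoradd
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubi8
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

# the sample should log "Test PASSED" if the GPU was allocated correctly
$ oc logs gpu-smoke-test
```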
@DanNiESh & @dystewart what are the next steps for the creation of the roadmap for GPU testing?
The roadmap is complete; now we just need to finish the tasks in it.
With RHOAI upgraded in `nerc-ocp-test`, we need to acquire a GPU node (from the prod cluster) and onboard it back to the test cluster. We'll want to keep the testing period short, so we'll have to be fairly aggressive with our testing schedule; we'd like to have this done over 2-3 days.
Here's a list of items we need to test and experiment with:
- Preparation: (1 week)
- GPU and AI: (estimated time to complete: about 2-3 days)
This list is duplicated from part of https://github.com/nerc-project/operations/issues/547. Duplicating it here for further elaboration and to potentially add more tasks.