Roadmap for gpu testing with rhoai in the nerc-ocp-test cluster

dystewart commented 1 month ago

With RHOAI upgraded in nerc-ocp-test we need to acquire a gpu node (from the prod cluster) and onboard it back to the test cluster. We will want to keep the testing period short so we'll have to be fairly aggressive on our testing schedule as we'd like to have this done over 2-3 days.

Here's a list of items we need to test and experiment with:

Preparation: (1 week)
- [x] A test ML/AI workload: https://github.com/rh-aiservices-bu/getting-started-with-gpus/tree/main
- [ ] Try and test Python AI tools rebuilt wheels on current cluster
GPU and AI: (estimated time to complete: about 2-3 days)
- [ ] Run the test ML/AI workload to confirm proper gpu functioning
- [ ] Test Python AI tools rebuilt wheels on the new OpenShift cluster. For details, refer to the ongoing project: Rebuilding the Wheel.
- [ ] Once GPUs are ready, test MIG settings with GPU AcceleratorProfile
- [ ] Run the test ML workload to ensure proper MIG operation

This list is duplicated from part of: https://github.com/nerc-project/operations/issues/547. Duplicating here for further elaboration and potentially adding more tasks

schwesig commented 1 month ago

@schwesig

jtriley commented 1 month ago

The test cluster already has a GPU host: wrk-10. That said it appears to have been removed at some point although I don't recall doing this work. Attempting to investigate and get it back into the cluster I discovered the OBM is down so it might not have survived the annual power maintenance. Need to pull in tech square or Lenovo most likely. cc @hakasapl

jtriley commented 1 month ago

Actually, looking back in slack the wrk-10 host was removed back on April 4th to be transfered to ESI. That would explain why I can't reach the OBM 🙃 . @hakasapl is that host still being used by ESI currently?

jtriley commented 1 month ago

If it's still in use and can't be moved back or used under ESI with NERC nets to add it back to ocp-test then yes we'll need to pull a node from production.

jtriley commented 1 month ago

Turns out it wasn't currently in use by ESI. @hakasapl switched the OBM port back to NERC networks and I was able to add the host back to the nerc-ocp-test cluster:

$ oc get node wrk-10
NAME     STATUS   ROLES    AGE     VERSION
wrk-10   Ready    worker   2m32s   v1.28.9+2f7b992

That said, it appears the node feature discovery operator isn't quite working as expected as I don't see any pods in the openshift-nfd namespace (should be one nfd-worker pod per node in that namespace). This means the host won't get picked up by the nvidia-gpu-operator and launch pods on that host in the nvidia-gpu-operator namespace:

$ oc get pods -n openshift-nfd
NAME                                     READY   STATUS    RESTARTS   AGE
nfd-controller-manager-76d565bcc-rgnhb   2/2     Running   0          161m

$ oc get pods -n nvidia-gpu-operator
NAME                            READY   STATUS    RESTARTS   AGE
gpu-operator-567c8fdc87-997v7   1/1     Running   1          20d

Need to look into this further but I suspect there's some issue after upgrading to OpenShift 4.15

jtriley commented 1 month ago

That said, it appears the node feature discovery operator isn't quite working as expected as I don't see any pods in the openshift-nfd namespace (should be one nfd-worker pod per node in that namespace).

I was able to resolve that issue by applying the following manifests:

https://github.com/OCP-on-NERC/nerc-ocp-config/blob/main/nfd-operator/overlays/nerc-ocp-test/kustomization.yaml

The cluster was missing the NodeFeatureDiscovery CR which is why it wasn't running on any of the hosts in the cluster.

Looks like there is an argocd app for that overlay:

https://github.com/OCP-on-NERC/nerc-ocp-apps/blob/main/clusters/nerc-ocp-test/kustomization.yaml#L41-L47

I suspect argocd is having some issue applying manifests for the test cluster.

jtriley commented 1 month ago

The wrk-10 host should be fully ready for GPU workloads now:

$ oc rsh -n nvidia-gpu-operator nvidia-driver-daemonset-415.92.202404251009-0-vtfvl
sh-4.4# nvidia-smi
Fri May 31 14:59:07 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-40GB          On  |   00000000:31:00.0 Off |                    0 |
| N/A   23C    P0             50W /  400W |       0MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-40GB          On  |   00000000:4B:00.0 Off |                    0 |
| N/A   23C    P0             51W /  400W |       0MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A100-SXM4-40GB          On  |   00000000:CA:00.0 Off |                    0 |
| N/A   23C    P0             48W /  400W |       0MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A100-SXM4-40GB          On  |   00000000:E3:00.0 Off |                    0 |
| N/A   23C    P0             49W /  400W |       0MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

joachimweyl commented 2 weeks ago

@DanNiESh & @dystewart what are the next steps for the creation of the roadmap for GPU testing?

DanNiESh commented 2 weeks ago

Next step: https://github.com/nerc-project/operations/issues/501

DanNiESh commented 1 week ago

The road map is complete, just need to work on finishing the tasks in the roadmap

nerc-project / operations

Roadmap for gpu testing with rhoai in the nerc-ocp-test cluster #589