nerc-project / operations

Issues related to the operation of the NERC OpenShift environment

New NVIDIA A100 GPUs - Quantity Test #481

Closed: schwesig closed this issue 6 months ago

schwesig commented 6 months ago


We plan to run a quantity test of the newly installed NVIDIA A100 GPUs by launching 200 RHODS images with GPU claims. The test will use the script provided in pull request OCP-on-NERC/ope-tests#3.

Test Objectives:

Test Environments:

Procedure:

  1. Review and merge the script from the pull request if not already done.
  2. Run the script on the Test Cluster to launch 200 RHODS images with GPU claims.
  3. Monitor the Test Cluster for any performance issues or failures.
  4. Upon successful completion and verification in the Test Cluster, repeat the procedure on the Prod Cluster.
  5. Document any issues encountered and the outcomes of the tests for both clusters.
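
As a rough illustration of what step 2 involves, the sketch below builds 200 pod manifests that each claim one GPU. It is not the script from OCP-on-NERC/ope-tests#3, whose contents are not shown in this issue. The namespace, pod names, label, and image are hypothetical placeholders, and plain Pods are assumed here in place of whatever resources RHODS actually creates for a notebook. The only real identifier is the `nvidia.com/gpu` resource name exposed by the NVIDIA device plugin.

```python
import json

# Hypothetical placeholders, not taken from the issue or the referenced script.
NAMESPACE = "gpu-quantity-test"
IMAGE = "example.invalid/notebook:latest"


def gpu_pod_manifest(index, namespace=NAMESPACE, image=IMAGE):
    """Return a Pod manifest (as a dict) that claims exactly one GPU."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": f"gpu-test-{index:03d}",
            "namespace": namespace,
            "labels": {"app": "gpu-quantity-test"},
        },
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "notebook",
                    "image": image,
                    # The GPU claim: one unit of the nvidia.com/gpu resource.
                    "resources": {"limits": {"nvidia.com/gpu": 1}},
                }
            ],
        },
    }


def build_manifest_list(count=200):
    """Wrap `count` pod manifests in a single v1 List object."""
    items = [gpu_pod_manifest(i) for i in range(1, count + 1)]
    return {"apiVersion": "v1", "kind": "List", "items": items}


def to_json(manifest_list):
    """Serialize the List so it can be written to a file or piped to a client."""
    return json.dumps(manifest_list, indent=2)
```

The JSON from `to_json(build_manifest_list())` could be saved to a file and applied with `oc apply -f <file>` on the Test Cluster first. Pods that stay in `Pending` would then show how many GPU claims the cluster could not satisfy.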

This test is critical for assessing our infrastructure's readiness for high-demand GPU workloads and ensuring a smooth user experience for RHODS image deployments with GPU utilization.

schwesig commented 6 months ago

/cc @computate @dystewart @hpdempsey

joachimweyl commented 6 months ago

@schwesig how are the tests going?

DanNiESh commented 6 months ago

We have successfully tested launching 28 Jupyter notebooks each claiming 1 GPU slice.
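
For context on how 28 slices relates to the planned 200 claims, here is a small capacity calculation. The thread does not state how many GPUs were available or which MIG profile was used, so the numbers below are assumptions. The one fixed fact is that a single A100 can be partitioned into at most seven MIG instances.

```python
import math

# An A100 can be split into at most seven MIG instances.
MAX_MIG_SLICES_PER_A100 = 7


def notebook_capacity(gpu_count, slices_per_gpu=MAX_MIG_SLICES_PER_A100):
    """Number of single-slice notebooks that fit on `gpu_count` GPUs."""
    return gpu_count * slices_per_gpu


def gpus_needed(notebooks, slices_per_gpu=MAX_MIG_SLICES_PER_A100):
    """Smallest number of GPUs that can host `notebooks` single-slice claims."""
    return math.ceil(notebooks / slices_per_gpu)


# Assumed example: four GPUs, each split into seven slices.
print(notebook_capacity(4))  # 28
# GPUs required for the 200 claims planned in the issue, at seven slices each.
print(gpus_needed(200))      # 29
```

Under these assumptions, 28 notebooks would correspond to four fully partitioned A100s, and 200 simultaneous single-slice claims would need at least 29.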

schwesig commented 6 months ago

https://github.com/nerc-project/operations/issues/466

schwesig commented 6 months ago

https://www.redhat.com/en/blog/a-guide-to-functional-and-performance-testing-of-the-nvidia-dgx-a100
A Guide to Functional and Performance Testing of the NVIDIA DGX A100: "In this work, we paid particular attention to the reproducibility of the functional and performance testing, so that the whole testing procedure can be easily re-executed in any freshly deployed OpenShift cluster."

https://www.redhat.com/en/blog/a-guide-to-scaling-openshift-data-science-to-hundreds-of-users-and-notebooks
A Guide to Scaling OpenShift Data Science to Hundreds of Users and Notebooks: "In this blog post, you’ll see how we stress tested OpenShift Data Science notebook spawner to ensure that it can seamlessly support hundreds of simultaneous users."

Using NVIDIA A100’s Multi-Instance GPU to Run Multiple Workloads in Parallel on a Single GPU: "In the remainder of this post, we go through the performance benchmarking we performed in parallel with this work to better understand the performance of each MIG instance size, and to validate the isolation of workloads running on different MIG instances of the same GPU in an OpenShift worker node."