schwesig closed 6 months ago
/cc @computate @dystewart @hpdempsey
@schwesig how are the tests going?
We have successfully tested launching 28 Jupyter notebooks, each claiming 1 GPU slice.
A Guide to Functional and Performance Testing of the NVIDIA DGX A100
https://www.redhat.com/en/blog/a-guide-to-functional-and-performance-testing-of-the-nvidia-dgx-a100
In this work, we paid particular attention to the reproducibility of the functional and performance testing, so that the whole testing procedure can be easily re-executed in any freshly deployed OpenShift cluster.

A Guide to Scaling OpenShift Data Science to Hundreds of Users and Notebooks
https://www.redhat.com/en/blog/a-guide-to-scaling-openshift-data-science-to-hundreds-of-users-and-notebooks
In this blog post, you’ll see how we stress tested OpenShift Data Science notebook spawner to ensure that it can seamlessly support hundreds of simultaneous users.

Using NVIDIA A100’s Multi-Instance GPU to Run Multiple Workloads in Parallel on a Single GPU
In the remainder of this post, we go through the performance benchmarking we performed in parallel with this work to better understand the performance of each MIG instance size, and to validate the isolation of workloads running on different MIG instances of the same GPU in an OpenShift worker node.
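For a rough sense of scale, here is a small capacity sketch relating the number of single-slice GPU claims to the number of A100 cards needed. It assumes every card is split into the smallest MIG profile, which gives at most 7 instances per A100; the actual MIG layout on our nodes is not stated in this thread, so treat the numbers as an upper-bound estimate rather than a description of the cluster. The function name `gpus_needed` is made up for this illustration.

```python
import math

# Assumption: each A100 is partitioned into its smallest MIG profile,
# which allows at most 7 MIG instances per card.
MAX_MIG_SLICES_PER_A100 = 7


def gpus_needed(claims: int, slices_per_gpu: int = MAX_MIG_SLICES_PER_A100) -> int:
    """Return the minimum number of GPUs that can serve `claims` one-slice claims."""
    if claims < 0 or slices_per_gpu <= 0:
        raise ValueError("claims must be >= 0 and slices_per_gpu must be > 0")
    return math.ceil(claims / slices_per_gpu)


if __name__ == "__main__":
    for claims in (28, 200):
        print(f"{claims} one-slice claims need at least {gpus_needed(claims)} A100 GPUs")
```

Under that assumption, the 28 notebooks already tested would fit on 4 fully sliced cards, while 200 simultaneous one-slice claims would need at least 29.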
New NVIDIA A100 GPUs - Quantity Test
We plan to run a quantity test on the newly installed NVIDIA A100 GPUs by spinning up 200 RHODS images with GPU claims. The test will use the script provided in pull request OCP-on-NERC/ope-tests#3.
Test Objectives:
Test Environments:
Procedure:
This test is critical for assessing our infrastructure's readiness for high-demand GPU workloads and ensuring a smooth user experience for RHODS image deployments with GPU utilization.
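To illustrate what "200 images with GPU claims" could look like as Kubernetes objects, here is a minimal sketch that generates 200 pod manifests, each requesting one GPU resource. This is not the script from OCP-on-NERC/ope-tests#3, which may work quite differently (RHODS normally spawns notebooks through its own notebook resources rather than bare pods). The namespace, pod names, label, and image below are hypothetical placeholders. The resource name `nvidia.com/gpu` is the one exposed by the NVIDIA device plugin; with the mixed MIG strategy it would instead be a profile-specific name such as `nvidia.com/mig-1g.5gb`.

```python
import json

# Resource name exposed by the NVIDIA device plugin. With the "mixed" MIG
# strategy this would be a profile-specific name such as "nvidia.com/mig-1g.5gb".
GPU_RESOURCE = "nvidia.com/gpu"

# Hypothetical placeholders, not taken from the issue or the referenced PR.
IMAGE = "example.invalid/notebook:latest"
NAMESPACE = "gpu-quantity-test"


def notebook_pod(index: int, namespace: str = NAMESPACE) -> dict:
    """Build one pod manifest that claims a single GPU (or GPU slice)."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": f"gpu-notebook-{index:03d}",
            "namespace": namespace,
            "labels": {"app": "gpu-quantity-test"},
        },
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "notebook",
                    "image": IMAGE,
                    # GPUs are requested through limits; one unit per pod.
                    "resources": {"limits": {GPU_RESOURCE: "1"}},
                }
            ],
        },
    }


def build_manifests(count: int) -> list:
    """Build `count` pod manifests with unique, zero-padded names."""
    return [notebook_pod(i) for i in range(1, count + 1)]


if __name__ == "__main__":
    manifests = build_manifests(200)
    bundle = {"apiVersion": "v1", "kind": "List", "items": manifests}
    print(f"generated {len(manifests)} manifests")
    # Show only the first manifest; the full bundle could be written to a file.
    print(json.dumps(bundle["items"][0], indent=2))
```

Writing `bundle` to a JSON file would give a single list object that `oc apply -f` can consume. Pods that stay in `Pending` would then indicate that GPU capacity has run out.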