nerc-project / operations

Issues related to the operation of the NERC OpenShift environment

New NVIDIA A100 GPUs - Quality Test #482

Closed schwesig closed 6 months ago

schwesig commented 6 months ago

New NVIDIA A100 GPUs - Quality Test

In addition to the quantity test for the newly installed NVIDIA A100 GPUs (spinning up 200 RHODS images with GPU claims), we are planning a quality test: running a newly developed "Tensorflow Jupyter CUDA" image, designed to exercise the compute power within our OpenShift AI environment, focusing on GPU performance and compatibility. This test will use the new Tensorflow Jupyter CUDA image with the 02_model_training_basics.ipynb notebook.
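
To give an idea of what the quality test checks first, here is a minimal sketch of a GPU visibility check to run inside the workbench (this snippet is illustrative and not taken from the notebook itself; it assumes TensorFlow is available in the Tensorflow Jupyter CUDA image):

    import tensorflow as tf

    # List the GPUs TensorFlow can see inside the workbench pod.
    gpus = tf.config.list_physical_devices("GPU")
    print(f"TensorFlow {tf.__version__} sees {len(gpus)} GPU(s): {gpus}")

    # Run a tiny computation on the GPU to confirm CUDA kernels actually execute.
    if gpus:
        with tf.device("/GPU:0"):
            a = tf.random.normal((1024, 1024))
            b = tf.random.normal((1024, 1024))
            c = tf.matmul(a, b)
        print("Matmul on GPU succeeded, result norm:", float(tf.norm(c)))
    else:
        print("No GPU visible - check the workbench GPU claim and the NVIDIA GPU operator.")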

This test does not need to be exclusive to this image/script. If anything is missing, or if other scripts or images would be useful, feel free to add them.

Test Objectives:

Test Environments:

Follow-up, nice to have, tracked in https://github.com/nerc-project/operations/issues/504:

Procedure:

  1. Conduct a series of TensorFlow jobs to test GPU performance (a minimal example is sketched after this list).
  2. Monitor system stability, noting any crashes or errors.
  3. Deploy the "Tensorflow Jupyter CUDA" image on the Prod Cluster.
  4. Repeat the testing procedure on the Prod Cluster, ensuring consistency and reliability on different clusters.
  5. Document any issues encountered and the outcomes of the tests for both clusters.
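
As a rough illustration of step 1, here is a minimal TensorFlow training job in the spirit of 02_model_training_basics.ipynb (the dataset, model, and sizes are made up for the sketch; the actual notebook may differ):

    import time
    import tensorflow as tf

    # Small synthetic classification dataset, just enough to keep a GPU busy for a bit.
    x = tf.random.normal((50_000, 256))
    y = tf.random.uniform((50_000,), maxval=10, dtype=tf.int32)

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(512, activation="relu"),
        tf.keras.layers.Dense(512, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

    start = time.time()
    model.fit(x, y, batch_size=512, epochs=5, verbose=2)
    print(f"Training took {time.time() - start:.1f}s on "
          f"{len(tf.config.list_physical_devices('GPU'))} GPU(s)")

Comparing the wall-clock time with and without a GPU claim is a quick sanity check that the GPU is actually being used.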

This quality test aims to confirm that the new NVIDIA A100 GPUs are working and can be used for upcoming classes and projects.

schwesig commented 6 months ago

/cc @computate @dystewart @hpdempsey

schwesig commented 6 months ago

I created a smart-village project and namespace with the right labels for OpenShift AI, and assigned an edit role binding to my computate user to create workbenches:

    modelmesh-enabled: 'true'
    opendatahub.io/dashboard: 'true'
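
For reference, a minimal sketch of setting those labels with the Kubernetes Python client (the namespace name and the choice of the Python client instead of an `oc label` command are just for the example):

    from kubernetes import client, config

    config.load_kube_config()  # or load_incluster_config() when running in the cluster

    # Label the project namespace so the OpenShift AI dashboard picks it up.
    client.CoreV1Api().patch_namespace(
        "smart-village",
        {"metadata": {"labels": {
            "modelmesh-enabled": "true",
            "opendatahub.io/dashboard": "true",
        }}},
    )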

Inside an OpenShift AI project, you create a workbench. Once you can create workbenches, you can use the new Tensorflow Jupyter CUDA image I introduced into the test cluster. Give yourself a GPU, clone this repo https://github.com/Milstein/nerc_rhods_mlops, and run through the 02_model_training_basics.ipynb notebook to reproduce what I did.


schwesig commented 6 months ago

Comment / additional info: DCGM_FI_DEV_GPU_UTIL (GPU utilization) should be measured in %, but the graph shows more of a full jump between 0 and 4 GPUs. To check whether this is because of the GPU sharing feature or something else. Maybe if two workloads were running at the same time they would share?
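
If someone wants to look at the raw numbers behind that graph, here is a rough sketch of querying DCGM_FI_DEV_GPU_UTIL from the cluster monitoring Prometheus API with Python (the Thanos querier URL and token are placeholders to replace with the real values for the cluster):

    import requests

    # Placeholder values - replace with the real route and a token allowed to query metrics.
    PROMETHEUS_URL = "https://thanos-querier-openshift-monitoring.apps.example-cluster.example.com"
    TOKEN = "sha256~REPLACE_ME"

    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": "DCGM_FI_DEV_GPU_UTIL"},
        headers={"Authorization": f"Bearer {TOKEN}"},
        verify=False,  # only if the router certificate is not trusted locally
    )
    resp.raise_for_status()
    for result in resp.json()["data"]["result"]:
        labels = result["metric"]
        print(labels.get("gpu"), labels.get("exported_pod", labels.get("pod")), result["value"][1])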

joachimweyl commented 6 months ago

@schwesig how are the tests going?

schwesig commented 6 months ago

We were able to run the test described above. Unfortunately we couldn't add more tests; other topics came up in between.

To discuss: do we need more of these quantity tests? In general, we are happy with the successful test, so I am moving it to low priority.

More quality tests would be nice to have; everybody, feel free to join.

But for this current, urgent issue, I consider it to be successful.

Moving it to follow up.

schwesig commented 6 months ago

Follow-up for the nice-to-haves created: https://github.com/nerc-project/operations/issues/504

schwesig commented 6 months ago

Here is a test container: https://developer.nvidia.com/blog/monitoring-gpus-in-kubernetes-with-dcgm/

> Generating a load: To generate a load, you must first download DCGM and containerize it. The following script creates a container that can be used to run dcgmproftester. This container is available on the NVIDIA DockerHub repository.
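
For completeness, a sketch of launching a dcgmproftester load pod with the Kubernetes Python client, loosely following the pod example in that blog post (the image tag, namespace, and test parameters are assumptions to adjust, not something verified here):

    from kubernetes import client, config

    config.load_kube_config()

    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name="dcgmproftester", namespace="smart-village"),
        spec=client.V1PodSpec(
            restart_policy="OnFailure",
            containers=[client.V1Container(
                name="dcgmproftester",
                # Image tag is an assumption - use whichever dcgmproftester build you have.
                image="nvidia/samples:dcgmproftester-2.0.10-cuda11.0-ubuntu18.04",
                # -t 1004 generates a Tensor Core (fp16) load for -d seconds.
                args=["--no-dcgm-validation", "-t", "1004", "-d", "120"],
                resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "1"}),
            )],
        ),
    )
    client.CoreV1Api().create_namespaced_pod(namespace="smart-village", body=pod)
    print("dcgmproftester pod created; watch DCGM_FI_DEV_GPU_UTIL while it runs")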

schwesig commented 6 months ago

https://www.redhat.com/en/blog/a-guide-to-functional-and-performance-testing-of-the-nvidia-dgx-a100

schwesig commented 6 months ago

https://github.com/nerc-project/operations/issues/466