nerc-project / operations

Issues related to the operation of the NERC OpenShift environment

follow up: New NVIDIA A100 GPUs - Quality Test #504

Closed schwesig closed 1 month ago

schwesig commented 3 months ago


Nice to have / follow-up for https://github.com/nerc-project/operations/issues/482. Feel free to participate in this testing and share your experiences and results. If you want or need help observing the tests, contact @schwesig.

We plan to run a quality test on the newly installed NVIDIA A100 GPUs using a newly developed "Tensorflow Jupyter CUDA" image, which is designed to test computing power within our OpenShift AI environment. The test focuses on GPU performance and compatibility, and will run 02_model_training_basics.ipynb on the new image.
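As a quick compatibility check before running the notebook, something like the following could be run in a notebook cell. This is a minimal sketch, not part of the image or of 02_model_training_basics.ipynb: the function name `describe_gpu_environment` and the report layout are hypothetical, and it assumes a TensorFlow 2.x build where `tf.config.experimental.get_device_details` is available.

```python
def describe_gpu_environment():
    """Report TensorFlow's CUDA build status and the GPUs it can see.

    Hypothetical helper for illustration; the keys of the report are
    an assumption, not something defined by the image.
    """
    report = {"tensorflow_version": None, "built_with_cuda": None, "gpus": []}
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is not installed in this environment.
        return report

    report["tensorflow_version"] = tf.__version__
    report["built_with_cuda"] = tf.test.is_built_with_cuda()
    for device in tf.config.list_physical_devices("GPU"):
        details = tf.config.experimental.get_device_details(device)
        # Fall back to the logical device name if no model name is reported.
        report["gpus"].append(details.get("device_name", device.name))
    return report


print(describe_gpu_environment())
```

An empty `gpus` list would suggest that the workbench was started without a GPU attached or that the CUDA libraries in the image do not match the driver on the node.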

This test does not need to be limited to this image or script. If anything is missing, or if there are other useful scripts or images, feel free to add them.

Test Objectives:

Test Environments:

Follow-up, nice to have:

Procedure:

  1. Conduct a series of TensorFlow jobs to test GPU performance.
  2. Monitor system stability, noting any crashes or errors.
  3. Deploy the "Tensorflow Jupyter CUDA" image on the Prod Cluster.
  4. Repeat the testing procedure on the Prod Cluster, ensuring consistency and reliability on different clusters.
  5. Document any issues encountered and the outcomes of the tests for both clusters.
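The run, monitor, and document steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not the procedure actually used: `run_matmul_job` and `record_run` are hypothetical names, the matrix multiplication stands in for whatever TensorFlow job is being tested, and the cluster label is a free-form string chosen by the tester.

```python
import json
import time


def run_matmul_job(size=4096, repeats=10):
    """Time repeated matrix multiplications on the first visible GPU.

    Hypothetical stand-in for a real TensorFlow job; returns the
    average seconds per multiplication.
    """
    import tensorflow as tf

    if not tf.config.list_physical_devices("GPU"):
        raise RuntimeError("no GPU visible to TensorFlow")
    with tf.device("/GPU:0"):
        a = tf.random.normal((size, size))
        b = tf.random.normal((size, size))
        tf.matmul(a, b).numpy()  # warm-up, excluded from the timing
        start = time.perf_counter()
        for _ in range(repeats):
            product = tf.matmul(a, b)
        product.numpy()  # copy back to host so queued GPU work has finished
    return (time.perf_counter() - start) / repeats


def record_run(cluster, job, **job_args):
    """Run a job and capture its timing or error as one record."""
    entry = {"cluster": cluster, "job": job.__name__,
             "ok": False, "seconds": None, "error": None}
    try:
        entry["seconds"] = job(**job_args)
        entry["ok"] = True
    except Exception as exc:
        # Keep the failure in the record instead of aborting the test series.
        entry["error"] = f"{type(exc).__name__}: {exc}"
    return entry


if __name__ == "__main__":
    # "prod" is only an example label; use whatever names the clusters have.
    print(json.dumps(record_run("prod", run_matmul_job), indent=2))
```

Running the same script on each cluster and collecting the JSON records gives a simple side-by-side view of timings and failures that can be pasted into the issue.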

This quality test aims to confirm that the new NVIDIA A100 GPUs are working and can be used for upcoming classes and projects.

schwesig commented 1 month ago

Closed in order to follow up on the more reusable idea in #534.