Closed by schwesig 6 months ago
/cc @computate @dystewart @hpdempsey
I created a smart-village project and namespace with the right labels to use OpenShift AI, and assigned an edit role binding to my computate user so it can create workbenches.
modelmesh-enabled: 'true'
opendatahub.io/dashboard: 'true'
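For anyone who wants to reproduce the setup, here is a minimal sketch of what the namespace and role binding might look like as manifests. The two labels, the namespace name, the user name, and the edit role come from the comment above; the binding name `computate-edit` and the choice to build a plain Namespace (rather than going through a project request) are assumptions for illustration only.

```python
import json


def build_namespace(name: str) -> dict:
    """Namespace manifest carrying the two labels quoted above."""
    return {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {
            "name": name,
            "labels": {
                # Label values must be strings, hence "true" and not True.
                "modelmesh-enabled": "true",
                "opendatahub.io/dashboard": "true",
            },
        },
    }


def build_edit_binding(namespace: str, user: str) -> dict:
    """RoleBinding that grants the built-in 'edit' ClusterRole to one user."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        # The binding name is hypothetical; any valid name would do.
        "metadata": {"name": f"{user}-edit", "namespace": namespace},
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "ClusterRole",
            "name": "edit",
        },
        "subjects": [
            {
                "apiGroup": "rbac.authorization.k8s.io",
                "kind": "User",
                "name": user,
            }
        ],
    }


# Wrap both objects in a v1 List so they form a single JSON document.
manifest_list = {
    "apiVersion": "v1",
    "kind": "List",
    "items": [
        build_namespace("smart-village"),
        build_edit_binding("smart-village", "computate"),
    ],
}

print(json.dumps(manifest_list, indent=2))
```

The printed JSON is only meant for review; how it gets applied to the cluster is up to whoever runs it.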
Inside an OpenShift AI project, you create a workbench. Once you can create workbenches, you can use the new Tensorflow Jupyter CUDA image I introduced into the test cluster. Give yourself a GPU, then clone this repo, https://github.com/Milstein/nerc_rhods_mlops, and run through the 02_model_training_basics.ipynb notebook to do what I did.
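Before running the notebook, it may help to confirm that the workbench actually sees the GPU. A minimal sketch, assuming TensorFlow is installed in the image (the helper name `visible_gpus` is made up for this example; it returns an empty list if TensorFlow cannot be imported):

```python
def visible_gpus() -> list:
    """Return the names of the GPUs TensorFlow can see, or [] if none."""
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is not available in this environment.
        return []
    return [device.name for device in tf.config.list_physical_devices("GPU")]


gpus = visible_gpus()
print(f"TensorFlow sees {len(gpus)} GPU(s): {gpus}")
```

If the list is empty inside a workbench that was given a GPU, the image or the GPU claim is worth checking before starting the training notebook.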
Additional info: DCGM_FI_DEV_GPU_UTIL (GPU utilization, in %) should report a percentage, but the graph looks more like a full jump between 0 and 4 GPUs. To check: is this because of the GPU sharing feature or something else? Maybe if 2 workloads were running at the same time, they would share a GPU?
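One possible explanation to rule out (a guess, not a diagnosis): a panel that counts busy GPUs moves in whole steps, while a panel that averages DCGM_FI_DEV_GPU_UTIL shows a percentage. The sketch below parses metrics in the Prometheus text format and computes both. The sample values, the host name, and the 50% "busy" threshold are invented for illustration and are not taken from the test cluster.

```python
import re

# Made-up sample in Prometheus text exposition format (not real cluster data).
SAMPLE = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-a"} 97
DCGM_FI_DEV_GPU_UTIL{gpu="1",Hostname="node-a"} 0
DCGM_FI_DEV_GPU_UTIL{gpu="2",Hostname="node-a"} 100
DCGM_FI_DEV_GPU_UTIL{gpu="3",Hostname="node-a"} 0
"""

METRIC_LINE = re.compile(
    r"^DCGM_FI_DEV_GPU_UTIL\{(?P<labels>[^}]*)\}\s+(?P<value>[0-9.]+)$"
)
GPU_LABEL = re.compile(r'(?:^|,)gpu="([^"]*)"')


def parse_util(text: str) -> dict:
    """Map GPU index (as a string) to its utilization in percent."""
    util = {}
    for line in text.splitlines():
        match = METRIC_LINE.match(line.strip())
        if not match:
            continue  # skips comments and unrelated metrics
        gpu = GPU_LABEL.search(match.group("labels"))
        if gpu:
            util[gpu.group(1)] = float(match.group("value"))
    return util


def summarize(util: dict, busy_threshold: float = 50.0) -> dict:
    """Compare a 'number of busy GPUs' view with a 'mean utilization' view."""
    busy = sum(1 for value in util.values() if value >= busy_threshold)
    mean = sum(util.values()) / len(util) if util else 0.0
    return {"busy_gpus": busy, "mean_util_pct": mean}


print(summarize(parse_util(SAMPLE)))
```

If the dashboard query sums or counts over GPUs instead of showing the per-GPU gauge, it would produce the step-like graph described above even without GPU sharing.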
@schwesig how are the tests going?
We were able to run the test described above. Unfortunately, we couldn't add more tests because other topics came up in between.
To discuss: do we need more of these quantity tests? In general, we are happy with the successful test, so I am moving it to low priority.
More quality tests would be nice to have; everybody, feel free to join.
But for this current, urgent issue, I consider it successful.
Moving it to follow up.
Follow-up issue for the nice-to-haves created: https://github.com/nerc-project/operations/issues/504
Here is a test container: https://developer.nvidia.com/blog/monitoring-gpus-in-kubernetes-with-dcgm/

Generating a load
To generate a load, you must first download DCGM and containerize it. The following script creates a container that can be used to run dcgmproftester. This container is available on the NVIDIA DockerHub repository.
New NVIDIA A100 GPUs - Quality Test
We are planning to conduct a quantity test for the newly installed NVIDIA A100 GPUs by spinning up 200 RHODS images with GPU claims. We are also planning a quality test for these GPUs by running a newly developed "Tensorflow Jupyter CUDA" image, designed to test the computing power within our OpenShift AI environment, with a focus on performance and compatibility. The quality test will use this image together with the 02_model_training_basics.ipynb notebook.
This test does not have to be limited to this image/script. If anything is missing, or if there are other useful scripts or images, feel free to add them.
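The quantity test mentioned above (200 RHODS images with GPU claims) could be sketched roughly as follows. This is a simplification under several assumptions: it generates plain Pod manifests rather than real workbenches, the pod name prefix and the image reference are placeholders, and the `smart-village` namespace is reused only as an example. The `nvidia.com/gpu` limit is the standard way to claim one GPU per container.

```python
import json

# Placeholder image reference; replace with the real image in the cluster.
PLACEHOLDER_IMAGE = "example.invalid/tensorflow-jupyter-cuda:latest"


def gpu_pod(index: int, image: str, namespace: str = "smart-village") -> dict:
    """Pod manifest that claims exactly one GPU."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            # Hypothetical naming scheme, zero-padded so names sort nicely.
            "name": f"gpu-quantity-test-{index:03d}",
            "namespace": namespace,
        },
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "workbench",
                    "image": image,
                    "resources": {"limits": {"nvidia.com/gpu": 1}},
                }
            ],
        },
    }


pods = [gpu_pod(i, PLACEHOLDER_IMAGE) for i in range(200)]
total_gpus = sum(
    pod["spec"]["containers"][0]["resources"]["limits"]["nvidia.com/gpu"]
    for pod in pods
)
print(f"Generated {len(pods)} pod manifests claiming {total_gpus} GPUs in total")
print(json.dumps(pods[0], indent=2))
```

Pods that cannot get a GPU would stay Pending, so the number of pods that reach Running gives a rough count of GPUs that can actually be claimed.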
Test Objectives:
Test Environments:
follow up, nice to have in https://github.com/nerc-project/operations/issues/504:
Procedure:
This quality test aims to confirm that the new NVIDIA A100 GPUs are working and can be used for upcoming classes and projects.