nerc-project / operations

Issues related to the operation of the NERC OpenShift environment

on hold- Allocate 4 GPU nodes for the test cluster NERC Project 408 KruizeOptimization #613

Closed schwesig closed 4 days ago

schwesig commented 2 weeks ago

Update:

Not needed anymore; interim testing is running on the prod cluster.

On hold until further notice.

Nothing to do yet.

Please allocate 4 GPU nodes for the test cluster required for the NERC Project 408: KruizeOptimization. This allocation is essential for the ongoing GPU optimization tests. The project aims to:

- Conduct AI optimizations using OpenShift AI software.
- Enable testing with both MIG-enabled and non-MIG GPUs.
- Experiment with different configurations to optimize GPU utilization and timeslicing.

This is an interim solution for the project on the test cluster, until their own (prod) cluster is ready to use.

Needed by: https://github.com/nerc-project/operations/issues/580

CC @larsks @dystewart @hpdempsey

joachimweyl commented 1 week ago

@schwesig is this still on hold?

dystewart commented 1 week ago

The team has confirmed they are able to access the GPUs through RHOAI after increasing their allocation in ColdFront.

joachimweyl commented 5 days ago

@dystewart is this still on hold?

joachimweyl commented 5 days ago

and just to confirm this is 4 different nodes that were allocated for https://github.com/nerc-project/operations/issues/595 correct?

schwesig commented 4 days ago

> and just to confirm this is 4 different nodes that were allocated for #595 correct?

Yes, unfortunately these are for another project. I thought we could use them.

BUT: the test cluster is not needed anymore. We got the team set up on the prod cluster for initial tests, until their own dedicated cluster is ready.

Closing this now.