nerc-project / operations

Issues related to the operation of the NERC OpenShift environment

kruize project 2 - Dedicated project cluster (followup #580) #624

Open schwesig opened 4 days ago

schwesig commented 4 days ago

follow up from

Details for this issue

This project needs to run in a dedicated test cluster because


Project Overview

This research project focuses on optimizing GPU infrastructure usage with Kruize, a platform that tracks GPU usage per container. By integrating with OpenShift Observability (Prometheus) and applying cost and performance models, Kruize provides recommendations for GPU limits. These recommendations can be enforced by GPU time-slicing schedulers such as Run:ai to improve GPU utilization, with the aim of lowering costs and improving performance.
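
As a rough illustration of the kind of recommendation described above, the sketch below derives a GPU limit from per-container utilization samples by taking a high percentile and adding some headroom. This is not Kruize's actual model or API: the function name `recommend_gpu_limit`, the percentile, the headroom factor, and the sample data are all assumptions made for illustration.

```python
import math


def recommend_gpu_limit(samples, percentile=95.0, headroom=0.10):
    """Suggest a GPU share (0-100 %) from utilization samples given in percent.

    Hypothetical helper for illustration only; not part of Kruize.
    """
    if not samples:
        raise ValueError("need at least one utilization sample")
    ordered = sorted(samples)
    # Nearest-rank percentile: the smallest sample such that at least
    # `percentile` percent of all samples are at or below it.
    rank = max(1, math.ceil(percentile * len(ordered) / 100.0))
    observed = ordered[rank - 1]
    # Add headroom on top of the observed value, capped at a full GPU.
    return min(100.0, round(observed * (1.0 + headroom), 1))


# Made-up utilization samples (percent), e.g. as scraped from Prometheus.
usage_by_container = {
    "trainer": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
    "inference": [50, 50, 50, 50],
}

recommendations = {
    name: recommend_gpu_limit(samples, percentile=90.0)
    for name, samples in usage_by_container.items()
}
```

A percentile is used instead of the maximum so that a single short spike does not dominate the suggested limit; a real cost and performance model would weigh more inputs than this.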

Goals:

Install Kruize with OpenShift AI to observe and model resource usage. Provide better resource usage defaults and configuration tuning for improved performance and cost efficiency.

Steps:

  1. Create users, project, and resources on NERC (done in https://github.com/nerc-project/operations/issues/580)
  2. Create an interim solution until a dedicated project cluster is available (https://github.com/nerc-project/operations/issues/623)
  3. Create a dedicated project cluster (THIS https://github.com/nerc-project/operations/issues/624)
  4. Support the tests: Running for 90+ days, switching GPU types, etc. (https://github.com/nerc-project/operations/issues/625)
  5. Closing the project: remove/archive cluster, GPU allocations, etc. (https://github.com/nerc-project/operations/issues/626)

CC

Dominika Oliver - doliver@redhat.com
Rebecca Whitworth - rsimmond@redhat.com
Dinakar Guniguntala - dgunigun@redhat.com

@ddoliver @rebeccaSimmonds19 @dinogun @shekhar316 @bharathappali @bhanvimenghani @kusumachalasani

@dystewart @schwesig @Milstein @tssala23

schwesig commented 4 days ago

What are we waiting for? NESE? Template?

tssala23 commented 4 days ago

@schwesig For the cluster I am bringing up, I will add the overlay to the config repo and apply it to the cluster today. The NESE storage request has been completed; however, Hakan is still working on access to NESE from the MOC network (https://github.com/CCI-MOC/ops-issues/issues/1329), though I do not think that will take long. CC @hpdempsey

schwesig commented 4 days ago

@tssala23 thanks for the update. Just wanted to know where we are at.

schwesig commented 4 days ago

Timeline for the testing

Wishlist from the Research Team

Desired GPUs:

Definition of GPU Performance:

GPU/Node Requirements:

schwesig commented 4 days ago

General information/current status:

V100: readily available, including for long-term usage
A100: available, but allocations must be limited to periods of extensive testing and requested only when needed
H100: not yet available, and the timeline is not stable enough to plan tests around yet