nerc-project / operations

Issues related to the operation of the NERC OpenShift environment

kruize project [2/4] - Dedicated project cluster (followup #580) #624

Closed schwesig closed 3 months ago

schwesig commented 4 months ago

follow up from

Details for this issue

just related, not a must do for this issue:

This project needs to run in a dedicated test cluster because


Project Overview

This research project focuses on optimizing GPU infrastructure usage through Kruize, a platform that tracks GPU usage for each container. By integrating with OpenShift Observability (Prometheus) and using cost and performance models, Kruize provides recommendations for GPU limits. These recommendations can be enforced by GPU time slice schedulers like Run:ai to enhance GPU utilization, aiming to lower costs and improve performance.
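As a rough illustration of the Prometheus side of this, here is a minimal sketch that builds a query URL against the standard Prometheus HTTP API (`/api/v1/query`). The base URL is a placeholder, and the choice of the `DCGM_FI_DEV_GPU_UTIL` metric (from NVIDIA's DCGM exporter) and the `pod` label are assumptions; the issue does not say which metrics Kruize actually reads.

```python
from urllib.parse import urlencode


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


# Assumption: GPU utilization is exported by the NVIDIA DCGM exporter
# and carries a `pod` label; adjust to whatever the cluster exposes.
PROMQL = "avg by (pod) (DCGM_FI_DEV_GPU_UTIL)"

# Hypothetical Prometheus endpoint, for illustration only.
url = build_query_url("https://prometheus.example.org/", PROMQL)
print(url)
```

On OpenShift the request would also need a bearer token for authentication, which is left out here.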

Goals:

Install Kruize with OpenShift AI to observe and model resource usage. Provide better resource usage defaults and configuration tuning for improved performance and cost efficiency.

Steps:

  1. Create users, project, and resources on NERC (done in https://github.com/nerc-project/operations/issues/580)
  2. Create an interim solution until a dedicated project cluster is available (https://github.com/nerc-project/operations/issues/623)
  3. Create a dedicated project cluster (THIS https://github.com/nerc-project/operations/issues/624)
  4. Support the tests: Running for 90+ days, switching GPU types, etc. (https://github.com/nerc-project/operations/issues/625)
  5. Closing the project: remove/archive cluster, GPU allocations, etc. (https://github.com/nerc-project/operations/issues/626)

CC

Dominika Oliver - doliver@redhat.com
Rebecca Whitworth - rsimmond@redhat.com
Dinakar Guniguntala - dgunigun@redhat.com

@ddoliver @rebeccaSimmonds19 @dinogun @shekhar316 @bharathappali @bhanvimenghani @kusumachalasani

@dystewart @schwesig @Milstein @tssala23


128.31.20.90 api.nerc-ocp-test-2.nerc.mghpcc.org
128.31.20.112 console-openshift-console.apps.nerc-ocp-test-2.nerc.mghpcc.org
128.31.20.112 oauth-openshift.apps.nerc-ocp-test-2.nerc.mghpcc.org
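
Assuming the addresses above are meant as `/etc/hosts`-style entries (IP first, then hostname), a small sketch like the following could turn them into a hostname-to-IP mapping, for example to double-check them before adding them locally. The `parse_hosts` helper is hypothetical and not part of any NERC tooling.

```python
def parse_hosts(text: str) -> dict[str, str]:
    """Parse /etc/hosts-style lines into a {hostname: ip} mapping."""
    mapping: dict[str, str] = {}
    for line in text.splitlines():
        # Drop comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            mapping[name] = ip
    return mapping


HOSTS = """\
128.31.20.90 api.nerc-ocp-test-2.nerc.mghpcc.org
128.31.20.112 console-openshift-console.apps.nerc-ocp-test-2.nerc.mghpcc.org
128.31.20.112 oauth-openshift.apps.nerc-ocp-test-2.nerc.mghpcc.org
"""

entries = parse_hosts(HOSTS)
for name, ip in entries.items():
    print(f"{name} -> {ip}")
```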

schwesig commented 4 months ago

waiting for? NESE? Template? ?

tssala23 commented 4 months ago

@schwesig For the cluster I am bringing up, I will be adding the overlay to the config repo and applying it to the cluster today. The NESE storage request has been completed; however, Hakan is currently working on accessing NESE from the MOC network (https://github.com/CCI-MOC/ops-issues/issues/1329), though I do not think that will take long. CC @hpdempsey

schwesig commented 4 months ago

@schwesig For the cluster I am bringing up, I will be adding the overlay to the config repo and applying it to the cluster today. The NESE storage request has been completed; however, Hakan is currently working on accessing NESE from the MOC network (CCI-MOC/ops-issues#1329), though I do not think that will take long. CC @hpdempsey

@tssala23 thanks for the update. Just wanted to know where we are at.

schwesig commented 4 months ago

Timeline for the testing

Wishlist from the Research Team

Desired GPUs:

Definition of GPU Performance:

GPU/Node Requirements:

schwesig commented 4 months ago

General information/current status:

V100: easily available and usable long-term
A100: available, but allocation must be limited to periods of extensive testing and requested only when needed
H100: not yet available, and the timeline is not stable enough to plan the tests around it yet

schwesig commented 4 months ago

To Do

2024-07-08 check PROD, nvidia-smi --> NVIDIA A100-SXM4-40GB
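
For repeating this check on other nodes, a minimal sketch that reads the GPU model names via `nvidia-smi --query-gpu=name --format=csv,noheader`. It assumes `nvidia-smi` is on the PATH of the node or pod where it runs; the helper names are made up for illustration.

```python
import subprocess


def parse_gpu_names(csv_output: str) -> list[str]:
    """Turn the one-name-per-line output into a list of GPU model names."""
    return [line.strip() for line in csv_output.splitlines() if line.strip()]


def query_gpu_names() -> list[str]:
    """Ask nvidia-smi for the GPU model names (requires an NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_names(result.stdout)


if __name__ == "__main__":
    try:
        for name in query_gpu_names():
            print(name)
    except (FileNotFoundError, subprocess.CalledProcessError) as exc:
        # No driver or no GPU visible from here.
        print(f"nvidia-smi not usable: {exc}")
```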

joachimweyl commented 4 months ago

@schwesig does this require 4 A100 nodes only or does it also require nodes to run a cluster that the A100 nodes are workers in?

schwesig commented 4 months ago

@schwesig does this require 4 A100 nodes only or does it also require nodes to run a cluster that the A100 nodes are workers in?

@joachimweyl It also requires additional nodes. At least that is the plan so far. Maybe the current or next round of testing will show something different. But as of now: nodes to run the cluster AND the A100 nodes.

joachimweyl commented 4 months ago

So 3 FC430s for the controllers, and the 4 A100s will be the workers. Do we need any other workers?

schwesig commented 4 months ago

@joachimweyl Taj is setting up a cluster overlay template

These are the reasons why they will be on their own dedicated cluster:

- Kruize is not recommended for production
- they require more recent versions of software than we have in production. OpenShift 4.13.29 and OpenShift AI 2.8.0
- the heavy workload testing can interfere with other projects
- the workload of other projects can interfere with the test results
- the team needs more global rights to slice the GPU nodes, which can interfere with other projects and configs

schwesig commented 3 months ago

The cluster is set up.
The project has been using it since yesterday.
In review: we are still waiting for 2 more users to be added, and leaving this open until the GPU node from PROD is removed.

schwesig commented 3 months ago

The dedicated cluster is running.
The interim project on PROD was deleted and cleaned up.
This can be closed.
Follow-up support continues in issue 3 of 4: https://github.com/nerc-project/operations/issues/625