schwesig commented 4 months ago

follow up from

https://github.com/nerc-project/operations/issues/580
PU costs would come from Red Hat's current funding allocation because this is the type of work we were expecting to do as part of the "beta" OpenShift AI cluster work.

Details for this issue

- [x] https://github.com/CCI-MOC/ops-issues/issues/1329
- [x] add the overlay to the config repo https://github.com/OCP-on-NERC/nerc-ocp-config/pull/460
- [x] add team members to the GitHub team for console access https://console-openshift-console.apps.nerc-ocp-test-2.nerc.mghpcc.org/ https://github.com/nerc-project/operations/issues/645
- [x] check that all needed rights/functions are available for the project
  - [x] #650
- [x] storage is connected and accessible (test-pv)
- [x] create storage/set size
- [x] https://github.com/nerc-project/operations/issues/622
- [x] add GPU node to test-2 (remove from prod afterwards) https://github.com/nerc-project/operations/issues/646
- [x] #654
  - [x] stay managed (be aware of parallel config areas)
  - [x] do not use GitOps (be aware that the setup cannot be reproduced, rolled back, recreated, ...)
  - [x] https://github.com/nerc-project/operations/issues/654
- [x] https://github.com/nerc-project/operations/issues/663

just related, not a must do for this issue:

- [x] https://github.com/nerc-project/operations/issues/665

This project needs to run in a dedicated test cluster because

- Kruize is not recommended for production
- the team requires more recent versions of software than we have in production: OpenShift 4.13.29 and OpenShift AI 2.8.0
- the heavy workload testing can interfere with other projects
- the workload of other projects can interfere with the test results
- the team needs more global rights to slice the GPU nodes, which can interfere with other projects and configs

Project Overview

This research project focuses on optimizing GPU infrastructure usage through Kruize, a platform that tracks GPU usage for each container. By integrating with OpenShift Observability (Prometheus) and using cost and performance models, Kruize provides recommendations for GPU limits. These recommendations can be enforced by GPU time slice schedulers like Run:ai to enhance GPU utilization, aiming to lower costs and improve performance.

Goals:

Install Kruize with OpenShift AI to observe and model resource usage. Provide better resource usage defaults and configuration tuning for improved performance and cost efficiency.

Steps:

Create users, project, and resources on NERC (done in https://github.com/nerc-project/operations/issues/580)
Create an interim solution until a dedicated project cluster is available (https://github.com/nerc-project/operations/issues/623)
Create a dedicated project cluster (THIS https://github.com/nerc-project/operations/issues/624)
Support the tests: Running for 90+ days, switching GPU types, etc. (https://github.com/nerc-project/operations/issues/625)
Closing the project: remove/archive cluster, GPU allocations, etc. (https://github.com/nerc-project/operations/issues/626)

CC

Dominika Oliver - doliver@redhat.com
Rebecca Whitworth - rsimmond@redhat.com
Dinakar Guniguntala - dgunigun@redhat.com

@ddoliver @rebeccaSimmonds19 @dinogun @shekhar316 @bharathappali @bhanvimenghani @kusumachalasani

@dystewart @schwesig @Milstein @tssala23

128.31.20.90 api.nerc-ocp-test-2.nerc.mghpcc.org
128.31.20.112 console-openshift-console.apps.nerc-ocp-test-2.nerc.mghpcc.org
128.31.20.112 oauth-openshift.apps.nerc-ocp-test-2.nerc.mghpcc.org

schwesig commented 4 months ago

What are we waiting for? NESE? The template?

tssala23 commented 4 months ago

@schwesig For the cluster I am bringing up, I will be adding the overlay to the config repo and applying it to the cluster today. The NESE storage request has been completed; however, Hakan is currently working on accessing NESE from the MOC network (https://github.com/CCI-MOC/ops-issues/issues/1329), though I do not think that will take long. CC @hpdempsey

schwesig commented 4 months ago

> @schwesig For the cluster I am bringing up, I will be adding the overlay to the config repo and applying it to the cluster today. The NESE storage request has been completed; however, Hakan is currently working on accessing NESE from the MOC network (CCI-MOC/ops-issues#1329), though I do not think that will take long. CC @hpdempsey

@tssala23 thanks for the update. Just wanted to know where we are at.

schwesig commented 4 months ago

Timeline for the testing

We plan to provide recommendations for short-, medium-, and long-term usage, which currently translates to
- 1 day
- 7 days
- 15 days of usage
- this can, however, change to custom definitions, such as long term = 90 days

This means that we will have to run loads for at least 15 days continuously (better 90), so that we can gather metrics for these periods and provide corresponding recommendations. So we have a test bench that can run continuously for 15 days, generate various benchmark load conditions, and gather metrics during the entire period.
We then use those metrics to analyze and provide recommendations.
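
To make the link between term lengths and the minimum test duration concrete, here is a minimal Python sketch. The names `TERMS`, `covered_terms`, and `required_run_length` are made up for illustration and are not part of Kruize; only the 1/7/15-day values and the optional 90-day long term come from the timeline above.

```python
from datetime import timedelta

# Term lengths as stated in the timeline. The names are illustrative only.
TERMS = {
    "short": timedelta(days=1),
    "medium": timedelta(days=7),
    "long": timedelta(days=15),
}


def covered_terms(run_length, terms=TERMS):
    """Return the term names whose window fits inside a run of this length."""
    return [name for name, window in terms.items() if run_length >= window]


def required_run_length(terms=TERMS):
    """The load must run at least as long as the longest term window."""
    return max(terms.values())


if __name__ == "__main__":
    # Custom definition mentioned above: long term = 90 days.
    custom = {**TERMS, "long": timedelta(days=90)}
    print(required_run_length())        # minimum with the current terms
    print(required_run_length(custom))  # minimum with long term = 90 days
    print(covered_terms(timedelta(days=10)))
```

With the current definitions, a run shorter than 15 days can only back short- and medium-term recommendations, which is why the load has to run uninterrupted for the full long-term window.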

Wishlist from the Research Team

Desired GPUs:

V100
A100
H100

Definition of GPU Performance:

Low-Performance GPUs: Used throughout the quarter
High-Performance GPUs: Used intensively for a few weeks during the quarter
- A100 is the staple (and minimum) GPU needed for most testing
- H100 will be used sparingly whenever available

GPU/Node Requirements:

A single cluster with a minimum of 3 GPU nodes (minimum configuration)
Ideally, two such clusters
One desired configuration: a single node having 4 GPU cards connected using NVLink

schwesig commented 4 months ago

General information/current status:

- V100: easily available and up for long-term usage
- A100: available, but allocation must be limited to extensive testing times and only get involved when needed
- H100: not yet available, and the timeline is not stable enough to plan for the tests yet

schwesig commented 4 months ago

To Do

[x] check the RAM of the A100
[x] which versions of the A100 do we have
- [x] NVIDIA A100-SXM4-40GB @dystewart @tssala23 do we have documentation or a log about this, or is it an ad-hoc live report with the nvidia-smi command?

2024-07-08 check PROD, nvidia-smi --> NVIDIA A100-SXM4-40GB
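
If the ad-hoc check should leave a log behind, one option is to capture the `nvidia-smi` query output in a small script. The following is a rough sketch, not an existing NERC tool: the function names are hypothetical, and it assumes it is run on (or inside a pod scheduled to) a GPU node where `nvidia-smi` is on the PATH.

```python
import subprocess


def parse_gpu_inventory(csv_text):
    """Parse 'name, memory.total' CSV rows into (name, memory) tuples."""
    gpus = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        # Split only on the first comma: model name, then total memory.
        name, memory = (field.strip() for field in line.split(",", 1))
        gpus.append((name, memory))
    return gpus


def query_gpu_inventory():
    """Run nvidia-smi on the current node and return the parsed inventory."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_inventory(result.stdout)


if __name__ == "__main__":
    for name, memory in query_gpu_inventory():
        print(f"{name}: {memory}")
```

Keeping the parsing separate from the `subprocess` call means saved output can be re-parsed later without access to the node.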

joachimweyl commented 4 months ago

@schwesig does this require 4 A100 nodes only or does it also require nodes to run a cluster that the A100 nodes are workers in?

schwesig commented 4 months ago

> @schwesig does this require 4 A100 nodes only or does it also require nodes to run a cluster that the A100 nodes are workers in?

@joachimweyl It also requires additional nodes. At least that is the plan so far. Maybe the current or next testing will show something different. But as of now: nodes to run the cluster AND A100 nodes.

joachimweyl commented 4 months ago

So 3 FC430s for the controllers and the 4 A100s will be the workers. Do we need any other workers?

schwesig commented 4 months ago

@joachimweyl Taj is setting up a cluster overlay template:

https://github.com/OCP-on-NERC/nerc-ocp-config/pull/460

The project will have some control over how to set it up. Also, the GPU nodes will be switched out to a different version to create more diverse GPU test results. In the beginning we will start in Taj's project space.

These are the reasons why they will be on their own dedicated cluster:

- Kruize is not recommended for production
- they require more recent versions of software than we have in production: OpenShift 4.13.29 and OpenShift AI 2.8.0
- the heavy workload testing can interfere with other projects
- the workload of other projects can interfere with the test results
- the team needs more global rights to slice the GPU nodes, which can interfere with other projects and configs

schwesig commented 3 months ago

- cluster is set up
- project has been using it since yesterday
- in review: we are still waiting for 2 more users to be added, and we are leaving this open until the GPU node from PROD is removed

schwesig commented 3 months ago

https://github.com/OCP-on-NERC/nerc-ocp-config/pull/482

schwesig commented 3 months ago

https://github.com/nerc-project/operations/issues/663

schwesig commented 3 months ago

- dedicated cluster is running
- interim project on PROD was deleted and cleaned up
- this can be closed
- follow-up support in issue 3 of 4: https://github.com/nerc-project/operations/issues/625

nerc-project / operations

kruize project [2/4] - Dedicated project cluster (followup #580) #624
