nerc-project / operations

Issues related to the operation of the NERC OpenShift environment

Request from Red Hat for further GPU optimization (kruize project 0) #580

Closed hpdempsey closed 6 days ago

hpdempsey commented 1 month ago

This issue: Step 0. See follow-ups in Steps 1 to 4:

#623 - #624 - #625 - #626

Dominika Oliver doliver@redhat.com and Rebecca Whitworth rsimmond@redhat.com request a test project on the MOC.

Dominika and Rebecca are investigating options for AI optimizations with the OpenShift AI software. They would like to do some testing with two GPUs allocated in parallel in an OpenShift AI cluster. They want to enable MIG on one GPU and not the other, and run some experiments to follow up on the basic MIG testing we did earlier. They would be using Kruize and experimenting with different configurations and scenarios to see how workloads share the GPU and to determine optimal timeslicing configurations.

This project needs to run in a test cluster because Kruize is not recommended for production, and because they require more recent software versions than what we have in production: OpenShift 4.13.29 and OpenShift AI 2.8.0. They would like access to a cluster for 2 weeks for this experimentation, beginning as soon as possible. Assigning Thorsten and Dylan to investigate whether we can use the OpenShift AI "beta" cluster for this experiment and how quickly we could get resources allocated and the cluster built, assuming that we now have some proof of usability from what Dylan built for OPE. Please investigate and determine when the cluster could be available. GPU costs would come from Red Hat's current funding allocation, because this is the type of work we were expecting to do as part of the "beta" OpenShift AI cluster work.

Dominika Oliver - doliver@redhat.com Rebecca Whitworth - rsimmond@redhat.com Dinakar Guniguntala - dgunigun@redhat.com

hpdempsey commented 1 month ago

I told Dominika and Rebecca to request accounts and to have one of them act as PI on this project to get started.

hpdempsey commented 1 month ago

Added Milson so he is aware of the request for GPUs that will need to be attached to the test cluster for 2 weeks. Project owners are happy to work with any research or eng participants who are also interested in this topic.

hpdempsey commented 1 month ago

Please expedite the evaluation for this project request, as there is some urgency to the work from the OpenShift AI engineers.

dystewart commented 1 month ago

I think we have 2 things to consider here:

For the OpenShift RHOAI beta Cluster

@hpdempsey In order to get Dominika & Rebecca access as soon as they have accounts, we can onboard them to the nerc-ocp-test cluster while we build the rhoai beta cluster (where they would land after that rollout is complete). All we need is to grab a couple of GPUs from somewhere; not sure if we'd just want to borrow them from prod for now. @Milstein if you have thoughts on this. nerc-ocp-test is running OCP 4.15 with rhoai v2.8.2, so it sounds like that is sufficient for their use case.

schwesig commented 1 month ago

That sounds like a plan 👍

Milstein commented 1 month ago

Sounds good. We can onboard them to the ocp test cluster to test out.

hpdempsey commented 1 month ago

This all sounds good, and Thorsten is coordinating with Dominika and Rebecca. The GPUs should come from prod OpenShift when you are ready.

schwesig commented 1 month ago

Call with Dominika, Rebecca, Dinakar today:

Requirements:

Kruize keeps track of GPU usage for each container. The GPU usage statistics are provided by OpenShift Observability (Prometheus). We then use the cost and performance models to provide recommendations in terms of GPU limits. The GPU limits can then be used by GPU time slice schedulers such as Run:ai to enforce the limits and provide better utilization of GPUs.
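As a rough sketch of how those per-container GPU statistics could be pulled from OpenShift Observability: the snippet below builds a PromQL query against `DCGM_FI_DEV_GPU_UTIL`, the utilization gauge exported by NVIDIA dcgm-exporter. The endpoint URL, token, and label names are placeholders for illustration, not the actual NERC values.

```python
# Sketch: query per-pod GPU utilization from OpenShift Observability
# (Prometheus). URL and token below are placeholders, not real values.
import json
import urllib.parse
import urllib.request

PROM_URL = "https://prometheus.example/api/v1/query"  # placeholder endpoint
TOKEN = "sha256~REDACTED"                             # placeholder token

def build_gpu_util_query(namespace: str, window: str = "1h") -> str:
    """PromQL: average GPU utilization (%) per pod over `window`."""
    return (
        "avg by (exported_pod) ("
        f'avg_over_time(DCGM_FI_DEV_GPU_UTIL{{exported_namespace="{namespace}"}}[{window}])'
        ")"
    )

def query_prometheus(promql: str) -> dict:
    """Send the query to Prometheus (requires cluster access to run)."""
    url = PROM_URL + "?" + urllib.parse.urlencode({"query": promql})
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {TOKEN}"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)

query = build_gpu_util_query("kruize-test")
```

In practice Kruize would consume these series over the configured observation terms rather than a single ad-hoc query.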

- Install Kruize alongside OpenShift AI.
- Observe resource usage across all of the OpenShift AI related namespaces over a period of time (current observation term sizes = daily, weekly, fortnightly).
- Develop usage models based on observed data.
- Use these models to provide better resource usage defaults for both the OpenShift AI platform itself and the AI workloads.
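In its simplest form, the "develop usage models" step could look like a percentile-based sizing sketch. The 95th-percentile-plus-headroom heuristic below is an illustrative assumption, not Kruize's actual cost/performance model:

```python
# Illustrative sketch only: derive a GPU limit recommendation from observed
# utilization samples over one observation term (daily/weekly/fortnightly).
# The p95 + headroom heuristic is an assumption, not Kruize's real model.
from statistics import quantiles

def recommend_gpu_limit(samples: list[float], headroom: float = 1.2) -> float:
    """Recommend a GPU limit as a fraction of one GPU.

    samples: observed utilization values in [0, 1] for one container.
    """
    p95 = quantiles(samples, n=100)[94]        # 95th percentile of usage
    return min(1.0, round(p95 * headroom, 2))  # add headroom, cap at 1 GPU

# Example: a container that mostly idles around 30% with occasional bursts
obs = [0.25, 0.30, 0.28, 0.35, 0.60, 0.31, 0.29, 0.33, 0.27, 0.55]
limit = recommend_gpu_limit(obs)
```

The same function could be run once per term length to compare how the recommendation shifts between daily, weekly, and fortnightly windows.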

schwesig commented 1 month ago

- 2x V100 (wrk-88/89)
- 11x A100 (wrk-90 to 99 and 101)

schwesig commented 1 month ago

Project Overview

This research project focuses on optimizing GPU infrastructure usage through Kruize, a platform that tracks GPU usage for each container. By integrating with OpenShift Observability (Prometheus) and using cost and performance models, Kruize provides recommendations for GPU limits. These recommendations can be enforced by GPU time slice schedulers like Run:ai to enhance GPU utilization, aiming to lower costs and improve performance.

Key Components:

Observability:

Evaluate GPU metrics provided by Nvidia dcgm-exporter and compare with OpenShift Observability. Identify gaps and consider other vendor GPUs (e.g., AMD, Intel).
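One concrete starting point for that gap analysis is checking which of the standard dcgm-exporter gauges are actually scraped. The metric names below are standard dcgm-exporter ones; the `scraped` set is a stand-in for the result of a live label-values query against Prometheus:

```python
# Sketch of a gap check: which standard dcgm-exporter metrics are present
# in the observability stack? `scraped` stands in for a live query result.
EXPECTED_DCGM_METRICS = {
    "DCGM_FI_DEV_GPU_UTIL",     # GPU utilization (%)
    "DCGM_FI_DEV_FB_USED",      # framebuffer memory used (MiB)
    "DCGM_FI_DEV_POWER_USAGE",  # power draw (W)
    "DCGM_FI_DEV_SM_CLOCK",     # SM clock (MHz)
}

def missing_metrics(scraped: set[str]) -> set[str]:
    """Return expected DCGM metrics not present in the scraped set."""
    return EXPECTED_DCGM_METRICS - scraped

# Hypothetical result: only two of the four gauges are being scraped
gaps = missing_metrics({"DCGM_FI_DEV_GPU_UTIL", "DCGM_FI_DEV_FB_USED"})
```

A comparable expected-metric set would be needed for AMD or Intel GPUs if other vendors are brought in.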

Recommendation:

Based on GPU usage statistics over different terms (daily, weekly, fortnightly), recommendations are generated. Recommendations include specific container configurations and overall namespace suggestions.

Enforcement:

Use scheduling software like Run:ai for better time slicing of GPUs. The format of the recommendation JSON depends on the chosen enforcement software.
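Since the recommendation JSON format depends on the enforcement software chosen downstream, a hypothetical payload shape (all field names invented for illustration) might look like:

```python
# Hypothetical recommendation payload; field names are invented for
# illustration, since the real format depends on the enforcement software
# (e.g. Run:ai) chosen downstream.
import json

recommendation = {
    "namespace": "kruize-gpu-test",  # placeholder namespace
    "container": "training-job",     # placeholder container name
    "term": "weekly",                # daily | weekly | fortnightly
    "gpu": {
        "limit_fraction": 0.5,       # fraction of one GPU to allot
        "mig_profile": "2g.10gb",    # suggested A100 MIG slice, if MIG-enabled
    },
}

payload = json.dumps(recommendation, indent=2)
```

An adapter per enforcement backend could then translate this neutral shape into whatever the scheduler actually consumes.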

Possible Test Scenario

Goals:

Impact:

schwesig commented 1 month ago

Signing up for NERC: this week we will not be doing logistics, as the MGHPCC power shutdown is going to stop all workflows, including NERC's and FAS RC's.

schwesig commented 1 month ago

@dystewart what do you need to start, besides the maintenance time to be over? what can I or we do even during the shutdown? Configuration?

schwesig commented 1 month ago

Wishlist from the Research Team

Desired GPUs:

Definition of GPU Performance:

GPU/Node Requirements:

schwesig commented 1 month ago

@dystewart can you check the latest entry (wishlist above), and what we are able to manage out of the box, where/how fast?

schwesig commented 1 month ago

side note, not (yet?) relevant

hpdempsey commented 1 month ago

Rebecca submitted her request to MOC for this project: https://mghpcc.supportsystem.com/scp/tickets.php?id=9520 Milson is aware that her request is the same as this request for the test cluster. @schwesig, it sounds like there will be different GPUs needed to be attached to the cluster at different times, so you will have to request the lower-end ones at the right times. The A100 should be available soon. No due date for H100s yet, so that part of the experiment may need to wait.

dystewart commented 1 month ago

I'm working with Justin to get the A100 GPU we previously had in the test cluster back online (it wasn't properly added back after being borrowed for ESI). Hopefully that gets done today, and if not, then tomorrow.

@schwesig does the team have any software requirements aside from RHOAI that we need to get in place? Should we add those folks to this issue so they can get updates and add info here directly?

It wouldn't hurt to start building their project out in nerc-ocp-config/nerc-ocp-apps (if that's what they want) while we wait for the GPU and RHOAI cluster storage requests to be granted.

schwesig commented 1 month ago

@dystewart thanks. They are following this issue and are aware of it. But we can also assign them, too.

No other requirements than mentioned above, they will add the rest themselves.

I will talk to them tomorrow again about getting more involved in this issue and getting the config started.

schwesig commented 1 month ago

General information:

Different times during the testing need different GPU nodes to be pulled into the cluster.

schwesig commented 1 month ago

cc @dinogun @ddoliver @rebeccaSimmonds19

hpdempsey commented 1 month ago

H100s will probably be available in Sept. (BU is purchasing them.)

jtriley commented 1 month ago

@dystewart the wrk-10 host (with 4xA100 GPU cards) has been added back to the nerc-ocp-test cluster. See: https://github.com/nerc-project/operations/issues/589

Milstein commented 4 weeks ago

@hpdempsey: Is this a request for access to the NERC OCP test cluster? If yes, then we can't connect our current setup, based on registered users/PIs from RegApp and ColdFront, to manage the resource allocation on that cluster; these web services are only connected to the prod cluster setup.

Milstein commented 4 weeks ago

@hpdempsey @schwesig: We have approved the PI request for Rebecca. She will follow the normal process as outlined in our on-boarding documentation: https://nerc-project.github.io/nerc-docs/get-started/user-onboarding-on-NERC/ So future ColdFront allocation requests will be assigned to Heidi to review and approve.

schwesig commented 3 weeks ago

- having a status call with the project team
- getting everyone into ColdFront
- assigning members to the project (PI Rebecca)
- scaffolding basics (folders etc.) via GitHub in the NERC repo
- adding them to Data Science Projects
- timeline and plans for the workload

schwesig commented 3 weeks ago

create an account on NERC/ColdFront https://nerc-project.github.io/nerc-docs/get-started/user-onboarding-on-NERC/

Please create your project member (user) accounts to get access to the VPN, the (test) cluster, and the prod cluster.

@Milstein as announced in chat

schwesig commented 3 weeks ago

@shekhar316 here the link from the call https://github.com/OCP-on-NERC/docs/tree/main/architecture/observability/access-control-to-metrics

schwesig commented 3 weeks ago

@rebeccaSimmonds19 adding them to the project https://youtu.be/d8_49bg27is?si=zgcVHr7iMZI2f9A0&t=139

schwesig commented 3 weeks ago

feedback on resources per timeline

schwesig commented 2 weeks ago

These GPUs are NOT available for this project; we need to allocate/move other ones from PROD.

joachimweyl commented 2 weeks ago

@hpdempsey do we have known funding for this?

schwesig commented 2 weeks ago

> @hpdempsey do we have known funding for this?

Yes, see our meeting with Michael on Friday.

schwesig commented 1 week ago

The team has had access to the prod cluster since last week. Working on some access rights to manage MIG configs. Discussion is in the Slack channel forum-kruize:

https://redhat-internal.slack.com/archives/C04J5TX0UHZ/p1718703157963889?thread_ts=1718696079.255919&cid=C04J5TX0UHZ

schwesig commented 6 days ago

This issue: Step 0 - Initialize: done. See follow-ups in Steps 1 to 4:

#623 - #624 - #625 - #626