nerc-project / operations

Issues related to the operation of the NERC OpenShift environment

Request from Red Hat for further GPU optimization (kruize project 0) #580

Closed hpdempsey closed 6 days ago

hpdempsey commented 1 month ago

This issue: Step 0. See follow-ups in Steps 1 to 4:

#623 - #624 - #625 - #626

Dominika Oliver doliver@redhat.com and Rebecca Whitworth rsimmond@redhat.com request a test project on the MOC.

Dominika and Rebecca are investigating options for AI optimizations with the OpenShift AI software. They would like to do some testing with two GPUs allocated in parallel in an OpenShift AI cluster. They want to enable MIG on one GPU and not the other, and run some experiments to follow up on the basic MIG testing we did earlier. They would be using Kruize and experimenting with different configurations and scenarios to see how workloads share the GPU and to determine optimal timeslicing configurations.

This project needs to run in a test cluster because Kruize is not recommended for production, and because they require more recent software versions than what we have in production: OpenShift 4.13.29 and OpenShift AI 2.8.0. They would like access to a cluster for 2 weeks for this experimentation, beginning as soon as possible. Assigning Thorsten and Dylan to investigate whether we can use the OpenShift AI "beta" cluster for this experiment and how quickly we could get resources allocated and the cluster built, assuming that we now have some proof of usability from what Dylan built for OPE. Please investigate and determine when the cluster could be available. GPU costs would come from Red Hat's current funding allocation, because this is the type of work we were expecting to do as part of the "beta" OpenShift AI cluster work.

Dominika Oliver - doliver@redhat.com Rebecca Whitworth - rsimmond@redhat.com Dinakar Guniguntala - dgunigun@redhat.com

hpdempsey commented 1 month ago

I told Dominika and Rebecca to request accounts and to have one of them act as PI on this project to get started.

hpdempsey commented 1 month ago

Added Milson so he is aware of the request for GPUs that will need to be attached to the test cluster for 2 weeks. Project owners are happy to work with any research or eng participants who are also interested in this topic.

hpdempsey commented 1 month ago

Please expedite the evaluation for this project request, as there is some urgency to the work from the OpenShift AI engineers.

dystewart commented 1 month ago

I think we have 2 things to consider here:

For the OpenShift RHOAI beta Cluster

@hpdempsey In order to get Dominika & Rebecca access as soon as they have accounts, we can onboard them to the nerc-ocp-test cluster while we build the rhoai beta cluster (where they would land after that rollout is complete). All we need is to grab a couple of GPUs from somewhere; not sure if we'd just want to borrow them from prod for now. @Milstein if you have thoughts on this. nerc-ocp-test is running OCP 4.15 with rhoai v2.8.2, so it sounds like that is sufficient for their use case.

schwesig commented 1 month ago

That sounds like a plan 👍

Milstein commented 1 month ago

Sounds good. We can onboard them to the ocp test cluster to test out.

hpdempsey commented 1 month ago

This all sounds good, and Thorsten is coordinating with Dominika and Rebecca. The GPUs should come from prod OpenShift when you are ready.

schwesig commented 1 month ago

Call with Dominika, Rebecca, Dinakar today:

Requirements:

Kruize keeps track of GPU usage for each container. The GPU usage statistics are provided by OpenShift Observability (Prometheus). We then use the cost and performance models to provide recommendations in terms of GPU limits. The GPU limits can then be used by GPU time slice schedulers such as Run:ai to enforce the limits and provide better utilization of GPUs.
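As a rough sketch of how those per-container GPU statistics could be pulled from OpenShift Observability: the snippet below builds a PromQL query against `DCGM_FI_DEV_GPU_UTIL`, the utilization gauge exported by NVIDIA dcgm-exporter. The endpoint URL, token, and label names are placeholders for illustration, not the actual NERC values.

```python
# Sketch: query per-pod GPU utilization from OpenShift Observability
# (Prometheus). URL and token below are placeholders, not real values.
import json
import urllib.parse
import urllib.request

PROM_URL = "https://prometheus.example/api/v1/query"  # placeholder endpoint
TOKEN = "sha256~REDACTED"                             # placeholder token

def build_gpu_util_query(namespace: str, window: str = "1h") -> str:
    """PromQL: average GPU utilization (%) per pod over `window`."""
    return (
        "avg by (exported_pod) ("
        f'avg_over_time(DCGM_FI_DEV_GPU_UTIL{{exported_namespace="{namespace}"}}[{window}])'
        ")"
    )

def query_prometheus(promql: str) -> dict:
    """Send the query to Prometheus (requires cluster access to run)."""
    url = PROM_URL + "?" + urllib.parse.urlencode({"query": promql})
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {TOKEN}"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)

query = build_gpu_util_query("kruize-test")
```

In practice Kruize would consume these series over the configured observation terms rather than a single ad-hoc query.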

- Install Kruize alongside OpenShift AI.
- Observe resource usage across all of the OpenShift AI related namespaces over a period of time (current observation term sizes = daily, weekly, fortnightly).
- Develop usage models based on observed data.
- Use these models to provide better resource usage defaults for both the OpenShift AI platform itself and the AI workloads.
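In its simplest form, the "develop usage models" step could look like a percentile-based sizing sketch. The 95th-percentile-plus-headroom heuristic below is an illustrative assumption, not Kruize's actual cost/performance model:

```python
# Illustrative sketch only: derive a GPU limit recommendation from observed
# utilization samples over one observation term (daily/weekly/fortnightly).
# The p95 + headroom heuristic is an assumption, not Kruize's real model.
from statistics import quantiles

def recommend_gpu_limit(samples: list[float], headroom: float = 1.2) -> float:
    """Recommend a GPU limit as a fraction of one GPU.

    samples: observed utilization values in [0, 1] for one container.
    """
    p95 = quantiles(samples, n=100)[94]        # 95th percentile of usage
    return min(1.0, round(p95 * headroom, 2))  # add headroom, cap at 1 GPU

# Example: a container that mostly idles around 30% with occasional bursts
obs = [0.25, 0.30, 0.28, 0.35, 0.60, 0.31, 0.29, 0.33, 0.27, 0.55]
limit = recommend_gpu_limit(obs)
```

The same function could be run once per term length to compare how the recommendation shifts between daily, weekly, and fortnightly windows.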

schwesig commented 1 month ago

- 2x V100 (wrk-88/89)
- 11x A100 (wrk-90 to 99 and 101)

schwesig commented 1 month ago

Project Overview

This research project focuses on optimizing GPU infrastructure usage through Kruize, a platform that tracks GPU usage for each container. By integrating with OpenShift Observability (Prometheus) and using cost and performance models, Kruize provides recommendations for GPU limits. These recommendations can be enforced by GPU time slice schedulers like Run:ai to enhance GPU utilization, aiming to lower costs and improve performance.

Key Components:

Observability:

Evaluate GPU metrics provided by Nvidia dcgm-exporter and compare with OpenShift Observability. Identify gaps and consider other vendor GPUs (e.g., AMD, Intel).
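One concrete starting point for that gap analysis is checking which of the standard dcgm-exporter gauges are actually scraped. The metric names below are standard dcgm-exporter ones; the `scraped` set is a stand-in for the result of a live label-values query against Prometheus:

```python
# Sketch of a gap check: which standard dcgm-exporter metrics are present
# in the observability stack? `scraped` stands in for a live query result.
EXPECTED_DCGM_METRICS = {
    "DCGM_FI_DEV_GPU_UTIL",     # GPU utilization (%)
    "DCGM_FI_DEV_FB_USED",      # framebuffer memory used (MiB)
    "DCGM_FI_DEV_POWER_USAGE",  # power draw (W)
    "DCGM_FI_DEV_SM_CLOCK",     # SM clock (MHz)
}

def missing_metrics(scraped: set[str]) -> set[str]:
    """Return expected DCGM metrics not present in the scraped set."""
    return EXPECTED_DCGM_METRICS - scraped

# Hypothetical result: only two of the four gauges are being scraped
gaps = missing_metrics({"DCGM_FI_DEV_GPU_UTIL", "DCGM_FI_DEV_FB_USED"})
```

A comparable expected-metric set would be needed for AMD or Intel GPUs if other vendors are brought in.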

Recommendation:

Based on GPU usage statistics over different terms (daily, weekly, fortnightly), recommendations are generated. Recommendations include specific container configurations and overall namespace suggestions.

Enforcement:

Use scheduling software like Run:ai for better time slicing of GPUs. The format of the recommendation JSON depends on the chosen enforcement software.
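Since the recommendation JSON format depends on the enforcement software chosen downstream, a hypothetical payload shape (all field names invented for illustration) might look like:

```python
# Hypothetical recommendation payload; field names are invented for
# illustration, since the real format depends on the enforcement software
# (e.g. Run:ai) chosen downstream.
import json

recommendation = {
    "namespace": "kruize-gpu-test",  # placeholder namespace
    "container": "training-job",     # placeholder container name
    "term": "weekly",                # daily | weekly | fortnightly
    "gpu": {
        "limit_fraction": 0.5,       # fraction of one GPU to allot
        "mig_profile": "2g.10gb",    # suggested A100 MIG slice, if MIG-enabled
    },
}

payload = json.dumps(recommendation, indent=2)
```

An adapter per enforcement backend could then translate this neutral shape into whatever the scheduler actually consumes.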

Possible Test Scenario

Goals:

Impact:

schwesig commented 1 month ago

Signing up for NERC: this week we will not be doing logistics, as the MGHPCC power shutdown is going to stop all workflows, including NERC's and FAS RC's.

schwesig commented 1 month ago

@dystewart what do you need to start, besides the maintenance time to be over? what can I or we do even during the shutdown? Configuration?

schwesig commented 1 month ago

Wishlist from the Research Team

Desired GPUs:

Definition of GPU Performance:

GPU/Node Requirements:

schwesig commented 1 month ago

@dystewart can you check the latest entry (wishlist above), and what we are able to manage out of the box, where/how fast?

schwesig commented 1 month ago

side note, not (yet?) relevant

hpdempsey commented 1 month ago

Rebecca submitted her request to MOC for this project: https://mghpcc.supportsystem.com/scp/tickets.php?id=9520 Milson is aware that her request is the same as this request for the test cluster. @schwesig, it sounds like there will be different GPUs needed to be attached to the cluster at different times, so you will have to request the lower-end ones at the right times. The A100 should be available soon. No due date for H100s yet, so that part of the experiment may need to wait.

dystewart commented 1 month ago

I'm working with Justin to get the A100 GPU we previously had in the test cluster back online (it wasn't properly added back after being borrowed for ESI). Hopefully that gets done today, and if not, then tomorrow.

@schwesig does the team have any software requirements aside from RHOAI that we need to get in place? Should we add those folks to this issue so they can get updates and add info here directly?

It wouldn't hurt to start building their project out in nerc-ocp-config/nerc-ocp-apps (if that's what they want) while we wait for the GPU and RHOAI cluster storage requests to be granted.

schwesig commented 1 month ago

@dystewart thanks. They are following this issue and are aware of it. But we can also assign them, too.

No other requirements than mentioned above, they will add the rest themselves.

I will talk to them tomorrow again about getting more involved in this issue and getting the config started.

schwesig commented 1 month ago

General information:

Different times during the testing need different GPU nodes to be pulled into the cluster.

schwesig commented 1 month ago

cc @dinogun @ddoliver @rebeccaSimmonds19

hpdempsey commented 1 month ago

H100s will probably be available in Sept. (BU is purchasing them.)

jtriley commented 1 month ago

@dystewart the wrk-10 host (with 4xA100 GPU cards) has been added back to the nerc-ocp-test cluster. See: https://github.com/nerc-project/operations/issues/589

Milstein commented 4 weeks ago

@hpdempsey: Is this a request for access to the NERC OCP test cluster? If yes, then we can't connect our current setup, based on registered users/PIs from RegApp and ColdFront, to manage the resource allocation on that cluster; these web services are only connected to the prod cluster setup.

Milstein commented 4 weeks ago

@hpdempsey @schwesig: We have approved the PI request for Rebecca. She will follow the normal process as outlined in our on-boarding documentation: https://nerc-project.github.io/nerc-docs/get-started/user-onboarding-on-NERC/ So future ColdFront allocation requests will be assigned to Heidi to review and approve.

schwesig commented 3 weeks ago

- having a status call with the project team
- getting everyone into ColdFront
- assigning members to the project (PI Rebecca)
- scaffolding basics (folders etc.) via GitHub in the NERC repo
- adding them to Data Science Projects
- timeline and plans for the workload

schwesig commented 3 weeks ago

create an account on NERC/ColdFront https://nerc-project.github.io/nerc-docs/get-started/user-onboarding-on-NERC/

Please create your project member (user) accounts to get access to the VPN, the (test) cluster, and the prod cluster.

@Milstein as announced in chat

schwesig commented 3 weeks ago

@shekhar316 here the link from the call https://github.com/OCP-on-NERC/docs/tree/main/architecture/observability/access-control-to-metrics

schwesig commented 3 weeks ago

@rebeccaSimmonds19 adding them to the project https://youtu.be/d8_49bg27is?si=zgcVHr7iMZI2f9A0&t=139

schwesig commented 3 weeks ago

feedback on resources per timeline

schwesig commented 2 weeks ago

These GPUs are NOT available for this project; we need to allocate/move other ones from PROD.

joachimweyl commented 2 weeks ago

@hpdempsey do we have known funding for this?

schwesig commented 2 weeks ago

> @hpdempsey do we have known funding for this?

Yes, see our meeting with Michael on Friday.

schwesig commented 1 week ago

The team has had access to the prod cluster since last week. Working on some access rights to manage MIG configs. Discussion is in the Slack channel forum-kruize:

https://redhat-internal.slack.com/archives/C04J5TX0UHZ/p1718703157963889?thread_ts=1718696079.255919&cid=C04J5TX0UHZ

schwesig commented 6 days ago

This issue: Step 0 - Initialize: done. See follow-ups in Steps 1 to 4:

#623 - #624 - #625 - #626