kruize project 1 - Interim solution on prod cluster (followup #580)

schwesig commented 4 days ago

Follow-up from https://github.com/nerc-project/operations/issues/580

Details for this issue

[x] coldfront project activated, resources and users assigned
[ ] pre-test and pre-setup (template for dedicated cluster) can be done

Project Overview

This research project focuses on optimizing GPU infrastructure usage through Kruize, a platform that tracks GPU usage for each container. By integrating with OpenShift Observability (Prometheus) and using cost and performance models, Kruize provides recommendations for GPU limits. These recommendations can be enforced by GPU time slice schedulers like Run:ai to enhance GPU utilization, aiming to lower costs and improve performance.

Goals:

Install Kruize with OpenShift AI to observe and model resource usage. Provide better resource usage defaults and configuration tuning for improved performance and cost efficiency.

Steps:

Create users, project, and resources on NERC (done in https://github.com/nerc-project/operations/issues/580)
Create an interim solution until a dedicated project cluster is available (THIS https://github.com/nerc-project/operations/issues/623)
Create a dedicated project cluster (https://github.com/nerc-project/operations/issues/624)
Support the tests: Running for 90+ days, switching GPU types, etc. (https://github.com/nerc-project/operations/issues/625)
Closing the project: remove/archive cluster, GPU allocations, etc. (https://github.com/nerc-project/operations/issues/626)

CC

Dominika Oliver - doliver@redhat.com
Rebecca Whitworth - rsimmond@redhat.com
Dinakar Guniguntala - dgunigun@redhat.com

@ddoliver @rebeccaSimmonds19 @dinogun @shekhar316 @bharathappali @bhanvimenghani @kusumachalasani

@dystewart @schwesig @Milstein @tssala23

schwesig commented 4 days ago

https://redhat-internal.slack.com/archives/C04J5TX0UHZ/p1718696079255919

We currently can't query hardware details (oc get nodes), apply MIG configurations for hardware/GPU partitioning, or observe data flow across all namespaces. We can access the resources in our current namespace, but applying MIG configs requires access to other namespaces as well.
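To make the missing permissions concrete, here is a minimal sketch that asks the cluster which of these operations the current user may perform, using `oc auth can-i`. The list of checks is an assumption derived from the comment above: the `patch nodes` check assumes MIG profiles are selected through node labels, and the pod check in nvidia-gpu-operator only stands in for "observing" another namespace. If `oc` is not on the PATH, the script just prints the commands it would run.

```python
import shutil
import subprocess

# Checks derived from the comment above. The exact verbs/resources are
# assumptions for illustration, not a confirmed list of what Kruize needs.
CHECKS = [
    ("list", "nodes", None),                  # equivalent of `oc get nodes` (cluster-scoped)
    ("patch", "nodes", None),                 # assumption: MIG profile chosen via node labels
    ("list", "pods", "nvidia-gpu-operator"),  # stand-in for observing another namespace
]


def can_i_command(verb, resource, namespace=None):
    """Build the `oc auth can-i` command line for one permission check."""
    cmd = ["oc", "auth", "can-i", verb, resource]
    if namespace:
        cmd += ["-n", namespace]
    return cmd


def run_checks(checks=CHECKS):
    """Run every check and return a mapping of command string to the CLI's answer."""
    results = {}
    for verb, resource, namespace in checks:
        cmd = can_i_command(verb, resource, namespace)
        proc = subprocess.run(cmd, capture_output=True, text=True)
        results[" ".join(cmd)] = proc.stdout.strip() or proc.stderr.strip()
    return results


if __name__ == "__main__":
    if shutil.which("oc") is None:
        # No CLI available: only show what would be executed.
        for verb, resource, namespace in CHECKS:
            print(" ".join(can_i_command(verb, resource, namespace)))
    else:
        for command, answer in run_checks().items():
            print(f"{command}: {answer}")
```

Running this while logged in as a project user should show which of the three operations are currently denied.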

schwesig commented 4 days ago

https://redhat-internal.slack.com/archives/C04J5TX0UHZ/p1719402125519979?thread_ts=1718696079.255919&cid=C04J5TX0UHZ @dystewart

We need access to the nvidia-gpu-operator namespace plus at least 2 additional namespaces (we already have one, kruizeoptimization-143c8e; one more is required). Could we create one more namespace within our project, if that is possible?
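As a rough illustration of what this request could translate to in RBAC terms, the sketch below generates three objects: a ClusterRole with read-only access to nodes, a ClusterRoleBinding for it, and a RoleBinding of the built-in `view` ClusterRole in nvidia-gpu-operator. The group name `kruize-team` and the role name `kruize-node-reader` are hypothetical, and read-only access is an assumption; applying MIG configs would likely need broader rights, which is for the cluster admins to decide.

```python
import json

RBAC_API = "rbac.authorization.k8s.io"
GROUP = "kruize-team"                      # hypothetical group name
GPU_OPERATOR_NS = "nvidia-gpu-operator"    # namespace named in the request


def group_subject(group=GROUP):
    """Subject entry that points a binding at a user group."""
    return {"kind": "Group", "apiGroup": RBAC_API, "name": group}


def node_reader_cluster_role(name="kruize-node-reader"):
    """ClusterRole allowing read-only access to nodes (core API group)."""
    return {
        "apiVersion": f"{RBAC_API}/v1",
        "kind": "ClusterRole",
        "metadata": {"name": name},
        "rules": [
            {"apiGroups": [""], "resources": ["nodes"], "verbs": ["get", "list", "watch"]}
        ],
    }


def cluster_role_binding(role_name, group=GROUP):
    """Bind a ClusterRole cluster-wide to the group."""
    return {
        "apiVersion": f"{RBAC_API}/v1",
        "kind": "ClusterRoleBinding",
        "metadata": {"name": f"{role_name}-{group}"},
        "roleRef": {"apiGroup": RBAC_API, "kind": "ClusterRole", "name": role_name},
        "subjects": [group_subject(group)],
    }


def namespace_role_binding(namespace, cluster_role="view", group=GROUP):
    """Bind a ClusterRole (default: built-in `view`) inside one namespace only."""
    return {
        "apiVersion": f"{RBAC_API}/v1",
        "kind": "RoleBinding",
        "metadata": {"name": f"{cluster_role}-{group}", "namespace": namespace},
        "roleRef": {"apiGroup": RBAC_API, "kind": "ClusterRole", "name": cluster_role},
        "subjects": [group_subject(group)],
    }


def build_manifests():
    """Collect all objects into a single List that can be reviewed as one document."""
    role = node_reader_cluster_role()
    return {
        "apiVersion": "v1",
        "kind": "List",
        "items": [
            role,
            cluster_role_binding(role["metadata"]["name"]),
            namespace_role_binding(GPU_OPERATOR_NS),
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_manifests(), indent=2))
```

The JSON output is meant for review by an admin rather than direct use; an additional namespace for the project would be a separate request.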
