Open schwesig opened 4 days ago
What are we waiting for? NESE? The template? Something else?
@schwesig For the cluster I am bringing up, I will add the overlay to the config repo and apply it to the cluster today. The NESE storage request has been completed; however, Hakan is currently working on accessing NESE from the MOC network (https://github.com/CCI-MOC/ops-issues/issues/1329), though I do not think that will take long. CC @hpdempsey
@tssala23 thanks for the update. Just wanted to know where we are at.
V100: easily available and usable long term
A100: available, but allocation must be limited to extensive testing periods and requested only when needed
H100: not yet available, and the timeline is not stable enough to plan the tests yet
follow up from
Details for this issue
This project needs to run in a dedicated test cluster because
Project Overview
This research project focuses on optimizing GPU infrastructure usage through Kruize, a platform that tracks GPU usage for each container. By integrating with OpenShift Observability (Prometheus) and using cost and performance models, Kruize provides recommendations for GPU limits. These recommendations can be enforced by GPU time slice schedulers like Run:ai to enhance GPU utilization, aiming to lower costs and improve performance.
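To make the idea of a per-container GPU limit recommendation concrete, here is a minimal sketch. It is not Kruize's actual model or API: the function names, the nearest-rank percentile rule, the 10% headroom, and the sample values are all assumptions chosen for illustration. It assumes utilization samples (fractions of one GPU, 0.0 to 1.0) have already been collected per container, for example from Prometheus.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile of a non-empty list; q is in (0, 1]."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def recommend_gpu_fraction(samples, q=0.95, headroom=0.10):
    """Hypothetical rule: high-percentile usage plus headroom, capped at one GPU."""
    peak = percentile(samples, q)
    return min(1.0, round(peak * (1.0 + headroom), 2))


# Made-up utilization samples per container (fraction of one GPU).
usage = {
    "trainer": [0.62, 0.70, 0.68, 0.91, 0.66],
    "notebook": [0.05, 0.08, 0.04, 0.30, 0.06],
}

recommendations = {
    name: recommend_gpu_fraction(samples) for name, samples in usage.items()
}

for name, fraction in recommendations.items():
    print(f"{name}: recommended GPU fraction {fraction}")
```

In this sketch a mostly idle container such as `notebook` ends up with a much smaller recommended share than `trainer`, which is the kind of signal a time-slice scheduler could then enforce.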
Goals:
Install Kruize with OpenShift AI to observe and model resource usage.
Provide better resource usage defaults and configuration tuning for improved performance and cost efficiency.
Steps:
CC
Dominika Oliver - doliver@redhat.com
Rebecca Whitworth - rsimmond@redhat.com
Dinakar Guniguntala - dgunigun@redhat.com
@ddoliver @rebeccaSimmonds19 @dinogun @shekhar316 @bharathappali @bhanvimenghani @kusumachalasani
@dystewart @schwesig @Milstein @tssala23