What are we waiting for? NESE? Template?
@schwesig For the cluster I am bringing up, I will be adding the overlay to the config repo and applying it to the cluster today. The NESE storage request has been completed; however, Hakan is currently working on accessing NESE from the MOC network (https://github.com/CCI-MOC/ops-issues/issues/1329), though I do not think that will take long. CC @hpdempsey
@tssala23 thanks for the update. Just wanted to know where we are at.
- V100: easily available and up for long-term usage
- A100: available, but the allocation must be limited to the intensive testing periods and the nodes only involved when needed
- H100: not yet available, and the timeline is not stable enough to plan the tests yet
To Do
- 2024-07-08: check PROD with `nvidia-smi` --> NVIDIA A100-SXM4-40GB
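For reference, one way to run that check from inside the cluster (rather than on the node directly) is a throwaway pod along these lines. This is only a sketch; the namespace and CUDA image tag are assumptions, not what was actually used on PROD:

```yaml
# Hypothetical one-off pod that requests a single GPU and prints the device model.
# Namespace and image tag are assumptions; adjust to whatever is available on the cluster.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smi-check
  namespace: default
spec:
  restartPolicy: Never
  containers:
    - name: nvidia-smi
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubi9   # any CUDA base image that ships nvidia-smi works
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
```

The pod log should then show the same `NVIDIA A100-SXM4-40GB` string as noted above.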
@schwesig does this require only the 4 A100 nodes, or does it also require nodes to run the cluster that the A100 nodes are workers in?
@joachimweyl It also requires additional nodes. At least that is the plan so far; maybe the current or next round of testing will show something different. But as of now: nodes to run the cluster AND A100 nodes.
So 3 FC430s for the controllers and the 4 A100s as the workers; do we need any other workers?
@joachimweyl Taj is setting up a cluster overlay template
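For anyone following along, a per-cluster overlay in a kustomize-based config repo usually boils down to something like the sketch below. The directory layout and file names here are assumptions, not the actual template Taj is preparing:

```yaml
# Hypothetical overlays/nerc-ocp-test-2/kustomization.yaml; paths are illustrative only.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base                   # shared configuration applied to every cluster
patches:
  - path: cluster-settings.yaml  # per-cluster values (API endpoint, storage classes, node labels)
```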
That is the reason why they will be on their own dedicated cluster:
- Kruize is not recommended for production
- they require more recent versions of software than we have in production: OpenShift 4.13.29 and OpenShift AI 2.8.0
- the heavy workload testing can interfere with other projects
- the workload of other projects can interfere with the test results
- the team needs more global rights to slice the GPU nodes, which can interfere with other projects and configs
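To illustrate the last point about cluster-wide rights for slicing the GPU nodes: GPU sharing is configured through cluster-scoped operator settings, for example a time-slicing config for the NVIDIA GPU Operator like the sketch below. This is only an illustration of the kind of config involved (the project itself mentions Run:ai as the enforcing scheduler); the names and replica count are assumptions.

```yaml
# Hypothetical NVIDIA device-plugin time-slicing config; names and replica count are illustrative.
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: nvidia-gpu-operator
data:
  a100-40gb: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4   # advertise each physical A100 as 4 schedulable GPUs
```

Changing this kind of setting affects every workload scheduled onto those nodes, which is exactly why it is risky to do on a shared production cluster.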
- cluster is set up
- project has been using it since yesterday
- in review: we are still waiting for 2 more users to be added
- leaving this open until the GPU node from PROD is removed
- dedicated cluster is running
- interim project on PROD was deleted and cleaned up
- this can be closed
- follow-up support in issue 3 of 4: https://github.com/nerc-project/operations/issues/625
follow up from
Details for this issue
just related, not a must-do for this issue:
This project needs to run in a dedicated test cluster because
Project Overview
This research project focuses on optimizing GPU infrastructure usage through Kruize, a platform that tracks GPU usage for each container. By integrating with OpenShift Observability (Prometheus) and using cost and performance models, Kruize provides recommendations for GPU limits. These recommendations can be enforced by GPU time slice schedulers like Run:ai to enhance GPU utilization, aiming to lower costs and improve performance.
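As a rough illustration of the observability side, per-GPU utilization is typically scraped from the DCGM exporter and can be aggregated per namespace/pod with a recording rule like the sketch below. The rule name, namespace, and label set are assumptions, not part of the actual project setup:

```yaml
# Hypothetical recording rule aggregating DCGM GPU utilization per pod;
# names and labels are assumptions about the monitoring stack.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-usage-recording
  namespace: openshift-monitoring
spec:
  groups:
    - name: gpu-usage
      rules:
        - record: namespace_pod:gpu_utilization:avg
          expr: avg by (namespace, pod) (DCGM_FI_DEV_GPU_UTIL)
```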
Goals:
- Install Kruize with OpenShift AI to observe and model resource usage.
- Provide better resource usage defaults and configuration tuning for improved performance and cost efficiency.
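Concretely, the second goal comes down to suggesting values for per-container requests and limits, roughly of the shape below. The workload name, image, and numbers are purely illustrative placeholders, not actual recommendations:

```yaml
# Purely illustrative: the kind of per-container sizing a recommendation would adjust.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-service          # placeholder workload name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: inference-service
  template:
    metadata:
      labels:
        app: inference-service
    spec:
      containers:
        - name: model-server
          image: quay.io/example/model-server:latest   # placeholder image
          resources:
            requests:
              cpu: "2"
              memory: 8Gi
              nvidia.com/gpu: 1
            limits:
              cpu: "4"
              memory: 16Gi
              nvidia.com/gpu: 1
```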
Steps:
CC
Dominika Oliver - doliver@redhat.com
Rebecca Whitworth - rsimmond@redhat.com
Dinakar Guniguntala - dgunigun@redhat.com
@ddoliver @rebeccaSimmonds19 @dinogun @shekhar316 @bharathappali @bhanvimenghani @kusumachalasani
@dystewart @schwesig @Milstein @tssala23
128.31.20.90 api.nerc-ocp-test-2.nerc.mghpcc.org
128.31.20.112 console-openshift-console.apps.nerc-ocp-test-2.nerc.mghpcc.org
128.31.20.112 oauth-openshift.apps.nerc-ocp-test-2.nerc.mghpcc.org