Spike: Maximize A100 GPU revenue on OpenStack and OpenShift

msdisme commented 2 months ago

This issue gathers details on maximizing A100 GPU revenue on both OpenStack and OpenShift so that we can create an epic.

The goal is to identify which tools we need to maximize GPU revenue and which of them we already have in place or in progress.

I've been thinking about maximizing our A100 GPU usage on both OpenStack and OpenShift, and I would like some feedback on my thoughts so that we can improve them.

I've written a first pass in the style of a loose epic with some user stories. To simplify commenting and review, here is a Google doc.

Epic: As a team, we need to maximize our A100 GPU usage on both OpenStack and OpenShift by implementing a dashboard that gives us real-time visibility into GPU availability, a system to notify us when nodes become available, and a way to monitor GPU and CPU usage for jobs.

User stories:

As a team member, I want a dashboard that displays GPUs' real-time availability so that we can efficiently allocate and utilize them.
As a team member, I want to receive notifications when nodes become available, so that we can quickly promote their availability and run jobs.
As a team member, I want to be able to split jobs across MIG virtualized GPUs and monitor their usage, so that we can optimize our resource allocation and reduce costs.
As a team member, I want to be able to track GPU and CPU usage for jobs and get insights on which jobs can be split across MIG virtualized GPUs, so that we can optimize performance and reduce costs.
As a team member, I want to be able to figure out pricing for MIG virtualized GPUs, so that we can make informed decisions on resource allocation and optimize costs.
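
For the pricing story, one possible starting point is to price each MIG instance in proportion to its share of the GPU's compute slices. The sketch below is only an illustration under stated assumptions: it assumes A100 40GB MIG profiles (7 compute slices per GPU), and the function name `mig_hourly_price` and the full-GPU hourly rate passed to it are hypothetical, not an agreed pricing model.

```python
# Compute slices per MIG profile, assuming an A100 40GB (7 slices per GPU).
MIG_COMPUTE_SLICES = {
    "1g.5gb": 1,
    "2g.10gb": 2,
    "3g.20gb": 3,
    "4g.20gb": 4,
    "7g.40gb": 7,
}
TOTAL_COMPUTE_SLICES = 7


def mig_hourly_price(profile: str, full_gpu_hourly: float) -> float:
    """Hypothetical helper: price a MIG instance by its share of compute slices.

    `full_gpu_hourly` is a placeholder rate for a whole A100, supplied by the caller.
    """
    slices = MIG_COMPUTE_SLICES[profile]
    return round(full_gpu_hourly * slices / TOTAL_COMPUTE_SLICES, 4)


if __name__ == "__main__":
    # Print the proportional price of each profile for a placeholder rate.
    for name in MIG_COMPUTE_SLICES:
        print(name, mig_hourly_price(name, full_gpu_hourly=7.0))
```

Proportional pricing ignores the memory split between profiles (for example, `3g.20gb` and `4g.20gb` have the same memory), so a real model might weight memory as well.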

hpdempsey commented 2 months ago

I provided some comments/questions on the Google doc.

schwesig commented 1 month ago

@schwesig added one task/idea in today's meeting: the dashboard should show a kind of warning, e.g. "we have too much unused GPU, go and share the idea of using it"
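
A rough sketch of how such a warning could be derived from per-GPU utilization readings is shown below. Everything here is an assumption for illustration: the function name `idle_gpu_warning`, the 5% idle threshold, and the 50% warning fraction are hypothetical, and the readings would come from whatever metrics source the dashboard ends up using.

```python
from typing import Dict, Optional


def idle_gpu_warning(
    utilization_by_gpu: Dict[str, float],
    idle_below: float = 5.0,      # assumed: under 5% utilization counts as idle
    warn_fraction: float = 0.5,   # assumed: warn when at least half the GPUs are idle
) -> Optional[str]:
    """Return a warning message if too many GPUs are idle, otherwise None."""
    if not utilization_by_gpu:
        return None
    idle = sorted(
        name for name, util in utilization_by_gpu.items() if util < idle_below
    )
    if len(idle) / len(utilization_by_gpu) >= warn_fraction:
        return f"{len(idle)} of {len(utilization_by_gpu)} GPUs idle: {', '.join(idle)}"
    return None


if __name__ == "__main__":
    # Placeholder readings (percent utilization), not real measurements.
    print(idle_gpu_warning({"gpu0": 0.0, "gpu1": 90.0}))
```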

msdisme commented 1 week ago

Closing, as this is now tracked in https://github.com/CCI-MOC/ops-issues/issues/1328

nerc-project / operations

Spike: Maximize A100 GPU revenue on OpenStack and OpenShift #550