msdisme closed this 1 week ago
I provided some comments/questions on the Google doc.
@schwesig added one task/idea in today's meeting: the dashboard should show a kind of warning, e.g. "we have too much unused GPU, go and share the idea of using it".
Closing, as this is now tracked in https://github.com/CCI-MOC/ops-issues/issues/1328
This is an issue to gather details on maximizing A100 GPU revenue on both OpenStack and OpenShift to create an Epic.
The goal is to identify what tools we need to maximize GPU revenue and which of them we already have in place or in progress.
I've been thinking about maximizing our A100 GPU usage on both OpenStack and OpenShift, and I would like some feedback on my thoughts so that we can improve them.
I've written a first pass in the style of a (loose) epic and some user stories. To simplify the comment/review process, here is a Google doc.
Epic: As a team, we need to maximize our A100 GPU usage on both OpenStack and OpenShift by implementing a dashboard that gives us real-time visibility into GPU availability, a system to notify us when nodes become available, and a way to monitor GPU and CPU usage for jobs.
User stories: