Open schwesig opened 2 months ago
On Tue, Sep 3, 2024 at 1:47 PM Hema Veeradhi @.***> wrote:
It would be great to track metrics like:
- Total number of GPUs on cluster
- Number of available GPUs
- Memory utilization for each GPU - how much memory is being used in each GPU
Thorsten is still on vacation, but I know he was working on this. I don't know if we will have something in time for your 9/11 request, but that is an internal demo only, so I think we have a bit more flexibility. I will talk to Chris, who is also working on this, and we can get back to you on the dates later in the week.
Discussion with @hpdempsey and @computate: any cluster that is not connected to the INFRA cluster's ACM cannot easily be tracked via the OBS cluster. This issue is therefore defined as blocked for now.
However, I also had a call with Tom C. about it, and the NVIDIA Operator provides dashboards in the OpenShift Web Console that should be sufficient:
e.g.
sum by (exported_pod, exported_namespace) (DCGM_FI_DEV_GPU_UTIL{instance=~".+", gpu=~".+", exported_pod=~".+"})
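For the three metrics requested in this issue (total GPUs, available GPUs, memory utilization per GPU), a minimal Python sketch against the Prometheus HTTP API (`/api/v1/query`) could look like the following. It is only an illustration under assumptions: `PROMETHEUS_URL` is a placeholder, the helper names are hypothetical, `DCGM_FI_DEV_FB_USED` and `DCGM_FI_DEV_FB_FREE` are assumed to be exported alongside `DCGM_FI_DEV_GPU_UTIL`, and a GPU is treated as "in use" when its series carries a non-empty `exported_pod` label.

```python
import json
import urllib.parse
import urllib.request

# Placeholder endpoint (assumption): replace with the cluster's Prometheus/Thanos route.
PROMETHEUS_URL = "https://prometheus.example.invalid"

# PromQL per requested metric. Assumes one DCGM series per GPU and that the
# FB_USED / FB_FREE metrics are enabled in the DCGM exporter configuration.
QUERIES = {
    "total_gpus": "count(DCGM_FI_DEV_GPU_UTIL)",
    "gpus_in_use": 'count(DCGM_FI_DEV_GPU_UTIL{exported_pod=~".+"})',
    "memory_used_ratio": (
        "DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)"
    ),
}


def build_query_url(base_url, promql):
    """Return the instant-query URL for a PromQL expression."""
    params = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


def run_query(base_url, promql, token=None):
    """Run an instant query and return the list under data.result."""
    request = urllib.request.Request(build_query_url(base_url, promql))
    if token:
        # OpenShift monitoring routes usually require a bearer token.
        request.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(request, timeout=10) as response:
        payload = json.load(response)
    return payload["data"]["result"]


def first_value(result, default=0.0):
    """Extract the sample value of the first series, or a default if empty."""
    if not result:
        return default
    return float(result[0]["value"][1])


def available_gpus(total, in_use):
    """Available GPUs, computed client-side as total minus in-use."""
    return max(total - in_use, 0.0)


def collect(base_url, token=None):
    """Gather the three metrics requested in this issue."""
    total = first_value(run_query(base_url, QUERIES["total_gpus"], token))
    in_use = first_value(run_query(base_url, QUERIES["gpus_in_use"], token))
    memory = run_query(base_url, QUERIES["memory_used_ratio"], token)
    return {
        "total_gpus": total,
        "available_gpus": available_gpus(total, in_use),
        "memory_used_ratio": memory,  # one series per GPU
    }
```

The subtraction is done in Python rather than in PromQL because `count()` over an empty selection returns no series, so a pure PromQL `count(...) - count(...)` would yield nothing when no GPU is in use.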
Currently defined as blocked; see comment https://github.com/nerc-project/operations/issues/705#issuecomment-2343538161
Observability Dashboard for MOC InstructLab (ocp-beta-test) Cluster
Motivation
ET needs to monitor GPU usage on the MOC InstructLab cluster.
Trigger
https://massopencloud.slack.com/archives/C027TDE52TZ/p1725376578285379
Completion Criteria
Description
Completion dates
Desired - 2024-09-11
Required - ?
Estimate
Involved
Must
@schwesig
Optionally
@DanNiESh @computate