nerc-project / operations

Issues related to the operation of the NERC OpenShift environment

Observability Dashboard for MOC InstructLab (ocp-beta-test) Cluster #705

Open schwesig opened 2 weeks ago

schwesig commented 2 weeks ago

Currently defined as blocked; see comment https://github.com/nerc-project/operations/issues/705#issuecomment-2343538161

Observability Dashboard for MOC InstructLab (ocp-beta-test) Cluster

Motivation

ET needs to monitor GPU usage on the MOC InstructLab cluster.

Trigger

https://massopencloud.slack.com/archives/C027TDE52TZ/p1725376578285379

Completion Criteria

Description

Completion dates

Desired - 2024-09-11
Required - ?

Estimate

Involved

Must

@schwesig

Optionally

@DanNiESh @computate

hemajv commented 2 weeks ago

It would be great to track metrics like:

  1. Total number of GPUs on cluster
  2. Number of available GPUs
  3. Memory utilization for each GPU - how much memory is being used in each GPU
  4. Processes/applications running on each GPU - what process/application is being run on each GPU
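
As a rough sketch of how the four metrics listed above could be derived from per-GPU samples, assuming the samples come from something like the DCGM exporter (for example framebuffer used/free series such as DCGM_FI_DEV_FB_USED and DCGM_FI_DEV_FB_FREE). The names `GpuSample` and `summarize` are hypothetical, and the pod name is used here as a stand-in for "process/application", since that is an assumption about what the exporter can attribute:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GpuSample:
    """One per-GPU reading (hypothetical shape, not an exporter API)."""
    node: str
    gpu: str
    fb_used_mib: float          # framebuffer memory in use
    fb_free_mib: float          # framebuffer memory still free
    pod: Optional[str] = None   # workload attributed to the GPU, if any


def summarize(samples):
    """Compute total GPUs, available GPUs, and per-GPU memory use and workload."""
    per_gpu = {}
    for s in samples:
        capacity = s.fb_used_mib + s.fb_free_mib
        used_pct = 100.0 * s.fb_used_mib / capacity if capacity else 0.0
        per_gpu[(s.node, s.gpu)] = {
            "memory_used_pct": round(used_pct, 1),
            "pod": s.pod,
        }
    return {
        "total_gpus": len(samples),
        # Assumption: a GPU with no attributed pod counts as available.
        "available_gpus": sum(1 for s in samples if s.pod is None),
        "per_gpu": per_gpu,
    }
```

Treating "no pod attributed" as "available" is only one possible definition; a cluster that schedules GPUs via resource requests might prefer to compare allocatable and requested GPU counts instead.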

hpdempsey commented 2 weeks ago

On Tue, Sep 3, 2024 at 1:47 PM Hema Veeradhi @.***> wrote:

It would be great to track metrics like:

  1. Total number of GPUs on cluster
  2. Number of available GPUs
  3. Memory utilization for each GPU - how much memory is being used in each GPU

Thorsten is still on vacation, but I know he was working on this. I don't know if we will have something in time for your 9/11 request, but that is an internal demo only, so I think we have a bit more flexibility. I will talk to Chris, who is also working on this, and we can get back to you on the dates later in the week.


schwesig commented 1 week ago

After a discussion with @hpdempsey and @computate: any cluster that is not connected to the INFRA cluster's ACM cannot easily be tracked via the OBS cluster. This issue is therefore defined as blocked for now.

However, I also had a call with Tom C. about it, and the NVIDIA Operator provides dashboards that should be sufficient via the OpenShift Web Console, e.g.: sum by (exported_pod, exported_namespace) (DCGM_FI_DEV_GPU_UTIL{instance=~".+", gpu=~".+", exported_pod=~".+"})
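
For anyone who wants the same numbers outside the web console, here is a minimal sketch of running that query against a Prometheus-compatible HTTP API. It assumes the cluster's metrics endpoint speaks the standard Prometheus instant-query API (`/api/v1/query`); the base URL is a placeholder, authentication is left out, and `build_query_url` and `parse_vector` are hypothetical helper names:

```python
import json
from urllib.parse import urlencode

# The PromQL expression quoted in the comment above.
QUERY = (
    "sum by (exported_pod, exported_namespace) "
    '(DCGM_FI_DEV_GPU_UTIL{instance=~".+", gpu=~".+", exported_pod=~".+"})'
)


def build_query_url(base_url, query=QUERY):
    """Build an instant-query URL for a Prometheus-compatible API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': query})}"


def parse_vector(payload):
    """Turn an instant-query JSON response into (namespace, pod, value) rows."""
    body = json.loads(payload)
    if body.get("status") != "success":
        raise ValueError(f"query failed: {body.get('error', 'unknown error')}")
    rows = []
    for item in body["data"]["result"]:
        labels = item["metric"]
        _timestamp, value = item["value"]  # the value arrives as a string
        rows.append((
            labels.get("exported_namespace", ""),
            labels.get("exported_pod", ""),
            float(value),
        ))
    return sorted(rows)
```

The URL from `build_query_url` can be fetched with any HTTP client that supplies the cluster's bearer token; `parse_vector` then only needs the response body as text.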