ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[dashboard] Wonky GPU display #14664

Open DmitriGekhtman opened 3 years ago

DmitriGekhtman commented 3 years ago

What is the problem?

Two bugs:

(1) The GPU field for an n-GPU node looks like this: '[0]: N/A [1]: N/A [2]: N/A ... [n-1]: N/A', which isn't very informative. Hovering the mouse over each index shows a tooltip with the GPU type.

(2) If you launch a multi-GPU head (e.g. g4dn.12xlarge) and a single-GPU worker (e.g. p2.xlarge), the info rows for the head and worker may swap with each other every few seconds, which makes it hard to read the dashboard.
I saw this when launching on AWS and K8s a few hours ago. When I last tried a few minutes ago, the bug didn't appear.

Screen Shot 2021-03-13 at 5 17 46 AM

Ray version and other system information (Python version, TensorFlow version, OS): cluster launcher 2.0.0dev, rayproject/ray:nightly-gpu docker image

Reproduction (REQUIRED)

Please provide a short code snippet (less than 50 lines if possible) that can be copy-pasted to reproduce the issue. The snippet should have no external library dependencies (i.e., use fake or mock data / environments):

If the code snippet cannot be run by itself, the issue will be closed with "needs-repro-script".

scottsun94 commented 1 year ago

(1) Not sure if this still happens in the new dashboard. (2) This will be fixed by better default sorting, which prevents nodes from shifting around in the node table. cc: @alanwguo
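
To illustrate the default-sorting idea, here is a minimal sketch and not the dashboard's actual code. `NodeRow`, its fields, and `stable_node_order` are hypothetical names chosen for the example. The point is only that sorting on a key with no ties (head node first, then IP, then node ID) gives the same row order no matter which order the node updates arrive in.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeRow:
    # Hypothetical fields; the real dashboard payload may differ.
    node_id: str
    ip: str
    is_head: bool
    gpu_count: int


def stable_node_order(rows):
    """Return rows in an order that does not depend on arrival order."""
    # Head node first (False sorts before True), then a total order on
    # (ip, node_id) so that no two distinct rows can compare as equal.
    return sorted(rows, key=lambda r: (not r.is_head, r.ip, r.node_id))


head = NodeRow("node-a", "10.0.0.1", True, 4)
worker = NodeRow("node-b", "10.0.0.2", False, 1)

# Both arrival orders produce the same table order.
assert stable_node_order([head, worker]) == stable_node_order([worker, head])
```

With a key like this, the head and worker rows would stay in place across refreshes, assuming the swapping comes from the rows arriving in a different order each time.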

DmitriGekhtman commented 1 year ago

I'm guessing that this might have been resolved, @alanwguo can confirm.

scottsun94 commented 1 year ago

Screen Shot 2022-10-24 at 5 23 22 PM Screen Shot 2022-10-24 at 5 24 10 PM

Took a quick look and it seems that:

  1. The CSS style of the GPU stats is off; it still uses the one from the old dashboard.
  2. The tooltip could be improved to show more info.

Screen Shot 2022-10-24 at 5 34 36 PM

rkooo567 commented 1 year ago

cc @alanwguo @scottsun94 Are we going to fix this as part of the frontend revamp?

scottsun94 commented 1 year ago

We could keep it as a p1 or p2 and fix it when we polish each page after Chao is on board.

scottsun94 commented 1 year ago

If we could show per-process GPU usage and GPU memory (GRAM) usage, that would be great!
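
As a rough sketch of where per-process GPU memory numbers could come from (an assumption, not how the Ray dashboard collects them): `nvidia-smi --query-compute-apps=pid,used_memory --format=csv,noheader,nounits` prints one line per process per GPU, and the lines can be summed per PID. The helper name `gpu_memory_by_pid` and the sample text below are made up for illustration. This covers memory only, not per-process utilization.

```python
import csv
import io
from collections import defaultdict

# Input is assumed to come from (requires an NVIDIA driver on the node):
#   nvidia-smi --query-compute-apps=pid,used_memory --format=csv,noheader,nounits


def gpu_memory_by_pid(csv_text):
    """Sum GPU memory (MiB) per process ID from nvidia-smi CSV output."""
    totals = defaultdict(int)
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) != 2:
            continue  # skip blank or malformed lines
        pid, used = row[0].strip(), row[1].strip()
        if not (pid.isdigit() and used.isdigit()):
            continue  # skip non-numeric values such as "[N/A]"
        totals[int(pid)] += int(used)
    return dict(totals)


# Made-up sample: PID 1234 holds memory on two GPUs, PID 5678 on one.
sample = "1234, 2048\n1234, 512\n5678, 1024\n"
print(gpu_memory_by_pid(sample))  # {1234: 2560, 5678: 1024}
```

Mapping these PIDs back to Ray workers or tasks would need extra bookkeeping that is not shown here.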

Screen Shot 2023-01-27 at 3 06 59 PM

scottsun94 commented 1 year ago

A potential use case for having this: https://github.com/ray-project/ray/issues/31998

rkooo567 commented 1 year ago

Let's bump up the priority. Flexible GPU usage is the main use case of Ray, so we should have the best observability possible.

sip-aravind-g commented 8 months ago

Does this issue still persist? I'm guessing it may have been resolved. Can you confirm?