ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
33.69k stars 5.72k forks

[Ray Dashboard] Observe GRAM Process Consumption #46213

Open gilljon opened 4 months ago

gilljon commented 4 months ago

Description

It would be very useful for the Ray Dashboard to show how much GRAM (GPU memory) a given Ray actor or task is actually consuming.

[Screenshot 2024-06-11 at 1 28 28 PM (1)]

Use case

When allocating GPU resources (which can be fractional), it would be helpful to see how much GPU memory a given Ray actor or task actually consumes. Based on that, you can make a better-informed allocation decision.
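Until the dashboard exposes this, one rough way to get per-process numbers is to read them from `nvidia-smi`. The sketch below is an illustration, not part of Ray: the helper names `parse_compute_apps` and `gpu_memory_by_pid` are hypothetical, and it assumes an NVIDIA GPU with `nvidia-smi` on the PATH. Mapping the PIDs back to Ray actors or tasks is left out.

```python
import csv
import io
import shutil
import subprocess

# nvidia-smi query for per-process GPU memory, in MiB, without header or units.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text):
    """Map PID -> GPU memory in MiB, summed over all GPUs the process uses."""
    usage = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != 2:
            continue  # skip blank or malformed lines
        pid_str, mem_str = (field.strip() for field in row)
        if not pid_str.isdigit() or not mem_str.isdigit():
            continue  # skip non-numeric values such as "[N/A]"
        pid = int(pid_str)
        usage[pid] = usage.get(pid, 0) + int(mem_str)
    return usage


def gpu_memory_by_pid():
    """Run nvidia-smi and return {pid: MiB}; empty if nvidia-smi is missing."""
    if shutil.which("nvidia-smi") is None:
        return {}
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_compute_apps(result.stdout)
```

Calling `gpu_memory_by_pid()` on a GPU node returns a dictionary keyed by worker PID, which can then be matched by hand against the PIDs shown for each actor in the dashboard.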

liuxsh9 commented 4 months ago

We have had similar ideas. In addition to the actual GRAM usage at the actor/task level, do you think it would help to add a column describing the logical resource usage of the GPU or XPU?

[Image: 20240627-144155(WeLinkPC)]

Something like that could help explain why the GRAM has a certain amount of idle capacity, or help identify cases where the logical resource allocation deviates significantly from the actual usage. Looking forward to hearing your thoughts. @gilljon
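To make the "deviation between logical allocation and actual usage" idea concrete, here is a minimal sketch of the comparison such a column would enable. Everything in it is an assumption for illustration: the `GpuUsage` record, the `find_mismatches` helper, and the 10% tolerance are hypothetical, and the logical fraction stands for whatever fractional GPU amount was requested for the actor or task.

```python
from dataclasses import dataclass


@dataclass
class GpuUsage:
    name: str                # actor or task name (hypothetical label)
    logical_fraction: float  # fraction of one GPU requested, e.g. 0.25
    used_mib: int            # GPU memory actually in use, in MiB


def find_mismatches(workers, total_mib, tolerance=0.10):
    """Return (name, gap) pairs where actual and logical shares diverge.

    gap = actual memory share - logical fraction, so a positive gap means the
    worker uses more GPU memory than its logical share suggests.
    """
    report = []
    for worker in workers:
        actual_fraction = worker.used_mib / total_mib
        gap = actual_fraction - worker.logical_fraction
        if abs(gap) > tolerance:
            report.append((worker.name, round(gap, 3)))
    return report
```

For example, on a 16384 MiB GPU, a worker that requested 0.5 of the GPU but uses 4096 MiB would be reported with a gap of -0.25, which points at idle capacity that a smaller logical request could free up.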

liuxsh9 commented 4 months ago

Additionally, the GRAM info at actor/task level is actually available, but there's a small bug that's preventing it from being displayed. We will help fix it.

gilljon commented 4 months ago

> We also have similar ideas. I'd like to know, in addition to the actual GRAM usage at the actor/task level, do you think adding a column to describe the logical resource usage of the GPU or XPU would be helpful?

I do think describing the logical resource usage would be useful. We saturate our GPUs, so I always see about 90% utilization, but I can't tell what is actually consuming the resources.