ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
34.2k stars, 5.81k forks

[dashboard/core?] Disk space displayed in dashboard doesn't match size of disk. #48783

Open Joshuaalbert opened 1 week ago

Joshuaalbert commented 1 week ago

What happened + What you expected to happen

When I set --temp-dir on the head node, the new location doesn't seem to be reflected in the dashboard. The head is started like this:

ray start --head --dashboard-host=0.0.0.0 --metrics-export-port=8090 --temp-dir=/path/bigdisk/temp

The worker is started like this:

ray start --address="ray_head:${RAY_REDIS_PORT}"

The dashboard shows this:

[image: dashboard screenshot]

The first line is the head node, the second is a worker node. Both run in containers, with their working directories on volumes mounted from 10TB disks. Why is the head showing only 50GB? That's the size of / on the host, which the container shouldn't have access to.
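One way to see what the container itself reports is to query disk totals from inside it. The following is a minimal diagnostic sketch, not Ray's actual dashboard logic: `disk_totals_gb` is a hypothetical helper, and the second path is only a stand-in for the directory passed to --temp-dir.

```python
import shutil
import tempfile


def disk_totals_gb(paths):
    """Return {path: total size in GB} as seen from inside this process."""
    totals = {}
    for path in paths:
        # On POSIX, shutil.disk_usage reports the capacity of the
        # filesystem that backs the given path.
        usage = shutil.disk_usage(path)
        totals[path] = round(usage.total / 1e9, 1)
    return totals


if __name__ == "__main__":
    # "/" is the container's root filesystem. tempfile.gettempdir() is a
    # stand-in (assumption) for the --temp-dir location.
    for path, total in disk_totals_gb(["/", tempfile.gettempdir()]).items():
        print(f"{path}: {total} GB total")
```

If the two paths report different totals, they sit on different filesystems, and a tool that only inspects / would not reflect the size of the mounted volume.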

Versions / Dependencies

Ray 2.37; the behavior is the same on 2.39.

Reproduction script

ray start --head --dashboard-host=0.0.0.0 --metrics-export-port=8090 --temp-dir=/path/bigdisk/temp

Then look at the disk column in the dashboard.

Issue Severity

High: It blocks me from completing my task.

gitlijian commented 1 week ago

Hi @Joshuaalbert,

  1. Run the mount command on the head node and on the worker node to confirm that your mount paths are correct.
  2. If the mount paths are correct, there may be a bug in Ray's logic.
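The mount check can also be scripted. This is a rough sketch that assumes a Linux container where /proc/mounts is readable; `find_mount` and the sample table are hypothetical and only illustrate how to find the mount that backs a given path.

```python
import posixpath


def find_mount(path, mounts_text):
    """Return (device, mount_point, fstype) of the mount containing path.

    mounts_text uses the /proc/mounts layout: device, mount point and
    filesystem type are the first three whitespace-separated fields.
    """
    path = posixpath.normpath(path)
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        device, mount_point, fstype = fields[:3]
        prefix = mount_point.rstrip("/") + "/"
        if (path + "/").startswith(prefix):
            # Prefer the longest mount point; on ties, the later entry wins
            # because it was mounted on top of the earlier one.
            if best is None or len(mount_point) >= len(best[1]):
                best = (device, mount_point, fstype)
    return best


# Hypothetical mount table, for illustration only.
SAMPLE = """overlay / overlay rw,relatime 0 0
/dev/sdb1 /path/bigdisk ext4 rw,relatime 0 0
"""

if __name__ == "__main__":
    print(find_mount("/path/bigdisk/temp", SAMPLE))
    print(find_mount("/etc", SAMPLE))
```

On a live node, the mount table could be read with `open("/proc/mounts").read()` and passed in place of `SAMPLE`.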

Joshuaalbert commented 1 week ago

Okay, I solved it, not by changing anything on the Ray side but on the Docker side. That makes me suspect that Ray has some unwanted behaviour when run under Docker. I'll explain.

I noticed that the 50GB shown for the head node in the screenshot above matches the size of the disk that stores Docker images. This is really weird, because nowhere am I mounting that partition as a volume in the container. So I tried moving the Docker data dir to a different disk and, lo and behold, the storage shown in the dashboard changed to reflect that.

Joshuaalbert commented 1 week ago

Another useful piece of info: the Docker storage drivers differ between the two nodes. The head node uses overlay2 (kernel space) and the worker uses fuse-overlayfs (user space). When I switched the storage driver on the head node to fuse-overlayfs, the dashboard started showing the correct storage size.
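To compare the storage driver across nodes, the value can be read from `docker info` on each host. The sketch below assumes the docker CLI is installed and on PATH; `docker_storage_driver` is a hypothetical helper, and the `run` parameter exists only so the call can be swapped out in tests.

```python
import subprocess


def docker_storage_driver(run=subprocess.run):
    """Return Docker's storage driver name, or None if it cannot be queried."""
    try:
        result = run(
            ["docker", "info", "--format", "{{.Driver}}"],
            capture_output=True,
            text=True,
            check=False,
        )
    except OSError:
        # The docker CLI is missing or cannot be executed.
        return None
    if result.returncode != 0:
        # The daemon is unreachable or the user lacks permission.
        return None
    return result.stdout.strip() or None


if __name__ == "__main__":
    print(docker_storage_driver())
```

Running this on the head and worker hosts (not inside the containers) should show whether the two nodes differ, for example overlay2 on one and fuse-overlayfs on the other.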