ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Ray Clusters] Only the first few worker nodes sync files (file mount) #47916

Open dennymarcels opened 1 month ago

dennymarcels commented 1 month ago

What happened + What you expected to happen

Hi, I rely on Ray Tune to tune hyperparameters on GCP. I am now on the latest version (2.37.0). After launching my training pipeline, only the first few worker nodes sync the files I listed under the file_mounts key in the configuration file; the others do not, so the process cannot complete. As expected, all worker nodes run in the very same environment, which baffles me…
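For reference, this is roughly the layout I mean; a minimal sketch, not my actual config, and the paths are placeholders:

```yaml
# Sketch of the relevant part of the cluster config (placeholder paths).
# file_mounts maps a path on each node to a path on the machine running `ray up`.
file_mounts:
  /home/ray/project: /local/path/to/project   # placeholder
  /home/ray/data: /local/path/to/data         # placeholder
```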

I also noticed in the logs, under the “[2/7] Processing file mounts” stage, that those first workers do indeed sync files, but for the subsequent ones nothing syncs and I get something like “No worker file mounts to sync”.

I tried to send my files via the worker_setup_commands instead, copying from GCP storage with the gsutil tool, but that didn’t work either; it seems the commands under this key are being ignored. In fact, I just noticed that the setup_commands are also being skipped, as the log says "No setup_commands to run".
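Roughly what I attempted; again a sketch with a placeholder bucket name and paths, not my real config:

```yaml
# Sketch of the workaround attempt (bucket and paths are placeholders).
worker_setup_commands:
  - gsutil -m cp -r gs://my-bucket/project /home/ray/project
```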

What might be going wrong? How can I investigate further?

Versions / Dependencies

Ray Tune, Serve 2.37.0

Reproduction script

I suspect trying to reproduce any of the tutorials would hit the same issue. I don't think it is related to my code; otherwise no worker node would sync, but the first ones do.

Issue Severity

High: It blocks me from completing my task.

dayshah commented 1 week ago

Hey @dennymarcels

  1. For your initial issue with file_mounts not syncing on all the machines, is it possible that the file_mounts contents change while your cluster is acquiring more nodes? Setting file_mounts_sync_continuously to true in your config file might help (see the snippet after this list). Let me know if that changes anything.

  2. For your alternative attempt through worker_setup_commands, can you share what your setup_commands and worker_setup_commands lists are? Something might be filtering them out before they actually reach the worker nodes.
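For point 1, a minimal sketch of what I mean (all other config keys omitted):

```yaml
# With this flag set, the autoscaler keeps re-syncing file_mounts as the
# cluster scales, instead of only doing it when a node is first set up.
file_mounts_sync_continuously: true
```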