[Ray Clusters] Only the first few worker nodes sync files (file mount)

What happened + What you expected to happen

Hi, I rely on Ray Tune to tune hyperparameters on GCP, and I am on the latest version (2.37.0). After I launch my training pipeline, only the first few worker nodes sync the files I listed under the file_mounts key in the configuration file. The others do not, so the process cannot complete. All worker nodes are on the very same environment, as they should be, which is why this baffles me.

I also noticed in the logs, under the “[2/7] Processing file mounts” stage, that those first workers do indeed sync files, but the later ones sync nothing, and I get something like “No worker file mounts to sync”.

I tried to send my files via the worker_setup_commands, copying them from GCP storage with the gsutil tool, but that didn’t work; the commands under this key seem to be ignored. In fact, I just noticed that the setup_commands are also being skipped, as the log says "No setup_commands to run".

What might be going wrong? How can I investigate further?
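
To narrow this down, here is a minimal sketch of how I could count the skip messages in a saved log. It only assumes the two phrases quoted above appear verbatim in the log lines; the helper name count_skip_markers is my own, not part of Ray, and the real log wording or location may differ.

```python
from collections import Counter
from typing import Iterable

# Phrases quoted in this report; the exact wording in real logs is an assumption.
SKIP_MARKERS = (
    "No worker file mounts to sync",
    "No setup_commands to run",
)


def count_skip_markers(lines: Iterable[str]) -> Counter:
    """Count how many log lines contain each skip marker (hypothetical helper)."""
    # Start every marker at zero so absent markers still show up in the result.
    counts = Counter({marker: 0 for marker in SKIP_MARKERS})
    for line in lines:
        for marker in SKIP_MARKERS:
            if marker in line:
                counts[marker] += 1
    return counts
```

It could be fed an open file, for example the autoscaler monitor log on the head node (I believe it lives under /tmp/ray/session_latest/logs/, but treat that path as an assumption). Comparing the counts with the number of workers launched would show how many nodes skipped each stage.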

Versions / Dependencies

Ray Tune, Serve 2.37.0

Reproduction script

I guess trying to reproduce any of the tutorials would hit the same issue. I don't think it is code related (on my part); otherwise no worker node would sync, but the first ones do.

Issue Severity

High: It blocks me from completing my task.
