Open dennymarcels opened 1 month ago
Hey @dennymarcels
For your initial issue with file_mounts not syncing on all the machines, is it possible that the file_mounts contents change while your cluster is acquiring more nodes? Setting file_mounts_sync_continuously to true in your config file might help. Let me know if that changes anything.
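As a rough sketch of what I mean, the flag is a top-level key in the cluster YAML, next to file_mounts. The mount paths below are hypothetical placeholders, not taken from your setup, so substitute your own:

```yaml
# Fragment of a Ray cluster launcher config (cluster YAML).
# Both paths are hypothetical placeholders; replace them with your own.
file_mounts:
  "/remote/path/on/node": "/local/path/on/your/machine"

# Top-level flag (default: false). When true, the autoscaler keeps
# syncing file_mounts contents to the workers instead of syncing
# only during initial node setup.
file_mounts_sync_continuously: true
```

After changing the config, you would typically need to re-run `ray up` with the same file for the change to take effect.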
For your alternative attempt through worker_setup_commands, can you share what your setup_commands and worker_setup_commands lists are? Something might be filtering them out before they actually reach the worker nodes.
What happened + What you expected to happen
Hi, I rely on Ray Tune to tune hyperparameters on GCP. I am now on the latest version (2.37.0). After launching my training pipeline, only the first few worker nodes sync the files I listed under the file_mounts key in the configuration file; the others do not, so the process cannot complete. As expected, all worker nodes are in the very same environment, which baffles me. I also noticed in the logs, under the “[2/7] Processing file mounts” stage, that those first workers do indeed sync files, but for the later ones nothing syncs and I get something like “No worker file mounts to sync”.
I tried to send my files via the worker_setup_commands, copying from GCP storage with the gsutil tool, but that didn’t work; it seems the commands under this key are being ignored. In fact, I just noticed that the setup_commands are also being skipped, as the log says "No setup_commands to run".
What might be going wrong? How can I investigate further?
Versions / Dependencies
Ray Tune, Serve 2.37.0
Reproduction script
I guess trying to reproduce any of the tutorials would hit the same issue. I don't think it is code-related (on my part); otherwise no worker node would sync, but the first ones do.
Issue Severity
High: It blocks me from completing my task.