Open jav-ed opened 11 months ago
@architkulkarni can you review and triage?
@anyscalesam Hello folks, we're facing the same issue. Any updates or suggestions for working around this? Thanks.
Hello folks, I have found that if we use ray stop --force in both the head and worker start commands, it seems to work.
Also, I had to follow https://github.com/ray-project/ray/issues/39565#issuecomment-1846595876 for the worker to start the next time if I had previously shut down the cluster (I have to manually bring the worker down for it to stop).
https://github.com/ray-project/ray/issues/46204 and https://github.com/ray-project/ray/issues/45571 seem related.
@millefalcon I tried using ray stop --force before the head and worker start commands as you suggested, but I haven't been able to set up the worker node. My yaml file is similar to the one presented in the issue. Can you share your yaml file and the steps that you performed?
@pratos I don't have the exact yaml at the moment, but it is mostly similar to example-full.yaml (local). The main difference was that I used ray stop --force instead of just ray stop for both the head node and the workers.
I followed the exact steps as mentioned here https://github.com/ray-project/ray/issues/39565#issuecomment-1846595876.
Note: In hindsight, it only worked intermittently. I had to write a wrapper script that SSHes into the worker nodes and runs ray stop;..ray start ... etc. to make it work reliably.
So I guess it didn't fully fix my issue, sorry.
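For anyone who wants to try the wrapper-script workaround described above, here is a minimal sketch of what such a script might look like. It is not the script used in this thread. The host names, the head address, and the function names are hypothetical placeholders, and it assumes passwordless SSH and that `ray` is on the PATH in a non-interactive SSH session on each worker.

```python
import shlex
import subprocess

# Hypothetical values: replace with your own worker hosts and head address.
WORKERS = ["user@192.168.0.11", "user@192.168.0.12"]
HEAD_ADDRESS = "192.168.0.10:6379"


def build_restart_command(worker: str, head_address: str) -> list[str]:
    """Build the ssh argv that force-stops Ray on a worker, then rejoins the head."""
    remote = f"ray stop --force; ray start --address={shlex.quote(head_address)}"
    return ["ssh", worker, remote]


def restart_workers(workers, head_address, dry_run=True):
    """Restart Ray on every worker; with dry_run=True only print the commands."""
    commands = [build_restart_command(w, head_address) for w in workers]
    for argv in commands:
        if dry_run:
            print(shlex.join(argv))
        else:
            # check=True raises CalledProcessError if ssh or the remote command fails.
            subprocess.run(argv, check=True)
    return commands


if __name__ == "__main__":
    # Dry run by default so nothing is executed until the hosts are confirmed.
    restart_workers(WORKERS, HEAD_ADDRESS, dry_run=True)
```

Running it with `dry_run=True` first prints the exact ssh commands, which makes it easier to check them by hand on one worker before letting the script run on all of them.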
What happened + What you expected to happen
I have multiple PCs that are connected and can be accessed easily through ssh. Manually logging into a PC (that is, a node) and setting it up as the head or a worker works fine. The issue arises when I try to do the very same thing using the config.yaml.
First, the manual procedure:
Now ssh into all the other machines that shall be the workers and run ray start --address=
Using ray status or viewing the dashboard, it can be observed that all the desired nodes are online.
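The same check can also be scripted, which helps when comparing the manual setup with the config.yaml one. The sketch below assumes it is run on a machine that is already part of the running cluster (so `address="auto"` can find it). The helper name `alive_node_addresses` is made up for illustration. `ray.nodes()` returns one dictionary per node, including the `Alive` and `NodeManagerAddress` keys.

```python
def alive_node_addresses(nodes):
    """Return the addresses of the nodes that Ray reports as alive."""
    return [n["NodeManagerAddress"] for n in nodes if n.get("Alive")]


def main():
    # Imported here so the helper above can be used without Ray installed.
    import ray

    # Connect to the already running cluster instead of starting a new one.
    ray.init(address="auto")
    addresses = alive_node_addresses(ray.nodes())
    print(f"{len(addresses)} node(s) alive: {addresses}")


if __name__ == "__main__":
    main()
```

If the number of alive nodes is lower than the number of machines that were started, the missing addresses show which workers did not join.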
Now this shall be replicated with a config.yaml. However, it only occasionally finds the workers; most of the time it does not.
Versions / Dependencies
Reproduction script
Please see the description above, that is the config.yaml
Issue Severity
Medium: It is a significant difficulty but I can work around it.