skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

[Core] Skip worker ray start for multinode #4390

Open Michaelvll opened 17 hours ago

Michaelvll commented 17 hours ago

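If the Ray cluster is already healthy, this PR skips the `ray start` command on the worker nodes. For an N-node cluster that saves N-1 SSH connections, each taking about 2 seconds, so the wall-clock saving is roughly 2(N-1) seconds divided by the SSH parallelism.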
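As a rough illustration of the saving described above, the arithmetic can be sketched as follows. This is not code from the PR: the function name `estimated_seconds_saved` and its parameters are hypothetical, and the 2-second default is simply the per-connection figure quoted in the description.

```python
def estimated_seconds_saved(num_nodes: int,
                            parallelism: int,
                            seconds_per_ssh: float = 2.0) -> float:
    """Estimate wall-clock time saved by skipping `ray start` on workers.

    Hypothetical helper for illustration only; it mirrors the estimate in
    the PR description: (N - 1) SSH connections, each costing
    `seconds_per_ssh`, divided by the SSH parallelism.
    """
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    if parallelism < 1:
        raise ValueError("parallelism must be at least 1")
    # The head node is not counted: only the N - 1 workers are skipped.
    num_workers = num_nodes - 1
    return num_workers * seconds_per_ssh / parallelism


if __name__ == "__main__":
    # An 8-node cluster with 4 parallel SSH connections.
    print(estimated_seconds_saved(num_nodes=8, parallelism=4))
```

Under these assumptions a single-node cluster saves nothing, since there are no workers to skip, and the saving grows linearly with the number of workers.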

This optimization comes from #4389

Tested (run the relevant ones):