SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
If a cluster is mid-initialization, its status will be INIT and autostop/down will not be set yet. In this case, the cluster refresh won't actually grab the cluster status lock and hard refresh the status. So, check_cluster_available will immediately decide that the cluster is INIT and throw.
This could cause a bug where many staggered parallel invocations of sky launch --fast all decide that the cluster is INIT, and all decide that they need to launch the cluster. Since cluster initialization is locked with the cluster status lock, each invocation will synchronously re-launch the cluster, one after another.
Now, if we see that the cluster is INIT, we force a refresh. This will acquire the cluster status lock, which will block until any ongoing provisioning completes and the cluster is UP. If the cluster is otherwise INIT (e.g. ray cluster has been stopped abnormally) then provisioning should proceed as normal.
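The behavior before and after this change can be illustrated with a small toy model. This is only a sketch under stated assumptions: `ToyCluster`, `launch_fast`, `simulate`, and the `force_refresh_on_init` flag are hypothetical names invented for illustration, not SkyPilot's actual code, and a plain `threading.Lock` stands in for the cluster status lock.

```python
import threading


class ToyCluster:
    """Hypothetical stand-in for a cluster record (not SkyPilot's data model)."""

    def __init__(self):
        self.status = "INIT"  # what a mid-provisioning cluster reports
        self.status_lock = threading.Lock()  # models the cluster status lock
        self.provision_count = 0

    def provision(self, while_locked=None):
        # Provisioning holds the status lock from start until the cluster is UP.
        with self.status_lock:
            self.provision_count += 1
            if while_locked is not None:
                while_locked()  # test hook: keep the lock held for a while
            self.status = "UP"

    def refresh_status(self, force):
        if not force:
            return self.status  # cached read, never touches the lock
        with self.status_lock:  # blocks while provisioning is in flight
            return self.status


def launch_fast(cluster, force_refresh_on_init, after_cached_read=None):
    status = cluster.refresh_status(force=False)
    if after_cached_read is not None:
        after_cached_read()  # test hook, used only to make the race repeatable
    if status == "INIT" and force_refresh_on_init:
        # The fix: wait on the lock instead of trusting the cached INIT.
        status = cluster.refresh_status(force=True)
    if status != "UP":
        cluster.provision()


def simulate(force_refresh_on_init, followers=3):
    cluster = ToyCluster()
    # The first launcher keeps the lock until every follower has done its
    # cached read, which reproduces the staggered-launch scenario.
    barrier = threading.Barrier(followers + 1)
    threads = [threading.Thread(target=cluster.provision, args=(barrier.wait,))]
    threads += [
        threading.Thread(target=launch_fast,
                         args=(cluster, force_refresh_on_init, barrier.wait))
        for _ in range(followers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return cluster.provision_count
```

In this model, `simulate(force_refresh_on_init=False)` provisions once per invocation (4 times with the default 3 followers), while `simulate(force_refresh_on_init=True)` provisions exactly once. A cluster that is INIT with nobody holding the lock still gets provisioned, since the forced read returns INIT without blocking.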
This does not fix the race where the cluster does not exist or is STOPPED, and many simultaneously started sky launch --fast invocations try to create or restart the cluster. However, once the first batch completes its launches, all future invocations should correctly see the cluster as UP, not INIT - even if they are started while the first batch is still provisioning the cluster. Fixing the STOPPED or non-existent case is a bit more difficult and will probably require moving this detection logic inside the provisioning code, so that it holds the cluster status lock continuously from the status check until the cluster is UP.
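The possible follow-up described above amounts to doing the status check and the provisioning under one continuous lock hold. A minimal sketch of that idea follows. It is not part of this PR, and `LockedCluster`, `ensure_up`, and `run_parallel` are hypothetical names used only for illustration.

```python
import threading


class LockedCluster:
    """Hypothetical cluster record that starts out STOPPED."""

    def __init__(self):
        self.status = "STOPPED"
        self.status_lock = threading.Lock()  # models the cluster status lock
        self.provision_count = 0


def ensure_up(cluster):
    """Check the status and provision under a single lock hold.

    Returns True if this caller did the provisioning, False otherwise.
    """
    with cluster.status_lock:
        if cluster.status == "UP":
            return False  # someone else finished provisioning while we waited
        cluster.provision_count += 1  # stand-in for the real provisioning work
        cluster.status = "UP"
        return True


def run_parallel(n):
    cluster = LockedCluster()
    results = []

    def worker():
        results.append(ensure_up(cluster))

    threads = [threading.Thread(target=worker) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return cluster, results
```

Because no caller can observe STOPPED between another caller's check and its transition to UP, only one of the simultaneous invocations provisions, and the rest return as soon as they acquire the lock.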
Tested (run the relevant ones):
[x] Code formatting: bash format.sh
[ ] Any manual or new tests for this PR (please specify below)
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
conda deactivate; bash -i tests/backward_compatibility_tests.sh