SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
Previously, every time we wanted the status of a cluster with spot VMs or with autostop set, we would fetch the latest actual status from the cloud. This is needed because such clusters can be terminated from "outside", leaving the state in the local state database out of date.
However, we often end up fetching the status multiple times in a single invocation. For instance, `sky launch` checks the status in `cli.py`, then again almost immediately afterwards as part of the provisioning codepath.
To mitigate this, we can keep track of the last time we fetched the status from the cloud. If it was within the past 2 seconds, assume it is still accurate (that is, the cluster hasn't been terminated/stopped since then).
Caveats:
~~The updated-timestamp check/set is not atomic, so if multiple parallel invocations check the status, they may all see that it is out of date, and then all try to refresh the status. This is equivalent to the current behavior, but the optimization won't take effect in this case.~~
Edit: fixed this in the latest version.
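A per-process lock is one way to make the check-and-set atomic, so that only the first caller in each window performs the slow cloud fetch. This is only a sketch of the approach; the actual fix in this PR may differ:

```python
import threading
import time

_STALE_SECONDS = 2.0
_lock = threading.Lock()
_last_fetch_time = None  # time of the last (or claimed) cloud fetch


def should_refresh() -> bool:
    """Atomically decide whether this caller should fetch from the cloud.

    Parallel invocations race on the lock; only the first within each
    2-second window wins and goes on to perform the cloud fetch.
    """
    global _last_fetch_time
    with _lock:
        now = time.monotonic()
        if (_last_fetch_time is not None
                and now - _last_fetch_time < _STALE_SECONDS):
            return False  # someone refreshed (or claimed to) very recently
        _last_fetch_time = now  # claim the refresh before releasing the lock
        return True
```

Claiming the timestamp inside the critical section, before the fetch itself, is what prevents the "all callers see stale, all refresh" stampede.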
It is possible that the cluster is terminated or stopped during the up-to-2-second window between the real status fetch and our cached check. This could cause further operations (e.g. a job launch) to fail and potentially crash SkyPilot.
This race is already possible on master, since there is always some delay between when we check the status and when we launch a job (or do whatever else we want with the cluster). But the cache widens the potential race window by up to 2 seconds.
This could be fixed by changing the status check to also send some "intent to use" the cluster, which would atomically reset the idle time when it fetches the status.
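The "intent to use" proposal could, for instance, reset the autostop idle clock in the same critical section as the status read. This is purely a sketch of the idea; none of these names exist in SkyPilot:

```python
import threading
import time


class ClusterRecord:
    """Toy cluster record illustrating the 'intent to use' proposal."""

    def __init__(self, status: str = 'UP'):
        self._lock = threading.Lock()
        self.status = status
        self.idle_since = time.monotonic()

    def status_with_intent(self) -> str:
        """Atomically read the status and reset the autostop idle clock.

        Resetting inside the lock means the cluster cannot be autostopped
        between the status check and the operation that follows it.
        """
        with self._lock:
            self.idle_since = time.monotonic()
            return self.status
```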
Performance impact:
`sky launch --fast` skips one status check, saving ~3s (~10.5s -> ~7.5s when the cluster is already up).
`sky jobs launch --fast` sees the same saving, and this also mitigates the performance hit from #4289.
Tested (run the relevant ones):
[x] Code formatting: `bash format.sh`
[x] Manually ran `sky launch --fast` on many autostop clusters to try and make it fail.
[x] `pytest tests/test_smoke.py`
[x] `pytest tests/test_smoke.py --managed-jobs`
[x] `conda deactivate; bash -i tests/backward_compatibility_tests.sh`