skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

[ux] cache cluster status of autostop or spot clusters for 2s #4332

Closed cg505 closed 6 days ago

cg505 commented 1 week ago

Previously, every time we wanted the status of a cluster with spot VMs or with autostop set, we fetched the latest actual status from the cloud. This is needed because these clusters can be terminated from "outside", which leaves the state in the local state database out of date.

However, we often end up fetching the status multiple times in a single invocation. For instance, sky launch checks the status in cli.py, then checks it again almost immediately afterward as part of the provisioning codepath.

To mitigate this, we can keep track of the last time we fetched the status from the cloud. If it is within the past 2 seconds, assume that it's still accurate (that is, the cluster hasn't been terminated/stopped since then).
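A minimal sketch of the idea, not the actual change in this PR: the names `ClusterRecord`, `status_updated_at`, `get_status`, and `fetch_from_cloud` are hypothetical and chosen only for illustration. It assumes the timestamp is stored next to the cached status and compared against a wall-clock reading.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# Freshness window taken from the description above (2 seconds).
STATUS_CACHE_SECONDS = 2.0


@dataclass
class ClusterRecord:
    """Hypothetical stand-in for a row in the local state database."""
    status: str
    # When the status was last fetched from the cloud; None means never.
    status_updated_at: Optional[float] = None


def get_status(
    name: str,
    records: Dict[str, ClusterRecord],
    fetch_from_cloud: Callable[[str], str],
    now: Callable[[], float] = time.time,
) -> str:
    """Return the cluster status, querying the cloud only if the cache is stale."""
    record = records[name]
    fetched_at = record.status_updated_at
    if fetched_at is not None and now() - fetched_at < STATUS_CACHE_SECONDS:
        # Fetched within the window: assume the cluster has not been
        # stopped or terminated since then.
        return record.status
    # Stale or never fetched: ask the cloud and record when we did.
    record.status = fetch_from_cloud(name)
    record.status_updated_at = now()
    return record.status
```

With this shape, two calls made back to back in one invocation trigger a single cloud query, while a call made more than 2 seconds later refreshes the status. The clock is passed in as a parameter so the window can be exercised in tests without sleeping.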

Caveats:

Performance impact:

Tested (run the relevant ones):