skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0
6.82k stars 513 forks

make --fast robust against credential or wheel updates #4289

Open cg505 opened 2 weeks ago

cg505 commented 2 weeks ago

With this change, --fast reprovisions the cluster when something important changes (such as new cloud credentials or a new version of the SkyPilot wheel).
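One way to picture the check described above is a fingerprint comparison: hash the inputs that matter, record the result when the cluster is provisioned, and reprovision when the recorded value no longer matches. The sketch below is illustrative only and is not taken from this PR. The names `fingerprint` and `needs_reprovision`, and the idea of hashing credential and wheel files directly, are assumptions made for the example.

```python
import hashlib
from pathlib import Path
from typing import Iterable, Optional


def fingerprint(paths: Iterable[Path]) -> str:
    """Hash the names and contents of the given files into one digest.

    Hypothetical helper: the inputs stand in for things like cloud
    credential files and the built wheel.
    """
    digest = hashlib.sha256()
    # Sort so the result does not depend on the order the caller used.
    for path in sorted(paths):
        digest.update(path.name.encode())
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


def needs_reprovision(stored: Optional[str], current: str) -> bool:
    """Reprovision when no fingerprint was recorded or when it changed."""
    return stored is None or stored != current
```

In this sketch, a fast path would be allowed to skip provisioning only when `needs_reprovision` returns False for the fingerprint recorded at the last successful provision.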

However, this causes a performance regression in the --fast case, since we now need to go through the initial provisioning stages. That costs about 4s on my machine. I am looking into optimizing this: most of the cost is an unnecessary round trip to check that the cluster is up.

Tested (run the relevant ones):