SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
Introduce a "config hash": a deterministic hash of the Ray cluster config and all referenced file mounts. We expect this value to stay the same when launch is re-run on an existing cluster and nothing else has changed.
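A minimal sketch of how such a deterministic hash could be computed, assuming the cluster config is a YAML file and file mounts are local paths (the helper below is illustrative, not SkyPilot's actual implementation):

```python
import hashlib
import os


def compute_config_hash(cluster_yaml_path: str, file_mounts: list) -> str:
    """Deterministically hash the cluster config plus all file mounts."""
    h = hashlib.sha256()
    with open(cluster_yaml_path, 'rb') as f:
        h.update(f.read())
    # Sort everything so the hash does not depend on traversal order.
    for mount in sorted(file_mounts):
        if os.path.isfile(mount):
            paths = [mount]
        else:
            paths = []
            for root, dirs, files in os.walk(mount):
                dirs.sort()  # deterministic directory traversal
                paths.extend(os.path.join(root, name) for name in sorted(files))
        for path in paths:
            h.update(path.encode())  # include the path so renames change the hash
            with open(path, 'rb') as f:
                h.update(f.read())
    return h.hexdigest()
```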
Store the current config hash for each cluster in the global_user_state DB.
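And a sketch of persisting that hash per cluster, assuming a SQLite-backed state DB in the spirit of global_user_state (the table and column names here are hypothetical):

```python
import sqlite3
from typing import Optional

DB_PATH = 'state.db'  # placeholder path for this sketch

_TABLE_SQL = ('CREATE TABLE IF NOT EXISTS cluster_config_hashes '
              '(cluster_name TEXT PRIMARY KEY, config_hash TEXT)')


def store_config_hash(cluster_name: str, config_hash: str) -> None:
    conn = sqlite3.connect(DB_PATH)
    with conn:  # commits on success
        conn.execute(_TABLE_SQL)
        conn.execute(
            'INSERT OR REPLACE INTO cluster_config_hashes VALUES (?, ?)',
            (cluster_name, config_hash))
    conn.close()


def get_config_hash(cluster_name: str) -> Optional[str]:
    conn = sqlite3.connect(DB_PATH)
    conn.execute(_TABLE_SQL)
    row = conn.execute(
        'SELECT config_hash FROM cluster_config_hashes '
        'WHERE cluster_name = ?', (cluster_name,)).fetchone()
    conn.close()
    return row[0] if row else None
```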
Add a flag to provisioning that lets the provisioning path short-circuit if the calculated config hash matches the one already recorded for the cluster.
Use this new path for --fast.
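Conceptually, the short-circuit looks something like the following, reusing the get_config_hash/store_config_hash helpers sketched above (the flag and function names are hypothetical; --fast would wire skip_if_config_hash_matches=True):

```python
def provision(cluster_name: str, new_config_hash: str,
              skip_if_config_hash_matches: bool = False) -> None:
    if skip_if_config_hash_matches:
        stored = get_config_hash(cluster_name)
        if stored == new_config_hash:
            # Config, file mounts, credentials, and wheel are all unchanged:
            # skip the expensive provisioning stages entirely.
            return
    run_full_provisioning(cluster_name)
    store_config_hash(cluster_name, new_config_hash)


def run_full_provisioning(cluster_name: str) -> None:
    """Stand-in for the real provisioning stages."""
    print(f'Provisioning {cluster_name}...')
```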
The result is that --fast will reprovision the cluster if something important changes (such as new cloud credentials or a new version of the SkyPilot wheel).
However, this does cause a small performance regression in the --fast case, since we still need to go through the initial provisioning stages; that costs about 4s on my machine. I am looking into optimizing this - it is mostly an unnecessary round trip to check that the cluster is up.
Tested (run the relevant ones):
[x] Code formatting: bash format.sh
[x] Any manual or new tests for this PR (please specify below)
[x] Manually test that updating credential files is caught
[x] Manually test that new skypilot wheel is caught