skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

[UX] Launch on existing cluster should be very fast #4157

Open Michaelvll opened 1 month ago

Michaelvll commented 1 month ago

A user reported that running sky launch on an existing cluster is very slow. The expected behavior is:

  1. If the cluster does not exist, provision the cluster and run the job.
  2. If the cluster exists, only run the job (like exec) and skip the time-consuming steps, including SkyPilot runtime setup, waiting for SSH, and user setup.
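
To make the expected behavior concrete, here is a minimal sketch of the dispatch in Python. It is not SkyPilot's implementation: launch_or_exec, existing_clusters, provision_and_run, and run_only are hypothetical names, and the two callables stand in for whatever the real launch and exec paths do.

```python
from typing import Callable, Set


def launch_or_exec(
    cluster_name: str,
    existing_clusters: Set[str],
    provision_and_run: Callable[[str], str],
    run_only: Callable[[str], str],
) -> str:
    """Pick the fast path when the cluster already exists.

    All parameters are hypothetical stand-ins for illustration:
    existing_clusters is the set of known cluster names,
    provision_and_run is the full path (provision, runtime setup,
    user setup, then job), and run_only is the exec-like path.
    """
    if cluster_name in existing_clusters:
        # Case 2: cluster exists, so skip provisioning and setup.
        return run_only(cluster_name)
    # Case 1: cluster does not exist, so take the full path.
    return provision_and_run(cluster_name)
```

Passing the two paths in as callables keeps the decision itself trivial to test without touching any cloud.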

Two ways to achieve this:

  1. Make sky launch very fast on an existing cluster by caching the cluster's current state and re-running setup only when the runtime is stale.
  2. Add an option to automatically use sky.exec when sky launch is run on an existing cluster.
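
As a rough illustration of the first approach, the sketch below caches a fingerprint per cluster and re-runs setup only when the fingerprint changes. Everything here is an assumption made for illustration: ClusterStateCache, fingerprint, and steps_to_run are hypothetical names, and treating "stale" as "the runtime version or setup script differs from what was last applied" is one possible definition, not the one SkyPilot uses. It also cannot detect changes made on the cluster itself.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ClusterStateCache:
    """Hypothetical cache: cluster name -> fingerprint last applied."""
    fingerprints: Dict[str, str] = field(default_factory=dict)


def fingerprint(runtime_version: str, setup_script: str) -> str:
    """Hash the inputs that (by assumption) determine whether setup is stale."""
    payload = json.dumps(
        {"runtime": runtime_version, "setup": setup_script}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def steps_to_run(
    cache: ClusterStateCache,
    cluster_name: str,
    runtime_version: str,
    setup_script: str,
) -> List[str]:
    """Return the steps needed, skipping setup when the cache is fresh."""
    current = fingerprint(runtime_version, setup_script)
    if cache.fingerprints.get(cluster_name) == current:
        # Nothing changed since the last launch: run the job only.
        return ["run_job"]
    # Unknown cluster or stale fingerprint: redo setup and record it.
    cache.fingerprints[cluster_name] = current
    return ["setup_runtime", "user_setup", "run_job"]
```

A second launch with the same runtime version and setup script returns only the run_job step, while changing either input triggers the full sequence again.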

cg505 commented 1 week ago

This is mostly solved by sky launch --fast. The flag is not turned on by default because it is very hard to tell when setup should be re-run. We could probably turn the provisioning short-circuit in #4289 on by default.