skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

[Feature] Autostop definition in YAML? #3953

Open jucor opened 1 month ago

jucor commented 1 month ago

Would it be possible to add an autostop field to the YAML used to create a cluster, please? I do not see one documented in https://github.com/skypilot-org/skypilot/blob/master/docs/source/reference/yaml-spec.rst

The current way to set autostop seems to require running a separate command after launch. Guess who forgot to run it a month ago and just got hit by a multi-hundred dollar bill from EC2? 💸 💸 💸

Adding autostop to the cluster definition would save me (and probably others) from shooting ourselves in the foot again. Thanks!

jucor commented 1 month ago

I note that #1435 raised a similar need and suggested another approach: setting a default value for autostop. Both of these issues (and my 500 USD wasted EC2 bill 😢) show there's a need for it.
@Michaelvll, do you know if this is possible, please?

Michaelvll commented 1 month ago

Good point! Sorry to hear about the wasted bill. We debated for a while whether to include the autostop setting in the task YAML, since it is more about the config for a cluster than about the task itself. One proposal is to have autostop_after_idle_minutes: under the resources section of the task YAML. Wdyt?

jucor commented 1 month ago

The proposed autostop_after_idle_minutes in the YAML file seems, to me, a perfect solution for this need!

I did think about task vs. cluster: while it's true that this is a cluster parameter, a common (and simple) usage pattern, even used in the Quickstart, is to define a task in a YAML file and run sky launch on it, which automatically provisions a cluster for it. To be honest, I'm not even sure how I would create a cluster without a task description (except by using command-line arguments, but that's more cumbersome than defining a YAML for a task). That's how easy it is to have ad-hoc clusters :)

As for the wasted bill, ah well, that's one of the risks of using the cloud: we all have been or will be hit by such a mistake at some point in our careers 😬 🤷 You provide powerful tools, you're not responsible for how I (fail to) use them 😆 But I do appreciate your openness to making it easier!

Michaelvll commented 1 month ago

A better interface might be:

resources:
  autostop:
    down: true
    idle_minutes: 10

With this it can support autodown as well.
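
To make the mapping concrete, here is a minimal Python sketch of how such a block could translate into flags that sky launch already accepts on the command line (`--idle-minutes-to-autostop` and `--down`, as far as I know). The `autostop` key is only the proposal from this thread, not an existing YAML field, and the `autostop_flags` helper and `task.yaml` file name are hypothetical, made up for illustration.

```python
from typing import Any, Dict, List


def autostop_flags(resources: Dict[str, Any]) -> List[str]:
    """Turn a hypothetical `resources.autostop` mapping into `sky launch` flags."""
    autostop = resources.get("autostop")
    if autostop is None:
        return []  # nothing requested, so no extra flags

    flags: List[str] = []
    idle = autostop.get("idle_minutes")
    if idle is not None:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(idle, bool) or not isinstance(idle, int) or idle < 0:
            raise ValueError(f"idle_minutes must be a non-negative integer, got {idle!r}")
        flags += ["--idle-minutes-to-autostop", str(idle)]
    if autostop.get("down", False):
        flags.append("--down")  # tear the cluster down instead of only stopping it
    return flags


# Dict form of the YAML proposed above (hypothetical schema).
proposed = {"autostop": {"down": True, "idle_minutes": 10}}
print(" ".join(["sky", "launch", "task.yaml", *autostop_flags(proposed)]))
# prints: sky launch task.yaml --idle-minutes-to-autostop 10 --down
```

A task YAML without the block would produce no extra flags, so existing files would behave as they do today under this sketch.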

jucor commented 1 month ago

Sounds perfect too! Whichever is easiest for you to implement, really.