Open jucor opened 1 month ago
I note that #1435 got hit with a similar need, and suggested another way: setting a default value for autostop. Both of these issues (and my 500 USD wasted EC2 bill 😢) show there's a need for it.
@Michaelvll, do you know if this is possible, please?
Good point! Sorry to hear about the wasted bill. We debated for a while whether to include the autostop setting in task YAML, since it is more about the task itself than the config for a cluster. One proposal is to have autostop_after_idle_minutes: under the resources section for the task YAML. Wdyt?
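A minimal sketch of what this proposal could look like in a task file. The autostop_after_idle_minutes key is only the suggestion above, not an existing field, and the cloud, command, and 10-minute value are placeholders for illustration:

```yaml
# Hypothetical task YAML: autostop_after_idle_minutes is a proposed key,
# not part of the current spec.
resources:
  cloud: aws                        # placeholder cloud
  autostop_after_idle_minutes: 10   # proposed: stop the cluster after 10 idle minutes

run: |
  python train.py                   # placeholder command
```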
The proposed autostop_after_idle_minutes in the YAML file seems, to me, a perfect solution for this need!
I did think about the task vs cluster: while it's true that this is a cluster parameter, a common (and simple) usage pattern, even used in the Quickstart, is to define a task in a YAML and use sky launch on it, which automatically provisions a cluster for it.
To be honest, I'm not even sure how I would create a cluster without a task description (except by using command line arguments, but that's more cumbersome than defining a YAML for a task). That's how easy it is to have ad-hoc clusters :)
As to the wasted bill, ah well, that's part of the risks of using the cloud: we all have been or will be hit by such a mistake once in our job 😬 🤷 You provide powerful tools, you're not responsible for how I (fail to) use it 😆 But I do appreciate your receptivity to helping make it easier!
A better interface might be that:
resources:
  autostop:
    down: true
    idle_minutes: 10
With this it can support autodown as well.
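To make the difference concrete, here is the same proposed block with down switched off, which under this suggestion would presumably stop the idle cluster rather than tear it down. Both keys are only the proposal from this thread, and the meaning of down: false is an assumption:

```yaml
# Hypothetical: proposed autostop block, stop-only variant.
resources:
  autostop:
    down: false        # assumed meaning: stop the cluster, keep it for a later restart
    idle_minutes: 10   # idle time before the action triggers
```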
Sounds perfect too! Whichever is easiest for you to implement, really.
Is it possible to add an autostop field in the YAML for creating a cluster, please? I do not see any such way in https://github.com/skypilot-org/skypilot/blob/master/docs/source/reference/yaml-spec.rst
The current way to implement autostop seems to require running a command after launch. Guess who forgot to run it a month ago and just got hit by a multi-hundred dollar bill from EC2? 💸 💸 💸
Adding autostop in the cluster definition would save me (and probably others) from shooting ourselves in the foot again. Thanks!