hashicorp / nomad-autoscaler

Nomad Autoscaler brings autoscaling to your Nomad workloads.
Mozilla Public License 2.0

Add ability to have different CoolUp vs CoolDown timeouts #272

Open urjitbhatia opened 4 years ago

urjitbhatia commented 4 years ago

Hi guys,

In our home-grown autoscaler, we have different cooldown policies for scaling up vs. scaling down, allowing our services to respond to rapid traffic spikes by scaling up much more aggressively than they scale down.

With a single value, the autoscaler is forced to flip-flop more when resource consumption sits right at the threshold, since users must pick a short cooldown to keep scale-ups fast. What do you guys think? Happy to send a quick PR if there is any appetite for this.
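To make the idea concrete, a policy could accept two cooldown values instead of one. The attribute names below (`cooldown_scale_up`, `cooldown_scale_down`) are hypothetical and not existing Autoscaler syntax; only `cooldown` exists today:

```hcl
scaling "example" {
  policy {
    # Today: one cooldown applies to both directions.
    # cooldown = "2m"

    # Hypothetical split knobs (not current Autoscaler syntax):
    cooldown_scale_up   = "30s" # react quickly to traffic spikes
    cooldown_scale_down = "10m" # scale in conservatively

    check "cpu" {
      source = "prometheus"
      query  = "avg(nomad_client_allocated_cpu)"

      strategy "target-value" {
        target = 70
      }
    }
  }
}
```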

cgbaker commented 4 years ago

hi @urjitbhatia , this sounds good to me! we'll put this on the list of features, and we would definitely consider a PR on this. thanks for your suggestion!

lgfa29 commented 4 years ago

Hi @urjitbhatia,

That's a good point and something we have plans to fix. The biggest issue right now is that when a cooldown is triggered, the goroutine that tracks the policy is blocked, so even if a load increase happens there's nothing watching for it. There's also no way right now to "wake up" the policy handler apart from shutting it down.

We will refactor the entire policy evaluation flow in a way that allows for this (and other) capabilities to be easily added, including different strategies to avoid count flapping.

But as @cgbaker mentioned, we would gladly consider a PR if you find a quick fix. I would just not try too hard right now as better things are coming 😄

elcomtik commented 2 years ago

any news?

lgfa29 commented 2 years ago

> any news?

No updates on this yet unfortunately. The improvement that I mentioned before turned out to be a bit more complex than we expected so it hasn't been implemented yet.

the-maldridge commented 2 years ago

This bit me earlier today. Is there a recommended way to do horizontal cluster scaling until this is fixed? Without this you wind up with either a cluster that has no capacity for an extended period of time, or one that thrashes machines.

lgfa29 commented 2 years ago

> This bit me earlier today. Is there a recommended way to do horizontal cluster scaling until this is fixed? Without this you wind up with either a cluster that has no capacity for an extended period of time, or one that thrashes machines.

@the-maldridge I can't think of anything in particular that would help with this 😞

I will try to revisit this in the coming days. The first problem is that the cooldown is checked in a few different places, so there's an unfortunate dynamic in how the sub-components interact with it.

Eyald6 commented 1 year ago

Any updates? Would really like to see this feature implemented, as scaling up should take top priority even if we scaled down recently.

jrasell commented 1 year ago

Hi @Eyald6, no updates currently. When we do have one, the engineer assigned will respond to this issue and assign themself.

peter-lockhart-pub commented 8 months ago

Is there a way of tracking how one feature is prioritised over another? I would also love this feature, but it has been approved for 3 years now. Scaling out on AWS instances can be quite slow (e.g. taking 12 minutes to boot a G instance), while terminating an instance via the autoscaler can be really quick (just a few minutes). On top of that, our infrastructure runs some large Unreal Engine/Windows containers whose boot times can be 20-30 minutes. Having a scaling policy use the slowest value (the scale-out timeout) makes scaling hard and wasteful, as you have to be more conservative.

Jamesits commented 7 months ago

Is there any progress on this issue? We are heavily blocked by this issue during a transition to Nomad.