urjitbhatia opened this issue 4 years ago
Hi @urjitbhatia, this sounds good to me! We'll put this on the list of features, and we would definitely consider a PR on this. Thanks for your suggestion!
Hi @urjitbhatia,
That's a good point and something we have plans to fix. The biggest issue right now is that when a cooldown is triggered, the goroutine that tracks the policy is blocked, so even if a load increase happens there's nothing watching for it. There's also no way right now to "wake up" the policy handler apart from shutting it down.
We will refactor the entire policy evaluation flow in a way that allows this (and other) capabilities to be easily added, including different strategies to avoid count flapping.
But as @cgbaker mentioned, we would gladly consider a PR if you find a quick fix. I would just not try too hard right now as better things are coming 😄
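For illustration, one way around the blocking behaviour described above is to wait out the cooldown in a select that can be interrupted, so the handler can be woken early when load changes or when it shuts down. The sketch below is only a minimal Go example of that idea; the channel plumbing and names are hypothetical and not the Nomad Autoscaler's actual internals.

```go
// Minimal sketch only: a cooldown wait that can be interrupted, so the
// policy handler is not blocked for the full cooldown period. The names
// and channel plumbing here are hypothetical, not the autoscaler's code.
package main

import (
	"fmt"
	"time"
)

// waitCooldown returns when the cooldown elapses, when the handler is
// woken early (e.g. a detected load increase), or when it is shut down.
func waitCooldown(cooldown time.Duration, wake, shutdown <-chan struct{}) string {
	timer := time.NewTimer(cooldown)
	defer timer.Stop()

	select {
	case <-timer.C:
		return "cooldown elapsed"
	case <-wake:
		return "woken early to re-evaluate the policy"
	case <-shutdown:
		return "shutting down"
	}
}

func main() {
	wake := make(chan struct{})
	shutdown := make(chan struct{})

	// Simulate a load spike arriving one second into a ten-second cooldown.
	go func() {
		time.Sleep(1 * time.Second)
		close(wake)
	}()

	fmt.Println(waitCooldown(10*time.Second, wake, shutdown))
}
```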
any news?
No updates on this yet unfortunately. The improvement that I mentioned before turned out to be a bit more complex than we expected, so it hasn't been implemented yet.
This bit me earlier today. Is there a recommended way to do horizontal cluster scaling until this is fixed? Without this you wind up with either a cluster that has no capacity for an extended period of time, or one that thrashes machines.
@the-maldridge I can't think of anything in particular that would help with this 😞
I will try to revisit this in the coming days. The first problem is that the cooldown is checked, somewhat inadvertently, in a few places, so there's an unfortunate dynamic in how the sub-components interact with it.
Any updates? We would really like to see this feature implemented, as scaling up should be the top priority, even if we scaled down recently.
Hi @Eyald6, no updates currently. When we do have one, the assigned engineer will respond to this issue and assign themselves.
Is there a way of tracking how one feature is prioritised over another? I would also love this feature, but it has been approved for 3 years now. Scaling out on AWS instances can be quite slow (booting a G instance can take around 12 minutes, for example), while terminating an instance via the autoscaler can be really quick (just a few minutes). On top of that, our infrastructure runs some large Unreal Engine/Windows containers whose boot times can be 20-30 minutes. Having a scaling policy use the slowest value (the scale-out timeout) makes scaling hard and wasteful, as you have to be more conservative.
Is there any progress on this? We are heavily blocked by this issue during our transition to Nomad.
Hi guys,
In our home-grown autoscaler, we have different cooldown policies for scaling up vs. scaling down, which allows our services to respond to rapid traffic spikes by scaling up much more aggressively than they scale down.
With a single cooldown value, an autoscaler is forced to flip-flop more when resource consumption sits right around the threshold: users have to pick a shorter cooldown to allow fast scale-ups, which then also lets scale-downs happen just as quickly. What do you think? Happy to send a quick PR if there is any appetite for this.
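To make the suggestion concrete, here is a minimal Go sketch of direction-aware cooldown selection. The policy fields and helper below (ScaleUpCooldown, ScaleDownCooldown, cooldownFor) are purely hypothetical illustrations of the idea, not the Nomad Autoscaler's actual API or configuration.

```go
// Illustrative sketch only: hypothetical policy fields and a helper that
// picks a cooldown based on the direction of the last scaling action.
// This is not the actual Nomad Autoscaler API.
package main

import (
	"fmt"
	"time"
)

type policy struct {
	Cooldown          time.Duration // existing single cooldown, used as fallback
	ScaleUpCooldown   time.Duration // hypothetical: cooldown after scaling up
	ScaleDownCooldown time.Duration // hypothetical: cooldown after scaling down
}

// cooldownFor returns the cooldown for a completed scaling action based on
// whether the target count went up or down, falling back to the single value.
func cooldownFor(p policy, previousCount, newCount int64) time.Duration {
	switch {
	case newCount > previousCount && p.ScaleUpCooldown > 0:
		return p.ScaleUpCooldown
	case newCount < previousCount && p.ScaleDownCooldown > 0:
		return p.ScaleDownCooldown
	default:
		return p.Cooldown
	}
}

func main() {
	p := policy{
		Cooldown:          5 * time.Minute,
		ScaleUpCooldown:   30 * time.Second, // react quickly to load spikes
		ScaleDownCooldown: 10 * time.Minute, // scale in conservatively
	}
	fmt.Println(cooldownFor(p, 3, 5)) // scale up   -> 30s
	fmt.Println(cooldownFor(p, 5, 3)) // scale down -> 10m0s
}
```

The single existing cooldown stays as a fallback, so policies that don't set the direction-specific values would keep their current behaviour.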