Hot standby question - Githubissues

dpogorzelski commented 3 years ago

Hey everyone :) Is there an official way to define a scaling policy (horizontal scaling) in such a way that the number of nodes, at any given time, is N+Y? Here N is the actual number of nodes needed based on APM data and Y is the safety buffer to cover unexpected spikes. Y in this case is a fixed number like 1, so the autoscaler would always make sure the number of nodes is N+1, for example. Other buffering strategies might also be possible/desirable. Thanks :)
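To make the request concrete, here is a minimal sketch of two possible buffering strategies: a fixed buffer Y, and a percentage-based buffer as one example of an "other" strategy. This is not an existing autoscaler feature; the function names `fixedBuffer` and `percentBuffer` are hypothetical and exist only for illustration.

```go
package main

import "fmt"

// fixedBuffer returns the needed node count N plus a constant standby Y.
// Hypothetical helper, not part of nomad-autoscaler.
func fixedBuffer(needed, standby int64) int64 {
	return needed + standby
}

// percentBuffer returns N plus pct percent of N, rounded up, so that any
// non-zero N with a non-zero pct gets at least one spare node.
// Hypothetical helper, not part of nomad-autoscaler.
func percentBuffer(needed, pct int64) int64 {
	extra := (needed*pct + 99) / 100 // integer ceiling of needed*pct/100
	return needed + extra
}

func main() {
	fmt.Println(fixedBuffer(4, 1))     // N=4, Y=1
	fmt.Println(percentBuffer(10, 25)) // N=10, 25% buffer rounded up
}
```

With N=4 and Y=1 the fixed strategy asks for 5 nodes; with N=10 and a 25% buffer the percentage strategy asks for 13.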

cgbaker commented 3 years ago

Hi @dpogorzelski, that's a cool use case. The current target-value plugin doesn't support that, but supporting additional capacity as a parameter seems like a simple concept. We'll keep this open for the roadmap, but interested parties should feel free to submit a PR.

artemantipov commented 2 years ago

Hello there, I was thinking about implementing this concept, and a couple of things came to mind that might make it less simple than it looks. For example, after we add the spare nodes to the delta capacity, should we exclude them from the APM calculation query? How could we do that, and would it work with other APMs such as Prometheus? If I'm overcomplicating things, could you please point me in the right direction, @cgbaker, so I can try to implement this and contribute the changes from our side?

leosunmo commented 2 years ago

Hey, I'm quite keen on reviving this issue. Have there been any developments in this area at all? If not, could anyone a bit more experienced with Nomad and the autoscaler give some pointers on where this should be implemented? I will have a crack at it as we need it, but I'd love to get some ideas of where to start. :)

Like Artem said above, there are a few places where we could put it, but it will probably cause issues I can't currently foresee.

lgfa29 commented 2 years ago

Hi @leosunmo and @artemantipov 👋

Thank you for the interest in taking on this, and apologies for the delay in getting back to you.

So I think this could be implemented in the runTargetScale function. There's quite a bit going on, but the BaseWorker is the component responsible for evaluating an expression, which means reading the target status, querying the APM for metrics, running the policy strategy, and, finally, applying the scaling action onto the target.

This final step is done by the runTargetScale function, and I think we could just add the Y @dpogorzelski mentioned. This value could be set in and read from the policy as a new configuration value. There's quite a bit of plumbing that needs to happen to get a new config from file all the way to the worker, but maybe https://github.com/hashicorp/nomad-autoscaler/pull/567 can serve as a guide.

Here are some diagrams that I sketched for an internal document; they may be handy for understanding how everything fits together:

Feel free to reach out if you have any more questions 🙂

lgfa29 commented 2 years ago

After answering this I came across https://github.com/hashicorp/nomad-autoscaler/issues/577 and made a change that would impact this work slightly.

Instead of changing runTargetScale as I mentioned before, you would apply the standby units in the new scaleTarget function.

One important consideration is how to apply the min and max limits. Would N+Y always be enforced to be within these limits, or would just N?

I think we would always want to be within [min:max], so this check needs to be included in this work, but I would be curious to hear your thoughts as well.
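A minimal sketch of that check, assuming the standby units are added to the strategy's result before clamping. The function `applyStandby` and its parameters are hypothetical and do not reflect the actual signature of scaleTarget:

```go
package main

import "fmt"

// applyStandby adds the standby units Y to the count N produced by the
// strategy, then clamps the sum to [minCount, maxCount].
// Hypothetical helper for illustration; not the real scaleTarget code.
func applyStandby(count, standby, minCount, maxCount int64) int64 {
	desired := count + standby
	if desired < minCount {
		return minCount
	}
	if desired > maxCount {
		return maxCount
	}
	return desired
}

func main() {
	fmt.Println(applyStandby(3, 1, 1, 10))  // within limits: N+Y is used as-is
	fmt.Println(applyStandby(10, 1, 1, 10)) // N+Y exceeds max: capped at max
}
```

One consequence of clamping N+Y rather than just N: once N reaches max, no standby capacity is left, so max would need to be sized with the buffer in mind.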

hashicorp / nomad-autoscaler

Hot standby question #426