jrasell / sherpa

Sherpa is a highly available, fast, and flexible horizontal job scaler for HashiCorp Nomad. It is capable of running in a number of different modes to suit different requirements, and can scale based on Nomad resource metrics or external sources.
Mozilla Public License 2.0

Add cooldown functionality to job group scaling #49

Closed jrasell closed 4 years ago

jrasell commented 4 years ago

Describe the solution you'd like. Cooldown is an autoscaling feature that helps ensure previous scaling activities have had a chance to affect the load on an application before another scaling event is triggered. The recent addition of scaling event state tracking now allows Sherpa to include cooldowns within its scaling decision tree.

AWS - https://docs.aws.amazon.com/autoscaling/ec2/userguide/Cooldown.html
Google - https://cloud.google.com/compute/docs/autoscaler/
Microsoft - https://docs.microsoft.com/en-us/azure/azure-monitor/platform/autoscale-virtual-machine-scale-sets?toc=https%3A%2F%2Fdocs.microsoft.com%2Fen-us%2Fazure%2Fvirtual-machine-scale-sets%2FTOC.json&bc=https%3A%2F%2Fdocs.microsoft.com%2Fen-us%2Fazure%2Fbread%2Ftoc.json

numiralofe commented 4 years ago

hi all,

Exactly. I would like to add that with legacy apps, those that take more than 60 seconds to start :(, what I have observed is that even before the first scaling action has completed (app up and running and receiving traffic), Sherpa is already firing up another instance of the app. In my case, I also have an autoscaling group for the Nomad client cluster nodes, and this causes a collateral issue: new Nomad client nodes are also created by the cloud provider's scaler because of the CPU spike caused by multiple instances of the app all starting almost simultaneously :)

P.S. - instead of a time counter like the one mentioned in the AWS documentation, for jobs that have a service health check, Sherpa could wait until the health check returns an "ok" state before proceeding to the next scaling event.

jrasell commented 4 years ago

@numiralofe as an initial workaround, it might be useful to set the --autoscaler-evaluation-interval a little higher to compensate for the issues you're seeing.

With regard to the design, I agree that monitoring the health would be the ideal way to perform this, but I think it will be out of scope for the initial work. Once I have some of the basic functionality going, I'll be sure to open follow-up issues to track this option.