jtv8 opened 7 years ago
This is as designed. `--spare-agents` instructs the AS to not scale in below the specified number. `--over-provision` instructs the AS to scale out an additional number of nodes when scaling is necessary (for example, if you know that when your load starts picking up, it grows very fast).

For example, I know of some people using this autoscaler who set `--spare-agents` to 20 but don't want the AS to scale to 20 when the number of VMs is under that number. Say an admin deleted most of the VMs because he knows there won't be any load during the weekend, but load might be highly unpredictable during the week, so he doesn't want the AS to make any decision by itself under 20 nodes.
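To make the distinction between the two flags concrete, here is a minimal sketch (hypothetical function names, not the autoscaler's actual code) of how each one affects a scaling decision:

```python
def scale_in_target(nodes_needed, spare_agents):
    # --spare-agents only acts as a floor during scale-IN decisions:
    # the pool is never shrunk below it, but it never triggers a scale-out.
    return max(nodes_needed, spare_agents)

def scale_out_target(nodes_needed, over_provision):
    # --over-provision adds a buffer on top of what pending pods require,
    # so a fast-growing load has headroom before the next scaling loop.
    return nodes_needed + over_provision
```

With `--spare-agents=2`, a scale-in that would otherwise target 1 node stops at 2; with `--over-provision=2`, a scale-out that needs 3 nodes provisions 5.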
I guess a good solution would be to add another parameter such as `--force-spare` to instruct the AS to make sure the cluster is always at least `--spare-agents` in size.

For now, a very dirty workaround is to manually create as many pending pods as needed to force the AS to scale up to `--spare-agents`. The AS will then never go under this number unless you manually remove some VMs.
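The sizing for that workaround can be sketched as follows (illustrative only): create one placeholder pod per missing node, with each pod requesting close to a node's allocatable CPU/memory so no two placeholders can be co-scheduled on the same node.

```python
def placeholder_pod_count(current_nodes, spare_agents):
    """How many node-sized 'pause' pods to create so the autoscaler
    is forced to scale the pool up to --spare-agents.

    Assumes each placeholder pod requests nearly a full node of
    resources, so every pod that stays pending forces one new node.
    """
    return max(spare_agents - current_nodes, 0)
```

For example, with 3 nodes and `--spare-agents=20`, 17 such pods would be needed; once the pool is already at or above the floor, none are.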
I've run into strange behavior that seems connected to this issue: when scaling in from 5 nodes with `--spare-agents=2`, I've had 2 nodes deleted in the same scaling loop, causing the cluster to stay with 1 node. I'm attaching the logs from the scaling event that caused this: https://gist.github.com/yaron-idan/be4784fb4a874331bd9b4850cb6eeac8

This happened when using `"orchestratorRelease": "1.8"` in the acs-engine config; is this version compatible with and tested against the autoscaler?
At present, if the user supplies either the `--spare-agents` or `--over-provision` parameter, the autoscaler does not provision the requested nodes unless there is at least one pending pod. This is important, as a cluster admin may choose to use these parameters as overrides to cover scenarios that the scaler does not know about, for example if the admin knows that an application requires a minimum number of agents due to anti-affinity rules (see https://github.com/wbuchwalter/Kubernetes-acs-engine-autoscaler/issues/65).

Possible cause: this appears to be because this logic is only processed as part of the `fulfill_pending` method in `autoscaler/scaler.py`, which only gets run when the set of pending pods is non-empty.
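A simplified sketch of that control flow (hypothetical names and a crude capacity model, not the real `scaler.py`) shows why the floor is only honored while pods are pending: the `max(..., spare_agents)` clamp lives inside `fulfill_pending`, and the main loop skips that method entirely when nothing is pending.

```python
def scaling_loop(pending_pods, current_nodes, spare_agents):
    if pending_pods:
        return fulfill_pending(pending_pods, current_nodes, spare_agents)
    # No pending pods: fulfill_pending never runs, so a pool an admin
    # shrank below --spare-agents is left alone.
    return current_nodes

def fulfill_pending(pending_pods, current_nodes, spare_agents):
    # The --spare-agents floor is only applied here.
    needed = nodes_needed_for(pending_pods, current_nodes)
    return max(needed, spare_agents)

def nodes_needed_for(pending_pods, current_nodes, pods_per_node=10):
    # Crude capacity model purely for illustration: one extra node
    # per pods_per_node pending pods (ceiling division).
    extra = -(-len(pending_pods) // pods_per_node)
    return current_nodes + extra
```

In this model, a 1-node pool with `--spare-agents=2` stays at 1 node while the pending set is empty, but is brought up to the floor as soon as a single pod goes pending; a fix along the lines of `--force-spare` would apply the clamp in the main loop regardless of pending pods.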