wbuchwalter / Kubernetes-acs-engine-autoscaler

[Deprecated] Node-level autoscaler for Kubernetes clusters created with acs-engine.

Speeding up scaling operations #58

Open skuda opened 7 years ago

skuda commented 7 years ago

Hi,

This is not a bug; sorry, I couldn't find a better way to communicate this!

I have been using the autoscaler and it's working great, but it's somewhat slow: for us, adding new nodes takes approximately 10 minutes.

Maybe our use case is a bit special, but we usually have a very small load that sometimes goes up very fast. The specific service I am speaking about acts as a precomputed cache. If the cache is full, hits are very cheap; if the cache is purged, which happens 2 or 3 times per week, the load skyrockets for about 2 to 3 hours.

I understand that creating the nodes, installing everything and adding them to the cluster takes time, but I have been thinking: why not keep a specific number of pre-configured nodes that are only deallocated? It would be much faster to just bring existing servers online than to destroy them and recreate them from scratch every time, no?
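To make the idea concrete, here is a minimal sketch of the decision I have in mind, not how the autoscaler works today. The names `ScaleUpPlan` and `plan_scale_up` are hypothetical, and the sketch assumes the autoscaler already knows how many deallocated nodes exist; actually starting or creating the VMs is left out.

```python
from dataclasses import dataclass


@dataclass
class ScaleUpPlan:
    start_deallocated: int  # existing, pre-configured VMs to start again (fast path)
    create_new: int         # VMs to provision from scratch (slow path)


def plan_scale_up(nodes_needed: int, deallocated_available: int) -> ScaleUpPlan:
    """Use deallocated nodes first; create new ones only for the remainder."""
    if nodes_needed < 0 or deallocated_available < 0:
        raise ValueError("counts must be non-negative")
    start = min(nodes_needed, deallocated_available)
    return ScaleUpPlan(start_deallocated=start, create_new=nodes_needed - start)


if __name__ == "__main__":
    # Hypothetical numbers: 3 nodes needed, 2 deallocated nodes in the pool.
    print(plan_scale_up(nodes_needed=3, deallocated_available=2))
```

With a pool at least as large as the usual burst, `create_new` stays at zero and only the fast path is used.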

Best, Miguel.

wbuchwalter commented 7 years ago

Hi @skuda,

This is the best place to discuss this :) Keeping deallocated nodes is not a bad idea, but I think it will be quite complex to implement correctly. What would be an acceptable scaling time in your case?

oryagel commented 7 years ago

We would like to see something like that as well - reduce the starvation time. I was thinking of a different approach: just keep extra nodes alive. For example, if I configure extra-nodes=2, the autoscaler will always keep spare cores and memory that match the resources of two nodes.
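Roughly, the target node count could be computed like the sketch below. This is only an illustration under simplifying assumptions: all nodes are the same size, resources are summed rather than bin-packed per pod, and `Resources`, `desired_node_count` and `extra_nodes` are hypothetical names, not existing autoscaler options.

```python
import math
from dataclasses import dataclass


@dataclass
class Resources:
    cpu_cores: float
    memory_gib: float


def desired_node_count(requested: Resources, node_size: Resources, extra_nodes: int) -> int:
    """Nodes needed to fit the requested resources, plus `extra_nodes` of spare capacity."""
    if node_size.cpu_cores <= 0 or node_size.memory_gib <= 0:
        raise ValueError("node size must be positive")
    if extra_nodes < 0:
        raise ValueError("extra_nodes must be non-negative")
    # Whichever resource is tighter (CPU or memory) decides the base node count.
    by_cpu = math.ceil(requested.cpu_cores / node_size.cpu_cores)
    by_memory = math.ceil(requested.memory_gib / node_size.memory_gib)
    return max(by_cpu, by_memory) + extra_nodes


if __name__ == "__main__":
    # Hypothetical numbers: pods request 10 cores / 20 GiB, nodes have 4 cores / 16 GiB.
    requested = Resources(cpu_cores=10, memory_gib=20)
    node_size = Resources(cpu_cores=4, memory_gib=16)
    print(desired_node_count(requested, node_size, extra_nodes=2))  # 3 for CPU + 2 spare = 5
```

The trade-off is that the spare nodes are paid for all the time, in exchange for new pods being scheduled immediately while replacement capacity is created in the background.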

skuda commented 7 years ago

@wbuchwalter For me, 2 or even 3 minutes would be fast enough.

alexquintero commented 6 years ago

I'm not sure this is the fault of the autoscaler itself but rather a function of how long it takes for Azure to spin up a VM for your cluster. In my general tests it takes anywhere from 7-13 minutes to get a new VM in an Availability Set. This is in westus, by the way. I wonder if the VM scale-up time differs by region?

I agree with @skuda that a shorter VM spin-up time of a few minutes would be ideal. I personally don't think this is possible without using VM Scale Sets. Maybe when those are supported by acs-engine we will get the shorter spin-up time.

Or... when we are able to use a stable ACI connector, or some other method of having a virtual kubelet with infinite capacity (serverless containers), then there shouldn't be any VM spin up time anymore.