hobbsh / blue-green-eks-worker

Some documentation and code for managing blue/green EKS workers

cluster-autoscaler not terminating nodes when max_size is set to 0 #1

Open hobbsh opened 5 years ago

hobbsh commented 5 years ago

According to the way the terraform-aws-eks module expects autoscaling to work with cluster-autoscaler, the protect_from_scale_in flag is needed to prevent the ASG from doing any scaling of its own: https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/docs/autoscaling.md

A few possible workarounds could be to completely remove the flag, or to use a null_resource with a local-exec provisioner to force-delete the instances when the ASG is scaled down.
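
For the local-exec route, a rough sketch of what such a provisioner might run, assuming the AWS CLI is configured ("blue-workers" is a placeholder ASG name):

```bash
#!/usr/bin/env bash
# Hypothetical force-delete helper a null_resource local-exec could invoke.
# "blue-workers" is a placeholder ASG name.
set -euo pipefail
ASG_NAME="blue-workers"

# Find the instances still attached to the scaled-down ASG.
INSTANCE_IDS=$(aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names "$ASG_NAME" \
  --query 'AutoScalingGroups[0].Instances[].InstanceId' --output text)

for id in $INSTANCE_IDS; do
  # Strip scale-in protection first so the terminate call isn't refused.
  aws autoscaling set-instance-protection \
    --auto-scaling-group-name "$ASG_NAME" \
    --instance-ids "$id" --no-protected-from-scale-in
  # Terminate and decrement desired capacity so the ASG doesn't replace it.
  aws autoscaling terminate-instance-in-auto-scaling-group \
    --instance-id "$id" --should-decrement-desired-capacity
done
```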

hobbsh commented 5 years ago

Looks to be related to issue #89 in terraform-aws-eks

hobbsh commented 5 years ago

This actually is not related to protect_from_scale_in but to how cluster-autoscaler is designed to work. Previously, we could set maxSize to 0 and all instances in the ASG would be terminated. cluster-autoscaler gets confused when minSize and maxSize are both set to zero on an ASG and does nothing.

To work around this (and as part of the process anyway), draining the node group and then setting that node group's minSize to 0 in the ASG will make cluster-autoscaler terminate those nodes. cluster-autoscaler would either need to respect a maxSize of 0 or look for some tag on the ASG to make this a one-step process.
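
As a rough sketch of that two-step workaround (the label selector and ASG name are placeholders; this assumes blue nodes carry a node-group=blue label set via kubelet args):

```bash
#!/usr/bin/env bash
# Sketch: drain the blue nodes, then let the ASG shrink to zero so
# cluster-autoscaler can reap them as unneeded.
# "node-group=blue" and "blue-workers" are placeholder names.
set -euo pipefail

# 1. Drain every node in the blue group.
for node in $(kubectl get nodes -l node-group=blue -o jsonpath='{.items[*].metadata.name}'); do
  kubectl drain "$node" --ignore-daemonsets --delete-local-data
done

# 2. Drop min_size to 0; leave max_size alone so cluster-autoscaler
#    is not confused by a max of 0.
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name blue-workers --min-size 0
```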

stefansedich commented 5 years ago

@hobbsh,

Did you make any more progress on this? I am currently running into it and it would be great to hear if you have come up with a better solution yet!

The thing that confuses me is: how do I tell cluster-autoscaler to use only the new group once I drain the old group? Currently I am doing the following:

  1. Bring up green
  2. Drain blue
  3. Set blue ASG to min_size=0
  4. Wait for nodes to go away (see the sketch after this list)
  5. Set blue ASG min=0 max=0
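
A minimal sketch of the wait in step 4, assuming the placeholder node-group=blue label from above:

```bash
# Sketch: poll until all drained blue nodes have been terminated and
# deregistered. "node-group=blue" is a placeholder label.
while [ "$(kubectl get nodes -l node-group=blue --no-headers 2>/dev/null | wc -l)" -gt 0 ]; do
  echo "Waiting for blue nodes to terminate..."
  sleep 30
done
```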

Another idea I have is to just do the following, which in theory should stop the blue group from being used at all during the drain process:

  1. Bring up green and set blue min=0, max=0, autoscaling_enabled=false
  2. Drain blue nodes
  3. Remove instance protection for all nodes in the blue group and let the ASG take care of nuking the nodes for me (sketched below)
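
A minimal sketch of step 3, assuming the AWS CLI and a placeholder ASG name:

```bash
# Sketch: remove scale-in protection from every instance in the blue ASG
# so the ASG itself can terminate them. "blue-workers" is a placeholder.
ASG_NAME="blue-workers"

INSTANCE_IDS=$(aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names "$ASG_NAME" \
  --query 'AutoScalingGroups[0].Instances[].InstanceId' --output text)

# set-instance-protection accepts the whole instance list in one call.
aws autoscaling set-instance-protection \
  --auto-scaling-group-name "$ASG_NAME" \
  --instance-ids $INSTANCE_IDS \
  --no-protected-from-scale-in
```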

Any thoughts from your experience with this?

hobbsh commented 5 years ago

@stefansedich This issue (https://github.com/kubernetes/autoscaler/issues/1555) is tracking the root of the problem, which is that cluster-autoscaler does not respect a max_size of 0 (and in fact gets totally confused by it). I should rename this issue, because protect_from_scale_in is not really the problem.

The way I have been doing it at the moment is basically the first scenario you mentioned: bring up the new green group, drain blue (cluster-autoscaler eventually reaps those nodes as unneeded), then finally set blue's min_size to 0 (you should actually leave max_size alone if you're using cluster-autoscaler). You can lower cluster-autoscaler's unneeded-node termination time if you want the nodes terminated faster.
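
For reference, that knob is cluster-autoscaler's --scale-down-unneeded-time flag (10 minutes by default); a hedged example of lowering it, assuming CA runs as a deployment named cluster-autoscaler in kube-system:

```bash
# Sketch: shorten how long a node must sit "unneeded" before
# cluster-autoscaler scales it down (the default is 10m). Assumes CA
# runs as the "cluster-autoscaler" deployment in kube-system.
kubectl -n kube-system patch deployment cluster-autoscaler --type=json \
  -p '[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--scale-down-unneeded-time=2m"}]'
```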

If you're not using cluster-autoscaler, I was not able to find a good solution for automatically draining the nodes either. I think cluster-autoscaler is much better set up to drain automatically; it's just that there doesn't seem to be a way to force a node group to terminate. Unfortunately, the cluster-autoscaler issue mentioned above has gone cold.

stefansedich commented 5 years ago

Thanks!

How do you stop the CA from creating new nodes in the old blue group while you cut over? I guess you dial down the CA until step #5 is complete and the old blue group has had min and max set to 0?

hobbsh commented 5 years ago

I guess that's something you would have to set a sensible max_size to prevent. It's possible new nodes would get created in blue, but it's not likely unless you have aggressive scaling parameters set in cluster-autoscaler. I'm not positive on this, but the scheduler should prioritize the new node group anyway, since its utilization would be much lower, although there are likely a lot of cases that could disrupt that. Are you actually running into this, or is it theoretical at this point?

stefansedich commented 5 years ago

Just theory at this point @hobbsh, but looking at the CA defaults it looks like the 'random' expander is the default placement strategy.

I think I am just going to attempt a script that simply removes instance protection on the nodes in the group being removed, and at some point look into the issue you logged against cluster-autoscaler to see if it can be made to respect max=0 nicely. Then I can just do:

  1. Bring up green and set min and max to 0 for the blue group to effectively disable it
  2. Drain blue nodes
  3. Run script to remove instance protection on the blue nodes and let ASG handle termination

hobbsh commented 5 years ago

Removing protect_from_scale_in is not recommended, as it allows both the ASG and the CA to perform autoscaling duties simultaneously. Maybe it's not a big deal for you, but I just wanted to let you know.

stefansedich commented 5 years ago

@hobbsh Correct, I mean to do it only at the point where I have set the old group's min and max to 0, the nodes are drained, and I want them gone for good.

From what I understand, setting max to 0 will make the CA ignore that group anyway, so all I want to do at that point is nuke those nodes.

hobbsh commented 5 years ago

Setting max_size to 0 will make the CA ignore the group because it gets confused; it is not expecting that value to be 0. The AWS ASG will respect it once you take off instance protection, like you said, and that is one way of terminating the node group.

I've found it's easier to just wait for the CA to terminate the nodes after they are drained. In this case it will take ~10-15 minutes for the nodes to be terminated by default, unless you tweak some of the CA's scale-down args.

Whatever works for you though! I'll keep looking for a better (one-step) way of doing this.

stefansedich commented 5 years ago

Keep me posted if you find anything better!

I just whipped up a script that does the drain and then removes protection upon user confirmation, and it appears to work nicely!
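
For anyone landing here later, a rough sketch of what such a script might look like (all names are placeholders):

```bash
#!/usr/bin/env bash
# Rough sketch: drain the blue nodes, then, after user confirmation,
# remove scale-in protection so the min=0/max=0 ASG terminates them.
# "node-group=blue" and "blue-workers" are placeholder names.
set -euo pipefail
ASG_NAME="blue-workers"

for node in $(kubectl get nodes -l node-group=blue -o jsonpath='{.items[*].metadata.name}'); do
  kubectl drain "$node" --ignore-daemonsets --delete-local-data
done

read -r -p "Blue nodes drained. Remove scale-in protection? [y/N] " answer
[ "$answer" = "y" ] || exit 0

INSTANCE_IDS=$(aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names "$ASG_NAME" \
  --query 'AutoScalingGroups[0].Instances[].InstanceId' --output text)

aws autoscaling set-instance-protection \
  --auto-scaling-group-name "$ASG_NAME" \
  --instance-ids $INSTANCE_IDS \
  --no-protected-from-scale-in
```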

atamgp commented 4 years ago

Hi @hobbsh, is there an update on this? Thanks