Open hobbsh opened 5 years ago
Looks to be related to issue #89 in terraform-aws-eks
This actually is not related to protect_from_scale_in but to how cluster-autoscaler is designed to work. Previously, we could set maxSize to 0 and all instances in the ASG would be terminated. cluster-autoscaler gets confused when minSize and maxSize are both set to zero on an ASG and does nothing.
To work around this (and as part of the process anyway), draining the node group and then setting that node group's minSize to 0 in the ASG will make cluster-autoscaler terminate those nodes. cluster-autoscaler will either need to respect a maxSize of 0 or look for some tag on an ASG to make this a one-step process.
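For reference, the two-step workaround can be sketched roughly like this with kubectl and the AWS CLI (the node label and ASG name are placeholders, not anything from this repo):

```shell
# Sketch of the two-step workaround; "group=blue" and the ASG name
# are placeholder values.
# 1. Drain every node in the old group so workloads reschedule elsewhere.
for node in $(kubectl get nodes -l group=blue -o name); do
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
done

# 2. Drop the ASG's minSize to 0 so cluster-autoscaler is allowed to
#    scale the now-empty group down (leave maxSize alone).
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name blue-workers-asg \
  --min-size 0
```

On older kubectl versions the drain flag is `--delete-local-data` rather than `--delete-emptydir-data`.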
@hobbsh,
Did you make any more progress on this? I am currently running into this and it would be great to hear if you have come up with a better solution!
The thing that confuses me is how to tell the AS to use only the new group once I drain the old group. Currently I am doing the following:
Another idea I have is to just do the following, which in theory should stop the blue group being used at all during the drain process:
Any thoughts from your experience with this?
@stefansedich This issue is tracking the root of the problem, which is that cluster-autoscaler does not respect max_size 0 (and in fact gets totally confused by it). I should rename this issue because protect_from_scale_in is not really the problem.
The way I have been doing it at the moment is basically the first scenario you mentioned: bring up the new green group, drain blue (cluster-autoscaler eventually reaps those nodes as unneeded), then finally set blue's min_size to 0 (you should actually leave max_size alone if you're using cluster-autoscaler). You can lower cluster-autoscaler's unneeded-node termination time if you want them terminated faster.
If you're not using cluster-autoscaler, I was not able to find a good solution for automatically draining the nodes either. I think cluster-autoscaler is much better set up to drain automatically; it's just that there doesn't seem to be a way to force a node group to terminate. Unfortunately, the cluster-autoscaler issue mentioned above has gone cold.
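For reference, the "unneeded node termination time" mentioned above maps to cluster-autoscaler's scale-down flags; a sketch of the relevant ones (the values shown are illustrative, not recommendations):

```shell
# Illustrative cluster-autoscaler invocation showing the flags that control
# how quickly drained/empty nodes are reaped; both default to 10 minutes.
cluster-autoscaler \
  --cloud-provider=aws \
  --scale-down-unneeded-time=2m \
  --scale-down-delay-after-add=2m
```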
Thanks!
How do you stop the CA creating new nodes in the old blue group while you cut over? I guess you dial down the CA until step #5 is complete and the old blue group has had min and max set to 0?
I guess you would have to prevent that by setting a sensible max_size. It's possible new nodes would get created in blue, but it's unlikely unless you have aggressive scaling parameters set in cluster-autoscaler. I'm not positive on this, but the scheduler should prioritize the new node group anyway since its utilization would be much lower, although there are likely a lot of cases that could disrupt that. Are you actually running into that issue or is it theoretical at this point?
Just theory at this point @hobbsh but looking at the AS defaults it looks like "random" placement is the default.
I think I am just going to attempt a script that simply removes instance protection on the nodes in the group being removed, so I can then terminate them directly, and I'll look into the issue you logged against the AS at some point to see if it can be made to respect max=0 nicely.
Removing protect_from_scale_in is not recommended as it allows both the ASG and CA to perform autoscaling duties on the same group. Maybe it's not a big deal for you, but I just wanted to let you know.
@hobbsh correct, I mean to only do it at the point I have set the old groups min and max to 0, the nodes are drained and I want them to go away for good.
Because from what I understand setting max to 0 will make AS ignore that group anyway, so all I want to do at that point is nuke those nodes.
Setting max_size to 0 will make CA ignore it because it gets confused and is not expecting that value to be 0. The AWS ASG will respect it when you take off instance protection like you said and that is a way of terminating the node group.
I've found it's easier to just wait for CA to terminate the nodes after they are drained. It will take ~10-15 minutes for the nodes to be terminated by default unless you tweak some of the CA scale-down args.
Whatever works for you though! I'll be continuing to find a better (one-step) way of doing this.
Keep me posted if you find anything better!
I just whipped up a script to do the drain and then remove protection upon user confirmation and it appears to work nicely!
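Something along these lines, presumably (a rough sketch, not Stefan's actual script; the node label and ASG name are placeholders):

```shell
# Sketch of drain-then-unprotect with a confirmation step.
# "group=blue" and the ASG name are placeholder values.
ASG_NAME="blue-workers-asg"

# Drain the nodes in the old group first.
for node in $(kubectl get nodes -l group=blue -o name); do
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
done

read -r -p "Remove scale-in protection and let the ASG terminate them? [y/N] " ok
if [ "$ok" = "y" ]; then
  # Collect the instance IDs still registered in the ASG...
  ids=$(aws autoscaling describe-auto-scaling-groups \
    --auto-scaling-group-names "$ASG_NAME" \
    --query 'AutoScalingGroups[0].Instances[].InstanceId' --output text)
  # ...and drop scale-in protection so the ASG can reap them.
  aws autoscaling set-instance-protection \
    --auto-scaling-group-name "$ASG_NAME" \
    --instance-ids $ids \
    --no-protected-from-scale-in
fi
```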
Hi @hobbsh , is there an update on this? Thanks
According to the way the terraform-aws-eks module wants autoscaling to work with cluster-autoscaler, this flag is needed to prevent the ASG from doing any scaling: https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/docs/autoscaling.md
A few possible workarounds: completely remove the flag, or use a null_resource with a local-exec provisioner to force-delete the instances when the ASG is scaled down.
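The local-exec provisioner could shell out to something like the following (a sketch; the ASG name is a placeholder). Note that plain EC2 termination is documented as not being blocked by ASG scale-in protection, which is what makes this viable:

```shell
# Sketch of what a null_resource local-exec might run after the ASG has
# been scaled to 0: force-terminate any instances left behind.
ASG_NAME="blue-workers-asg"  # placeholder

ids=$(aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names "$ASG_NAME" \
  --query 'AutoScalingGroups[0].Instances[].InstanceId' --output text)

if [ -n "$ids" ]; then
  # TerminateInstances is not subject to ASG scale-in protection.
  aws ec2 terminate-instances --instance-ids $ids
fi
```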