hellofresh / eks-rolling-update

EKS Rolling Update is a utility for updating the launch configuration of worker nodes in an EKS cluster.
Apache License 2.0

Feature request - cordon one node at a time instead of all nodes #91

Closed infa-ddeore closed 3 years ago

infa-ddeore commented 3 years ago

with RUN_MODE=1 all old nodes are cordoned at the same time, which makes the AWS ELB mark the old nodes out of service. If the new nodes take a while to come into service, there are no healthy instances left for some time, which causes an outage.

we tried cordoning 1 node at a time and didn't see this issue. The downside is that a pod may bounce multiple times, because it can land on another old node while not all old nodes are cordoned yet; some people will be fine with one pod out of multiple replicas bouncing multiple times.

can we have a RUN_MODE 5 which is the same as RUN_MODE 1, except that it does "cordon 1 node --> drain 1 node --> delete 1 node" at a time instead of "cordon all nodes --> drain 1 node --> delete 1 node"?
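
For illustration, the per-node flow I'm asking for would look roughly like this if done by hand (node name is just a placeholder, and kubectl flags may vary by version):

```bash
# rough manual equivalent of the requested behaviour, one old node at a time
kubectl cordon ip-10-0-1-23.eu-west-1.compute.internal                      # only this node stops receiving new pods
kubectl drain ip-10-0-1-23.eu-west-1.compute.internal --ignore-daemonsets   # evict pods from this node only
# ...then terminate the underlying instance via its ASG, and repeat for the next old node
```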

chadlwilson commented 3 years ago

I believe this is already supported by #49 (this issue is, I believe, a dupe of #48), which replaces cordoning with tainting nodes when you set the env var TAINT_NODES=true, consistent with the workarounds suggested in https://github.com/kubernetes/kubernetes/issues/65013. Can you try that out?

I see now that TAINT_NODES seems to be missing from the docs; I can try and PR a docs fix to address that.
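
In the meantime, enabling it is just an extra environment variable when running the tool; something like the following (exact invocation per the README, cluster name is a placeholder):

```bash
export RUN_MODE=1
export TAINT_NODES=true          # taint old nodes instead of cordoning them all up front
python eks_rolling_update.py -c my-eks-cluster
```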

chadlwilson commented 3 years ago

Not everyone will see this problem: if you use an infra-managed load balancer (e.g. a Terraform-managed NLB) rather than a Service-managed load balancer, I don't believe this issue exists. Also, I understand the Kubernetes behaviour on cordon was changed to address this in 1.19, per https://github.com/kubernetes/kubernetes/pull/90823

infa-ddeore commented 3 years ago

thanks for the quick reply @chadlwilson. TAINT_NODES=true looks like a good option, will try that out. Ours is Kubernetes 1.15, so the cordon behaviour change won't help us.

infa-ddeore commented 3 years ago

@chadlwilson with TAINT_NODES=true, does it also cordon a single node at a time?

"taint all (all node stay into ELB) --> cordon one (single node goes out of ELB) --> drain and delete one node" is more graceful than letting ELB remove the node from service based on health threshold

if there is no cordon, the ELB will remove the node ungracefully during the node's termination

chadlwilson commented 3 years ago

It doesn't do any cordoning - it's an alternative strategy.

Interacting with LBs isn't the purpose of cordoning to my knowledge - cordoning is about preventing scheduling of new workloads. The effect on Service-managed LBs is an unintended side effect, which I believe is why they removed it in Kubernetes 1.19.
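
In other words, roughly (the taint key below is only an example, not necessarily the one the tool applies):

```bash
kubectl cordon <node>                                            # sets spec.unschedulable=true; on pre-1.19 clusters
                                                                 # the Service LB controller also pulls the node out of the LB
kubectl taint nodes <node> eks-rolling-update=true:NoSchedule    # blocks scheduling of new pods only;
                                                                 # the node stays registered with the LB
```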

The tool uses terminate_instance_in_auto_scaling_group to orchestrate termination of instances in an ASG-aware fashion, and thus ensures your target group deregistration delay is respected, allowing any remaining traffic to drain off the instance before it is actually terminated.
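
That call is roughly equivalent to this AWS CLI invocation (instance ID is a placeholder):

```bash
aws autoscaling terminate-instance-in-auto-scaling-group \
  --instance-id i-0123456789abcdef0 \
  --no-should-decrement-desired-capacity
# the ASG deregisters the instance from its target groups and waits out the
# deregistration delay / connection draining before the instance is terminated
```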

Perhaps you can try it out - I think you will find it does what you expect :-)

infa-ddeore commented 3 years ago

thanks for the explanation, it does exactly what we expect :-) I am closing this request, as the TAINT_NODES=true option does what we want.