Closed infa-ddeore closed 3 years ago
I believe this is already supported by #49 (this issue is, I believe, a dupe of #48), which replaces cordoning with tainting nodes when you set the env var TAINT_NODES=true, consistent with the workarounds suggested in https://github.com/kubernetes/kubernetes/issues/65013. Can you try that out?
I see now that TAINT_NODES seems to be missing from the docs; I can try to PR a docs fix to address that.
Not everyone will see this problem: if you use an infra-managed load balancer (e.g. a Terraform-managed NLB) rather than a Service-managed load balancer, I don't believe this issue exists. Also, I understand the Kubernetes behaviour on cordon has been changed to address this in 1.19, per https://github.com/kubernetes/kubernetes/pull/90823
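As a rough illustration of the difference between the two strategies (a sketch only - the taint key below is illustrative, not necessarily the one the tool actually uses), cordoning sets `spec.unschedulable` on the node object, which is what pre-1.19 Service controllers reacted to by deregistering the node, while tainting only adds a `NoSchedule` taint:

```python
# Sketch: the node patch a cordon applies vs. the patch TAINT_NODES=true
# conceptually applies. The taint key here is illustrative only.

def cordon_patch():
    """Cordon sets spec.unschedulable; before Kubernetes 1.19 this also
    caused Service-managed load balancers to deregister the node."""
    return {"spec": {"unschedulable": True}}

def taint_patch(existing_taints=None):
    """A NoSchedule taint blocks new pods without touching
    spec.unschedulable, so Service-managed LBs keep the node in rotation."""
    taints = list(existing_taints or [])
    taints.append({
        "key": "eks-rolling-update",  # illustrative key, not the tool's actual key
        "value": "true",
        "effect": "NoSchedule",
    })
    return {"spec": {"taints": taints}}

if __name__ == "__main__":
    print(cordon_patch())
    print(taint_patch())
```

Either patch could be applied with a standard strategic-merge patch against the node; the point is only that the taint leaves `spec.unschedulable` untouched.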
Thanks for the quick reply @chadlwilson. TAINT_NODES=true looks like a good option; we will try that out. Ours is a 1.15 Kubernetes cluster, so the cordon behaviour change won't help us.
@chadlwilson, with TAINT_NODES=true does it also cordon a single node? "Taint all (all nodes stay in the ELB) --> cordon one (a single node goes out of the ELB) --> drain and delete one node" is more graceful than letting the ELB remove the node from service based on its health threshold. If the cordon isn't there, the ELB will remove the node ungracefully during the node's termination.
It doesn't do any cordoning - it's an alternative strategy. Interacting with LBs isn't the purpose of cordoning to my knowledge; cordoning is about preventing scheduling of new workloads. The effect on Service-managed LBs is an unintended side effect, which I believe is why it was removed in Kubernetes 1.19.
The tool uses terminate_instance_in_auto_scaling_group to orchestrate termination of instances in an ASG-aware fashion, ensuring your target group deregistration delay is respected and allowing any remaining traffic to drain off the instance before it is actually terminated.
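A minimal sketch of that sequence, with the AWS call stubbed out: `TerminateInstanceInAutoScalingGroup` is the real EC2 Auto Scaling API action, but the helper, the wait-then-terminate ordering, and the parameter choices below are illustrative assumptions, not the tool's actual code:

```python
import time

# Sketch of ASG-aware termination: wait out the target group deregistration
# delay so in-flight traffic drains, then terminate via the Auto Scaling
# API so the group's desired capacity and lifecycle hooks are respected.
# The asg_client is expected to look like a boto3 "autoscaling" client.

def terminate_gracefully(instance_id, deregistration_delay_s, asg_client,
                         sleep=time.sleep):
    """Hypothetical helper: drain, then terminate inside the ASG."""
    sleep(deregistration_delay_s)  # let remaining connections drain off
    return asg_client.terminate_instance_in_auto_scaling_group(
        InstanceId=instance_id,
        ShouldDecrementDesiredCapacity=False,  # let the ASG replace the node
    )
```

With `ShouldDecrementDesiredCapacity=False` the ASG launches a replacement instance, which is what you want in a rolling update.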
Perhaps you can try it out - I think you will find it does what you expect :-)
Thanks for the explanation - it does exactly what we expect :-) I am closing this request, as the TAINT_NODES=true option does exactly what we want.
With RUN_MODE=1, all old nodes are cordoned at the same time, which makes the AWS ELB mark the old nodes out of service; if the new nodes take time to come into service, no healthy instances are left for a while, which causes an outage. We tried cordoning 1 node at a time and didn't see this issue. The downside is that a pod may bounce multiple times, because it may land on an old node while not all old nodes are cordoned yet; some people will be fine with one pod among multiple replicas bouncing multiple times.
Can we have RUN_MODE=5, the same as RUN_MODE=1 except that it does "cordon 1 node --> drain 1 node --> delete 1 node" one node at a time, instead of "cordon all nodes --> drain 1 node --> delete 1 node"?