Closed · danielfm closed this issue 7 years ago
Out of interest, how big is your cluster?
My only concern is what happens when the node that runs the drain loop over multiple nodes tries to drain itself. I guess the "drain controller" pod will be rescheduled elsewhere, will rediscover the state of the ASGs, and will continue the drain loop, right? If so, then yes, it will be a good optimization to do.
Alternatively, a single pod can query the ASGs and update a ConfigMap, which Kubernetes then propagates to the node-drainer DaemonSet pods. Each node-drainer then checks the ConfigMap and figures out if its node needs to be drained and terminated. This way each node continues to be responsible for draining itself, but on the downside we now have two control loops: one updating the ConfigMap, another checking the ConfigMap and draining the node.
Either way, unless it grows beyond 100-150 lines of shell, I prefer to keep it as shell: it continues to be a pretty small script, and redoing it in Go makes it harder to understand and fix should there be any problem. It also won't require a separate release cycle as long as it stays inline bash, which is a benefit too.
Now on implementation details:
I think describe-auto-scaling-groups gives you the instance info and can be filtered by ASG tags (not via the command itself, sadly, but as a separate filtering step with jq), and it gives you all the same info as describe-auto-scaling-instances.
> Out of interest, how big is your cluster?
Currently ~32 instances managed by 6 ASGs.
> My only concern is what happens when the node that runs the drain loop over multiple nodes tries to drain itself. I guess the "drain controller" pod will be rescheduled elsewhere, will rediscover the state of the ASGs, and will continue the drain loop, right? If so, then yes, it will be a good optimization to do.
Yes, we'd need to address this and other corner cases in the implementation. (My argument in favor of using Go would also help us address these corner cases more easily, more on that below.)
> Either way, unless it grows beyond 100-150 lines of shell, I prefer to keep it as shell: it continues to be a pretty small script, and redoing it in Go makes it harder to understand and fix should there be any problem. It also won't require a separate release cycle as long as it stays inline bash, which is a benefit too.
My argument in favor of using a more resourceful language, aside from providing proper tooling for testing, is that it makes it easier (or even possible) to build more robust implementations (e.g. adding support for leader election).
> I think describe-auto-scaling-groups gives you the instance info and can be filtered by ASG tags (not via the command itself, sadly, but as a separate filtering step with jq), and it gives you all the same info as describe-auto-scaling-instances.
Nice, that's good to know. :)
Also, the release cycle could be independent as well; if the user wants to use a newer version of the node drainer, he/she could specify the desired version in cluster.yaml and run kube-aws update to apply it, just like it's done today. In fact, I see this as an advantage.
> Alternatively, a single pod can query the ASGs and update a ConfigMap, which Kubernetes then propagates to the node-drainer DaemonSet pods. Each node-drainer then checks the ConfigMap and figures out if its node needs to be drained and terminated. This way each node continues to be responsible for draining itself, but on the downside we now have two control loops: one updating the ConfigMap, another checking the ConfigMap and draining the node.
Yes, that's one way to solve this issue. It would be easy enough to do that in plain bash as well, if the consensus is to keep things this way.
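To make the ConfigMap variant concrete, here is a minimal sketch of the drainer-side check, assuming the ConfigMap is mounted as a file with one terminating instance id per line (the mount path, file name, and variable names below are assumptions for illustration, not the actual implementation):

```shell
#!/bin/sh
# Sketch of the drainer-side check in the ConfigMap approach. The ConfigMap
# written by the watcher pod is assumed to be mounted as a plain file
# containing one EC2 instance id per line.
TERMINATING_FILE="${TERMINATING_FILE:-/etc/node-drainer/terminating-instances}"

should_drain() {
  # $1: this node's EC2 instance id; returns success if it is listed
  grep -qx "$1" "$TERMINATING_FILE" 2>/dev/null
}

# Control loop (not executed here):
# while true; do
#   if should_drain "$INSTANCE_ID"; then
#     kubectl drain "$NODE_NAME" --ignore-daemonsets --force
#     # ...complete the ASG lifecycle action, then exit...
#     break
#   fi
#   sleep 10
# done
```

The pure check is factored out of the loop so it can be tested without a cluster.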
@redbaron another implementation could be as follows:
# On the ASG watcher pod
# Annotate nodes that are being terminated
aws autoscaling describe-auto-scaling-groups | jq '.AutoScalingGroups[] | select(.Tags[].Key | contains("kube-aws:")) | .Instances[] | select(.LifecycleState == "Terminating:Wait") | .InstanceId'
kubectl get nodes -o json | jq '.items[] | select(.spec.externalID | contains("id-1", "id-2", "id-N")) | .metadata.name'
kubectl annotate <node-1> <node-2> <node-N> kube-aws.coreos.com/node-termination-scheduled=true
# On the drainer pods (still a daemonset), we check for the annotation and drain its own node if the annotation is present
This is slightly better since it does not rely on a ConfigMap just to tell which nodes must be drained.
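On the drainer side, the per-node check could stay a tiny shell loop. A hedged sketch follows, using the annotation key proposed above; NODE_NAME is assumed to come from the downward API, and the jsonpath expression is my assumption of how the annotation would be read:

```shell
#!/bin/sh
# Sketch of the per-node drainer loop in the annotation approach.
# NODE_NAME would be injected via the Kubernetes downward API.
ANNOTATION_KEY="kube-aws.coreos.com/node-termination-scheduled"

termination_scheduled() {
  # $1: the annotation value fetched for this node
  [ "$1" = "true" ]
}

# Main loop (not executed here):
# while true; do
#   value=$(kubectl get node "$NODE_NAME" \
#     -o "jsonpath={.metadata.annotations['kube-aws\.coreos\.com/node-termination-scheduled']}")
#   if termination_scheduled "$value"; then
#     kubectl drain "$NODE_NAME" --ignore-daemonsets --force
#     break
#   fi
#   sleep 10
# done
```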
That is a very nice way to do it! Do pods have access to node labels/annotations? Not without querying the apiserver, right?
Going with annotations, we can also utilise an already existing reboot controller instead of a custom DaemonSet if it can do node draining, but if no such controller exists then I'd prefer a simple shell while-true loop :)
> That is a very nice way to do it! Do pods have access to node labels/annotations? Not without querying the apiserver, right?
Yes.
I'm running a modified version of the node drainer, just for testing purposes, and it seems to be working fine.
> Going with annotations, we can also utilise an already existing reboot controller instead of a custom DaemonSet if it can do node draining, but if no such controller exists then I'd prefer a simple shell while-true loop :)
I don't know if we can use any of the reboot controllers, because in our case we need to call aws autoscaling complete-lifecycle-action when the draining is complete, which is a very AWS-specific step.
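For reference, that final step would look roughly like the sketch below. The hook and ASG names are placeholders, and the small helper that assembles the command is purely illustrative (it lets the command be inspected without calling AWS):

```shell
#!/bin/sh
# Sketch: assemble the complete-lifecycle-action call that must run after a
# successful drain. The hook/ASG names are placeholders, not real values.
build_complete_cmd() {
  asg="$1"; hook="$2"; instance="$3"
  echo "aws autoscaling complete-lifecycle-action" \
       "--auto-scaling-group-name $asg" \
       "--lifecycle-hook-name $hook" \
       "--instance-id $instance" \
       "--lifecycle-action-result CONTINUE"
}

# Real invocation (not executed here):
# eval "$(build_complete_cmd "$ASG_NAME" "$HOOK_NAME" "$INSTANCE_ID")"
```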
I'm consistently being throttled on DescribeAutoScalingInstances and DescribeAutoScalingGroups requests to the AWS API, most likely due to having both the cluster autoscaler and node drainer features enabled.
For the cluster autoscaler, we can provide a way for users to customize the scan interval in order to compensate for larger clusters with a high number of auto scaling groups. I'm not familiar enough with the cluster autoscaler code to know if there is anything we can optimize there, so I'm not commenting on that.
Now, regarding the node drainer, we can probably make the current implementation much more efficient by running it in a single pod (instead of as a DaemonSet) that does something like this:
Note this is not the whole program, just some snippets to exemplify the idea. We can maybe take this opportunity to rewrite this thing (again) in Go, with some unit tests and everything.
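To illustrate what such a single-pod watcher might look like, here is a hedged sketch in shell. The tag prefix and annotation key follow the discussion above; the polling interval and structure are assumptions, and the jq filtering is factored into a function so it can be exercised against a sample payload:

```shell
#!/bin/sh
# Sketch of the single-pod ASG watcher: extract instance ids stuck in
# Terminating:Wait from the describe-auto-scaling-groups payload, then
# annotate the matching nodes so their local drainers take over.

terminating_instance_ids() {
  # Reads describe-auto-scaling-groups JSON on stdin; emits one id per line.
  jq -r '.AutoScalingGroups[]
    | select([.Tags[].Key] | any(startswith("kube-aws:")))
    | .Instances[]
    | select(.LifecycleState == "Terminating:Wait")
    | .InstanceId'
}

# Watcher loop (not executed here):
# while true; do
#   for id in $(aws autoscaling describe-auto-scaling-groups | terminating_instance_ids); do
#     node=$(kubectl get nodes -o json \
#       | jq -r --arg id "$id" '.items[] | select(.spec.externalID == $id) | .metadata.name')
#     [ -n "$node" ] && kubectl annotate node "$node" \
#       "kube-aws.coreos.com/node-termination-scheduled=true" --overwrite
#   done
#   sleep 30
# done
```

One pod polling the AWS API on a fixed interval also bounds the request rate, which directly addresses the throttling described above.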
What do you guys think?