Closed · danielfm closed this issue 7 years ago
Out of interest, how big is your cluster?
My only concern is what happens when the node that runs the drain loop over multiple nodes tries to drain itself. I guess the "drain controller" pod will be rescheduled elsewhere, will rediscover the state of the ASGs, and will continue the drain loop, right? If so, then yes, it will be a good optimization to do.
Alternatively, a single pod can query the ASGs and update a ConfigMap, which Kubernetes then propagates to the node-drainer DaemonSet pods. Each node-drainer then checks the ConfigMap and figures out if its node needs to be drained and terminated. This way each node continues to be responsible for draining itself, but on the downside we now have two control loops: one updating the ConfigMap, another checking the ConfigMap and draining the node.
Either way, unless it grows beyond 100-150 lines of shell, I prefer to keep it as shell: it continues to be a pretty small script, and redoing it in Go makes it harder to understand and fix should there be any problem. It also won't require a separate release cycle as long as it stays inline bash, which is a benefit too.
Now on implementation details:
I think describe-auto-scaling-groups gives you the instance info and can be filtered by ASG tags (not via the command itself, sadly, but as a separate filtering step with jq), and it gives you all the same info as describe-auto-scaling-instances.
> Out of interest, how big is your cluster?
Currently ~32 instances managed by 6 ASGs.
> My only concern is what happens when the node that runs the drain loop over multiple nodes tries to drain itself. I guess the "drain controller" pod will be rescheduled elsewhere, will rediscover the state of the ASGs, and will continue the drain loop, right? If so, then yes, it will be a good optimization to do.
Yes, we'd need to address this and other corner cases in the implementation. (My argument in favor of using Go would also help us address these corner cases more easily, more on that below.)
> Either way, unless it grows beyond 100-150 lines of shell, I prefer to keep it as shell: it continues to be a pretty small script, and redoing it in Go makes it harder to understand and fix should there be any problem. It also won't require a separate release cycle as long as it stays inline bash, which is a benefit too.
My argument in favor of using a more resourceful language, aside from providing proper tooling for testing, is that it makes it easier (or even possible) to build more robust implementations (e.g. adding support for leader election).
> I think describe-auto-scaling-groups gives you the instance info and can be filtered by ASG tags (not via the command itself, sadly, but as a separate filtering step with jq), and it gives you all the same info as describe-auto-scaling-instances.
Nice, that's good to know. :)
Also, the release cycle could be independent as well; if the user wants to use a newer version of the node drainer, he/she could specify the desired version in cluster.yaml and run kube-aws update to apply it, just like it's done today. In fact, I see this as an advantage.
> Alternatively, a single pod can query the ASGs and update a ConfigMap, which Kubernetes then propagates to the node-drainer DaemonSet pods. Each node-drainer then checks the ConfigMap and figures out if its node needs to be drained and terminated. This way each node continues to be responsible for draining itself, but on the downside we now have two control loops: one updating the ConfigMap, another checking the ConfigMap and draining the node.
Yes, that's one way to solve this issue. It would be easy enough to do that in plain bash as well, if the consensus is to keep things this way.
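To make the ConfigMap variant concrete, here is a minimal sketch of the drainer-side check, assuming the ConfigMap is mounted as a file with one terminating instance id per line (the mount path, file name, and variable names below are assumptions for illustration, not the actual implementation):

```shell
#!/bin/sh
# Sketch of the drainer-side check in the ConfigMap approach. The ConfigMap
# written by the watcher pod is assumed to be mounted as a plain file
# containing one EC2 instance id per line.
TERMINATING_FILE="${TERMINATING_FILE:-/etc/node-drainer/terminating-instances}"

should_drain() {
  # $1: this node's EC2 instance id; returns success if it is listed
  grep -qx "$1" "$TERMINATING_FILE" 2>/dev/null
}

# Control loop (not executed here):
# while true; do
#   if should_drain "$INSTANCE_ID"; then
#     kubectl drain "$NODE_NAME" --ignore-daemonsets --force
#     # ...complete the ASG lifecycle action, then exit...
#     break
#   fi
#   sleep 10
# done
```

The pure check is factored out of the loop so it can be tested without a cluster.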
@redbaron another implementation could be as follows:
# On the ASG watcher pod
# Annotate nodes that are being terminated
aws autoscaling describe-auto-scaling-groups | jq '.AutoScalingGroups[] | select(.Tags[].Key | contains("kube-aws:")) | .Instances[] | select(.LifecycleState == "Terminating:Wait") | .InstanceId'
kubectl get nodes -o json | jq '.items[] | select(.spec.externalID | contains("id-1", "id-2", "id-N")) | .metadata.name'
kubectl annotate <node-1> <node-2> <node-N> kube-aws.coreos.com/node-termination-scheduled=true
# On the drainer pods (still a daemonset), we check for the annotation and drain its own node if the annotation is present
This is slightly better since it does not rely on a ConfigMap just to tell which nodes must be drained.
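On the drainer side, the per-node check could stay a tiny shell loop. A hedged sketch follows, using the annotation key proposed above; NODE_NAME is assumed to come from the downward API, and the jsonpath expression is my assumption of how the annotation would be read:

```shell
#!/bin/sh
# Sketch of the per-node drainer loop in the annotation approach.
# NODE_NAME would be injected via the Kubernetes downward API.
ANNOTATION_KEY="kube-aws.coreos.com/node-termination-scheduled"

termination_scheduled() {
  # $1: the annotation value fetched for this node
  [ "$1" = "true" ]
}

# Main loop (not executed here):
# while true; do
#   value=$(kubectl get node "$NODE_NAME" \
#     -o "jsonpath={.metadata.annotations['kube-aws\.coreos\.com/node-termination-scheduled']}")
#   if termination_scheduled "$value"; then
#     kubectl drain "$NODE_NAME" --ignore-daemonsets --force
#     break
#   fi
#   sleep 10
# done
```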
That is a very nice way to do it! Do pods have access to node labels/annotations? Not without querying the apiserver, right?
Going with annotations, we can also utilise an already existing reboot controller instead of a custom DaemonSet if it can do node draining, but if no such controller exists then I'd prefer a simple shell while-true loop :)
> That is a very nice way to do it! Do pods have access to node labels/annotations? Not without querying the apiserver, right?
Yes.
I'm running a modified version of the node drainer, just for testing purposes, and it seems to be working fine.
> Going with annotations, we can also utilise an already existing reboot controller instead of a custom DaemonSet if it can do node draining, but if no such controller exists then I'd prefer a simple shell while-true loop :)
I don't know if we can use any of the reboot controllers, because in our case we need to call aws autoscaling complete-lifecycle-action when the draining is complete, which is a very AWS-specific step.
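For reference, that final step would look roughly like the sketch below. The hook and ASG names are placeholders, and the small helper that assembles the command is purely illustrative (it lets the command be inspected without calling AWS):

```shell
#!/bin/sh
# Sketch: assemble the complete-lifecycle-action call that must run after a
# successful drain. The hook/ASG names are placeholders, not real values.
build_complete_cmd() {
  asg="$1"; hook="$2"; instance="$3"
  echo "aws autoscaling complete-lifecycle-action" \
       "--auto-scaling-group-name $asg" \
       "--lifecycle-hook-name $hook" \
       "--instance-id $instance" \
       "--lifecycle-action-result CONTINUE"
}

# Real invocation (not executed here):
# eval "$(build_complete_cmd "$ASG_NAME" "$HOOK_NAME" "$INSTANCE_ID")"
```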
I'm consistently being throttled on DescribeAutoScalingInstances and DescribeAutoScalingGroups requests to the AWS API, most likely due to having both the cluster autoscaler and node drainer features enabled.
For the cluster autoscaler, we can provide a way for users to customize the scan interval in order to compensate for larger clusters with a high number of auto scaling groups. I'm not familiar enough with the cluster autoscaler code to know if there is anything we can optimize there, so I'm not commenting on that.
Now, regarding the node drainer, we can probably make the current implementation much more efficient by running it in a single pod (instead of as a DaemonSet) that does something like this:
Note this is not the whole program, just some snippets to exemplify the idea. We can maybe take this opportunity to rewrite this thing (again) in Go, with some unit tests and everything.
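To illustrate what such a single-pod watcher might look like, here is a hedged sketch in shell. The tag prefix and annotation key follow the discussion above; the polling interval and structure are assumptions, and the jq filtering is factored into a function so it can be exercised against a sample payload:

```shell
#!/bin/sh
# Sketch of the single-pod ASG watcher: extract instance ids stuck in
# Terminating:Wait from the describe-auto-scaling-groups payload, then
# annotate the matching nodes so their local drainers take over.

terminating_instance_ids() {
  # Reads describe-auto-scaling-groups JSON on stdin; emits one id per line.
  jq -r '.AutoScalingGroups[]
    | select([.Tags[].Key] | any(startswith("kube-aws:")))
    | .Instances[]
    | select(.LifecycleState == "Terminating:Wait")
    | .InstanceId'
}

# Watcher loop (not executed here):
# while true; do
#   for id in $(aws autoscaling describe-auto-scaling-groups | terminating_instance_ids); do
#     node=$(kubectl get nodes -o json \
#       | jq -r --arg id "$id" '.items[] | select(.spec.externalID == $id) | .metadata.name')
#     [ -n "$node" ] && kubectl annotate node "$node" \
#       "kube-aws.coreos.com/node-termination-scheduled=true" --overwrite
#   done
#   sleep 30
# done
```

One pod polling the AWS API on a fixed interval also bounds the request rate, which directly addresses the throttling described above.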
What do you guys think?