keikoproj / instance-manager

Create and manage instance groups with Kubernetes
Apache License 2.0

Events: publish autoscaling events and kubernetes events #292

Open eytan-avisror opened 3 years ago

eytan-avisror commented 3 years ago

We should evaluate publishing an autoscaling group's activity history events as Kubernetes events. This might be useful for understanding why certain events happened in the cluster.

The implementation can hook into reconcile, e.g. whenever we reconcile an IG we can call https://docs.aws.amazon.com/sdk-for-go/api/service/autoscaling/#AutoScaling.DescribeScalingActivities to get the list of activities.

We should avoid pagination, since that would make the call costly across many IGs. We can also add caching with a TTL of around 10 minutes to make sure we don't get throttled.

This means that every 10 minutes (provided a reconcile happens), we publish new events. We will need to track the last ActivityId we published so that we don't re-publish events we have already published. https://docs.aws.amazon.com/sdk-for-go/api/service/autoscaling/#Activity
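The last-published tracking could look roughly like the sketch below. It assumes the activity list arrives newest first, which is how the API generally sorts it, although in-progress activities are listed ahead of completed ones. `Activity` and `NewActivities` are hypothetical names, and where the last ID is persisted (for example in the IG status or an annotation) is left open.

```go
package main

// Activity is a hypothetical local type mirroring a few fields of the
// AWS SDK's autoscaling.Activity; it is not the SDK type itself.
type Activity struct {
	ActivityId  string
	Description string
	Cause       string
}

// NewActivities takes a newest-first list of activities and the ID of the
// last activity that was already published. It returns the activities that
// have not been published yet, ordered oldest first so events are emitted
// chronologically, along with the ID to remember for the next run.
//
// If lastPublishedID is empty or no longer present in the list (for example
// because it aged out of the first page), every activity is treated as new.
func NewActivities(activities []Activity, lastPublishedID string) ([]Activity, string) {
	fresh := make([]Activity, 0, len(activities))
	for _, a := range activities {
		if a.ActivityId == lastPublishedID {
			break // everything from here on was already published
		}
		fresh = append(fresh, a)
	}
	if len(fresh) == 0 {
		return nil, lastPublishedID
	}

	// The first element is the newest activity, so it becomes the new marker.
	newest := fresh[0].ActivityId

	// Reverse in place to get chronological (oldest first) order.
	for i, j := 0, len(fresh)-1; i < j; i, j = i+1, j-1 {
		fresh[i], fresh[j] = fresh[j], fresh[i]
	}
	return fresh, newest
}
```

Each returned activity would then be turned into a Kubernetes event on the IG object, and the returned ID stored for the next reconcile. One trade-off of reading only the first page: if more activities occur between two runs than fit in a page, the older ones are never published.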

backjo commented 3 years ago

I wonder if it makes sense to use EventBridge + SQS for this. We use the AWS Node Termination Handler (https://github.com/aws/aws-node-termination-handler) with the queue processor, which uses EventBridge + SQS under the hood, and have found it to be fairly low latency for reacting to AutoScaling events.

eytan-avisror commented 3 years ago

Yeah, we use a similar mechanism with https://github.com/keikoproj/lifecycle-manager

But I'm not sure about adding an SQS queue, EventBridge, etc. It might eventually end up being more API calls than just getting the events from the ASG, no? We should definitely explore which solution is cheaper in terms of calls, what we would be asking users to do (such as giving the controller access to SQS / EventBridge), and the monetary cost of using those services, whereas making the API call is free.

The upside would be that having an SQS queue associated with the controller might be beneficial in the long run, since other features could use the same queue.