keikoproj / lifecycle-manager

Graceful AWS scaling event on Kubernetes using lifecycle hooks

Make lifecycle-manager HA #186

Open 2rs2ts opened 8 months ago

2rs2ts commented 8 months ago

Is this a BUG REPORT or FEATURE REQUEST?: Feature Request

What happened: If you run multiple replicas of this software (so that it evicting itself isn't disruptive and doesn't trigger alerts about under-replicated deployments, something a lot of cluster operators watch for to catch capacity or other uptime issues), the replicas step on each other's toes and log a bunch of errors and warnings. If you run just one replica, you get all the problems of having something important to your normal rollout operations disappear occasionally because it got evicted, which is... suboptimal, to say the least.

What you expected to happen: I would like this project to do leader election via Leases (pretty easy to do with the k8s golang SDK), and if the current leader disappears, the next leader should be able to pick up any ongoing operation such as a node drain or an in-flight lifecycle event. A sketch of the leader election side is below.
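For illustration, here is a minimal sketch of leader election with a Lease lock using client-go's leaderelection package. The lease name, the namespace, and the runManager placeholder are made up for this example and aren't names from this codebase:

```go
package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

// runManager stands in for lifecycle-manager's existing event-processing loop.
func runManager(ctx context.Context) { <-ctx.Done() }

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Each replica needs a unique identity; the pod name (hostname) works.
	id, _ := os.Hostname()

	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{
			Name:      "lifecycle-manager", // hypothetical lease name
			Namespace: "kube-system",       // hypothetical namespace
		},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:            lock,
		ReleaseOnCancel: true,
		LeaseDuration:   15 * time.Second,
		RenewDeadline:   10 * time.Second,
		RetryPeriod:     2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// Only the leader processes lifecycle events and drains nodes.
				runManager(ctx)
			},
			OnStoppedLeading: func() {
				// Lost the lease (e.g. we got evicted); exit so a standby replica takes over.
				os.Exit(0)
			},
		},
	})
}
```

With ReleaseOnCancel plus an exit in OnStoppedLeading, a standby replica should acquire the lease within roughly one lease duration of the leader being evicted.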

Implementing leader election is pretty easy, but picking up where the previous leader left off might be more involved. I'm not sure exactly how much more, since I'm not really familiar with the codebase, but I reckon that in the worst case it means either writing logic that reconstructs what the previous leader was doing (which you might already have, given that you wrote this thing without leader election and it doesn't seem to completely kill ASG updates when it evicts itself), or persisting some sort of progress state to a ConfigMap. A rough idea of the ConfigMap approach is sketched below.
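And a rough sketch of the ConfigMap idea, purely to show the shape of it. The ConfigMap name, the InFlightEvent fields, and the phase values are all invented for this example and aren't anything lifecycle-manager has today:

```go
package state

import (
	"context"
	"encoding/json"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// InFlightEvent captures just enough to resume: which instance/hook was being
// handled and how far processing got.
type InFlightEvent struct {
	InstanceID    string `json:"instanceId"`
	LifecycleHook string `json:"lifecycleHook"`
	NodeName      string `json:"nodeName"`
	Phase         string `json:"phase"` // e.g. "draining", "completing"
}

const cmName = "lifecycle-manager-state" // hypothetical ConfigMap name

// SaveProgress upserts the current event into the state ConfigMap.
func SaveProgress(ctx context.Context, c kubernetes.Interface, ns string, ev InFlightEvent) error {
	raw, err := json.Marshal(ev)
	if err != nil {
		return err
	}
	cm, err := c.CoreV1().ConfigMaps(ns).Get(ctx, cmName, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		cm = &corev1.ConfigMap{
			ObjectMeta: metav1.ObjectMeta{Name: cmName, Namespace: ns},
			Data:       map[string]string{ev.InstanceID: string(raw)},
		}
		_, err = c.CoreV1().ConfigMaps(ns).Create(ctx, cm, metav1.CreateOptions{})
		return err
	} else if err != nil {
		return err
	}
	if cm.Data == nil {
		cm.Data = map[string]string{}
	}
	cm.Data[ev.InstanceID] = string(raw)
	_, err = c.CoreV1().ConfigMaps(ns).Update(ctx, cm, metav1.UpdateOptions{})
	return err
}

// LoadProgress returns any events a previous leader left unfinished.
func LoadProgress(ctx context.Context, c kubernetes.Interface, ns string) ([]InFlightEvent, error) {
	cm, err := c.CoreV1().ConfigMaps(ns).Get(ctx, cmName, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		return nil, nil
	} else if err != nil {
		return nil, err
	}
	out := make([]InFlightEvent, 0, len(cm.Data))
	for _, v := range cm.Data {
		var ev InFlightEvent
		if err := json.Unmarshal([]byte(v), &ev); err == nil {
			out = append(out, ev)
		}
	}
	return out, nil
}
```

The new leader would call LoadProgress in its OnStartedLeading callback and resume (or at least re-validate) each event before polling for new ones.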

How to reproduce it (as minimally and precisely as possible): For the errors in the logs, just run multiple replicas and do a normal ASG update. As for the degenerate cases caused by running only one replica, well, you're probably already experiencing them in your own clusters, aren't you?

Anything else we need to know?:

Environment:

Other debugging information (if applicable):

This is one of the kinds of errors you will see if you run multiple replicas:

time="2024-01-04T22:54:32Z" level=error msg="failed to complete lifecycle action: ValidationError: No active Lifecycle Action found with instance ID i-0f51b5dffa4dd4b12\n\tstatus code: 400, request id: 92d0b3d9-711a-4482-aab8-33265e7cab49"