keikoproj / lifecycle-manager

Graceful AWS scaling event on Kubernetes using lifecycle hooks

Lifecycle manager terminates instances immediately if the API becomes unavailable #189

Open omgrr opened 7 months ago

omgrr commented 7 months ago

Is this a BUG REPORT or FEATURE REQUEST?:

Feature Request

What happened:

If the API becomes unavailable or unreachable during an instance refresh, the drain command fails and returns instantly (ignoring any drain timeout), and lifecycle-manager then ABANDONs the terminating lifecycle hook. The instance is therefore terminated immediately. It then does the same for every instance in your ASG, which is obviously quite destructive.

These are some choice log lines with a 30-second drain interval (I ran this on my branch with the drain interval put back in), retries set to 3, and a drain timeout of 2 minutes:

2024-01-16T16:33:50.953389292Z stderr F time="2024-01-16T16:33:50Z" level=info msg="retrying drain, <node name>"
     # The first drain fails, and then it waits 30 seconds because of the drain interval

2024-01-16T16:34:20.960329596Z stderr F time="2024-01-16T16:34:20Z" level=error msg="failed to drain node <node name>, error: node not found"
2024-01-16T16:34:20.960333553Z stderr F time="2024-01-16T16:34:20Z" level=info msg="retrying drain, node <node name>"
     # These next two log lines show the drain failing immediately and being retried.
     # Again, the drain timeout doesn't protect us here because the API call
     # returns immediately. It then waits 30 more seconds before repeating this.

2024-01-16T16:35:21.040452793Z stderr F time="2024-01-16T16:35:21Z" level=error msg="failed to drain node <node name>, error: node not found"
2024-01-16T16:35:21.040456625Z stderr F time="2024-01-16T16:35:21Z" level=info msg="retrying drain, node <node name>"
     # Same thing again: the drain fails and is retried immediately.
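
To make the shape of the failure concrete, here's an illustrative Go sketch (not the actual lifecycle-manager code) of a retry loop like the one producing the logs above. With the API down, each attempt fails instantly, so the drain timeout never comes into play and the total wait is just retries × interval:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// drainNode stands in for the real drain call. With the apiserver down it
// errors out immediately, so the per-attempt timeout is never exercised.
func drainNode(node string, timeout time.Duration) error {
	return errors.New("node not found")
}

func drainWithRetries(node string, retries int, interval, timeout time.Duration) error {
	var err error
	for i := 0; i < retries; i++ {
		if err = drainNode(node, timeout); err == nil {
			return nil
		}
		fmt.Printf("failed to drain node %s, error: %v\n", node, err)
		fmt.Printf("retrying drain, node %s\n", node)
		time.Sleep(interval) // the only real wait: retries * interval, then ABANDON
	}
	return err
}

func main() {
	// retries=3, interval=30s, timeout=2m: the total wait is ~90s of sleeps,
	// not the 2-minute drain timeout, because every attempt fails instantly.
	_ = drainWithRetries("<node name>", 3, 30*time.Second, 2*time.Minute)
}
```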

What you expected to happen:

When I'm in the middle of a worker rotation, it would be better if lifecycle-manager didn't abandon the lifecycle hooks while the API is down, and instead offered a configurable way to make it wait.

I also don't want to use the drain interval to make it wait. For example, in the case of a PDB blocking a drain, I might want my instances to wait 10 minutes; but if the API were down, I would want to extend that wait indefinitely, or until the lifecycle timeout period, which is 48 hours. There is no way to express this with the current settings.
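
As a rough illustration of what I have in mind (all names below are hypothetical, not existing lifecycle-manager options or code): when the API is unreachable, keep the hook alive with heartbeats and poll until the API returns or the lifecycle deadline expires, rather than burning down the drain retries:

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// apiReachable is a stand-in probe; the real check would hit whatever
// apiserver endpoint the drain client is configured against.
func apiReachable(addr string) bool {
	conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

// waitForAPI blocks until the apiserver answers again or the hook's own
// deadline expires, heartbeating the lifecycle action instead of abandoning it.
func waitForAPI(addr string, deadline time.Time, heartbeat func()) error {
	for time.Now().Before(deadline) {
		if apiReachable(addr) {
			return nil
		}
		heartbeat() // e.g. RecordLifecycleActionHeartbeat, to keep the hook pending
		time.Sleep(30 * time.Second)
	}
	return fmt.Errorf("apiserver still unreachable at lifecycle deadline")
}

func main() {
	// Illustration only: wait up to the 48h lifecycle window (address is a
	// placeholder) for the API to come back before resuming the drain.
	deadline := time.Now().Add(48 * time.Hour)
	_ = waitForAPI("10.0.0.10:6443", deadline, func() {})
}
```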

How to reproduce it (as minimally and precisely as possible):

You just need to start an instance refresh and then make the apiserver unreachable. I did this by stopping the actual apiserver process.
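
For the instance-refresh half, something like this sketch using aws-sdk-go-v2 works (the ASG name is a placeholder):

```go
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/autoscaling"
)

func main() {
	cfg, err := config.LoadDefaultConfig(context.TODO())
	if err != nil {
		log.Fatal(err)
	}
	client := autoscaling.NewFromConfig(cfg)

	// Kick off the rolling replacement of instances in the ASG.
	out, err := client.StartInstanceRefresh(context.TODO(), &autoscaling.StartInstanceRefreshInput{
		AutoScalingGroupName: aws.String("my-worker-asg"), // placeholder name
	})
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("instance refresh started: %s", aws.ToString(out.InstanceRefreshId))

	// Then stop the apiserver process; lifecycle-manager starts failing
	// drains and abandoning hooks almost immediately.
}
```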

Anything else we need to know?:

Before thinking more about what these configuration options might look like, or about the implementation, I wanted to see whether this is something you would be interested in or have thought about. If so, I'm happy to take a stab at it!

Environment: