keikoproj / lifecycle-manager

Graceful AWS scaling event on Kubernetes using lifecycle hooks
Apache License 2.0

EC2 failures and graceful shutdown can cause prolonged errors #44

Closed eytan-avisror closed 4 years ago

eytan-avisror commented 4 years ago

If an EC2 failure occurs and the node is then terminated by the ASG or by a person, lifecycle-manager receives the hook and starts the drain/deregister flow. In this case the drain keeps failing until --drain-timeout expires. This keeps the instance alive in the meantime, and applications can see errors because the instance is still registered in target groups.

We should evaluate whether to deregister only, or skip the flow altogether, when the node state is unknown.
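
To make the proposal concrete, here is a minimal sketch of what such a decision could look like. It is not lifecycle-manager's actual code: the names `NodeReadiness`, `Action`, `chooseAction`, and the `deregisterWhenUnknown` switch are all hypothetical, and the sketch assumes the node state is taken from the Kubernetes `Ready` condition, whose status is one of `True`, `False`, or `Unknown`.

```go
package main

import "fmt"

// NodeReadiness is a hypothetical stand-in for the status of a node's
// Ready condition, which Kubernetes reports as "True", "False" or "Unknown".
type NodeReadiness string

const (
	ReadyTrue    NodeReadiness = "True"
	ReadyFalse   NodeReadiness = "False"
	ReadyUnknown NodeReadiness = "Unknown"
)

// Action is a hypothetical enumeration of what to do when a
// termination lifecycle hook is received.
type Action int

const (
	DrainAndDeregister Action = iota // current behavior: drain, then deregister
	DeregisterOnly                   // skip the drain, only remove from target groups
	SkipAll                          // complete the hook without drain or deregister
)

func (a Action) String() string {
	switch a {
	case DrainAndDeregister:
		return "drain-and-deregister"
	case DeregisterOnly:
		return "deregister-only"
	default:
		return "skip"
	}
}

// chooseAction picks a flow based on node readiness. When the state is
// Unknown (for example after an EC2 failure), a drain is likely to block
// until the drain timeout, so the drain is skipped. deregisterWhenUnknown
// is an assumed switch selecting between the two options raised above.
func chooseAction(ready NodeReadiness, deregisterWhenUnknown bool) Action {
	if ready == ReadyUnknown {
		if deregisterWhenUnknown {
			return DeregisterOnly
		}
		return SkipAll
	}
	// Healthy or explicitly not-ready nodes keep the existing flow.
	return DrainAndDeregister
}

func main() {
	for _, r := range []NodeReadiness{ReadyTrue, ReadyFalse, ReadyUnknown} {
		fmt.Printf("ready=%s -> %s\n", r, chooseAction(r, true))
	}
}
```

Deregister-only still removes the instance from target groups quickly, which addresses the errors applications see, while skipping altogether simply lets the termination proceed without waiting.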