keikoproj / lifecycle-manager

Graceful AWS scaling event on Kubernetes using lifecycle hooks
Apache License 2.0

Further improvements to API calls #35

Open eytan-avisror opened 4 years ago

eytan-avisror commented 4 years ago

On clusters with very large numbers of ALBs and target groups (300-400 ALBs plus 400-500 target groups), things start to break down: once the controller starts getting throttled heavily, other components such as alb-ingress-controller also fail to register/deregister targets.

Also, in #27 we introduced a condition so that the life of an instance cannot be extended indefinitely: a lifecycle hook now times out after 1 hour.

When throttling gets very heavy and many instances are terminating at once (10+), we effectively lock up the AWS API with all the calls we are making. Many instances are eventually abandoned after spending an hour being throttled, which results in 5xx errors for target groups whose targets were never deregistered.

We should consider the following improvements for very large clusters: