It was seen that on clusters with a massive number of ALBs/Target Groups (300-400 ALBs and 400-500 Target Groups), things start to break down: once the controller gets throttled heavily, other components such as alb-ingress-controller also fail to register/deregister targets.
Also, in #27 we introduced a condition where the life of an instance cannot be extended indefinitely: a lifecycle hook is timed out after 1hr.
When throttling gets very heavy and many instances are terminating (10+), we essentially lock up the AWS API with all the calls we make, and many instances eventually get abandoned after spending an hour being throttled, which results in 5xx errors from target groups whose targets were never deregistered.
We should consider the following improvements for very large clusters:
Instead of starting the deregister waiter after a jitter of 0-180s, start it after a range that includes the deregistration delay. For example, with the default deregistration delay of 300s we know for a fact that deregistration will take at least 300 seconds, so the jitter should sit on top of that, e.g. a range of 300-400 seconds. This means far fewer calls that we already know will report the target as still deregistering, and it reduces the load on other components such as the ALB controller during initial deregistration (see the first sketch below, which also covers the wider backoff range).
Make the backoff range even larger/wider than the currently configured 5s-60s, e.g. 30s-180s.
Make the lifecycle hook timeout configurable; for massive clusters it may be acceptable to set a higher value.
Have some logic that checks how many target groups exist in the account.
Calculate how many concurrent terminations we can handle based on that count.
Queue instances that are over the limit, possibly with a higher timeout for queued instances (see the second sketch below).
Fix #34 - in such scenarios it will make recovery much faster after the instance has been abandoned.
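A minimal Go sketch of the waiter-timing idea, under stated assumptions: `WaiterConfig`, its fields, and the example numbers are illustrative and not the controller's actual types or flags. It delays the first target-health poll by the target group's deregistration delay plus jitter, and draws subsequent poll intervals from the proposed wider 30s-180s backoff range.

```go
// Hypothetical sketch only; names and thresholds are not existing controller code.
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// WaiterConfig captures the proposed tunables.
type WaiterConfig struct {
	DeregistrationDelay time.Duration // read from the target group attribute, 300s by default
	InitialJitterMin    time.Duration // jitter added on top of the deregistration delay
	InitialJitterMax    time.Duration
	BackoffMin          time.Duration // proposed 30s (currently 5s)
	BackoffMax          time.Duration // proposed 180s (currently 60s)
}

// jitter returns a random duration in [min, max).
func jitter(min, max time.Duration) time.Duration {
	if max <= min {
		return min
	}
	return min + time.Duration(rand.Int63n(int64(max-min)))
}

// initialWait is the sleep before the first target-health poll: the target
// cannot finish draining before the deregistration delay has passed, so there
// is no point polling earlier.
func (c WaiterConfig) initialWait() time.Duration {
	return c.DeregistrationDelay + jitter(c.InitialJitterMin, c.InitialJitterMax)
}

// pollBackoff is the sleep between subsequent polls, drawn from the wider range.
func (c WaiterConfig) pollBackoff() time.Duration {
	return jitter(c.BackoffMin, c.BackoffMax)
}

func main() {
	cfg := WaiterConfig{
		DeregistrationDelay: 300 * time.Second,
		InitialJitterMin:    0,
		InitialJitterMax:    100 * time.Second, // yields an effective 300-400s initial wait
		BackoffMin:          30 * time.Second,
		BackoffMax:          180 * time.Second,
	}
	fmt.Println("first poll after:", cfg.initialWait())
	fmt.Println("subsequent polls every:", cfg.pollBackoff())
}
```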
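A minimal Go sketch of the queueing idea, again with hypothetical names (`hookTimeout`, `maxConcurrentTerminations`, `terminationQueue`) and made-up thresholds: it derives a drain-concurrency budget from the target group count, exposes the lifecycle hook timeout as a flag, and parks instances over the limit behind a channel-based queue.

```go
// Hypothetical sketch only; flag name, thresholds, and types are assumptions.
package main

import (
	"flag"
	"time"
)

// Proposed flag to make the lifecycle hook timeout configurable instead of the
// fixed 1hr introduced in #27.
var hookTimeout = flag.Duration("lifecycle-hook-timeout", 1*time.Hour,
	"how long an instance may wait to finish deregistration before being abandoned")

// maxConcurrentTerminations derives a rough drain-concurrency budget from the
// number of target groups in the account: more target groups means more
// deregister/describe calls per instance, so fewer instances should drain at once.
func maxConcurrentTerminations(targetGroupCount int) int {
	switch {
	case targetGroupCount > 400:
		return 5
	case targetGroupCount > 200:
		return 10
	default:
		return 25
	}
}

// terminationQueue admits at most `limit` concurrent drains; instances over the
// limit wait for a slot or time out.
type terminationQueue struct {
	slots chan struct{}
}

func newTerminationQueue(limit int) *terminationQueue {
	return &terminationQueue{slots: make(chan struct{}, limit)}
}

// acquire blocks until a drain slot frees up or the timeout expires; a caller
// would abandon the lifecycle hook when it returns false.
func (q *terminationQueue) acquire(timeout time.Duration) bool {
	select {
	case q.slots <- struct{}{}:
		return true
	case <-time.After(timeout):
		return false
	}
}

// release frees a slot once the instance has finished deregistering.
func (q *terminationQueue) release() { <-q.slots }

func main() {
	flag.Parse()
	q := newTerminationQueue(maxConcurrentTerminations(450)) // e.g. 450 target groups -> 5 slots
	if q.acquire(*hookTimeout) {
		defer q.release()
		// ... drain the instance and wait for target deregistration ...
	}
}
```

The buffered channel keeps the sketch dependency-free; a real implementation could just as well use a workqueue that keeps heartbeating queued instances against the (higher) configurable timeout.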