Open eytan-avisror opened 4 years ago
One improvement that may improve this is refactoring server.go This block should not be called if message already exist, we should check
We should move these checks up https://github.com/keikoproj/lifecycle-manager/blob/48c3bc778f5a04df01b792994ae1406b371e8e29/pkg/service/server.go#L164-L182
Message validation has been refactored to avoid spinning up worker goroutines for invalid/duplicate messages, this might help a bit. But it seems the best improvement we can make here is around the deregistration waiters which spin up N Cluster TargetGroups Terminating Instances goroutines, which make up 99% of the goroutines we spin up. We should come up with some approach to do this differently if possible.
It was noticed when a huge spike of terminating instances happen (150 instances + another 150 instances after few minutes), we may not have enough goroutines or do not prioritize some goroutines efficiently.
The result was that and event that was received started processing 5-6 minutes later and by that time the heartbeat had already expired.
It's possible that
newWorker
goroutine is trying to spawn but there are so many other goroutines that it cannot.We should load test and evaulate:
newWorker
andsendHeartbeat
goroutines.The goal is to make sure we can handle large spikes of scale-downs without starvation