keikoproj / lifecycle-manager

Graceful AWS scaling event on Kubernetes using lifecycle hooks
Apache License 2.0

Performance optimization: avoid some list calls, increase queue length, increase client-go QPS #181

Closed ZihanJiang96 closed 8 months ago

ZihanJiang96 commented 8 months ago

Issue

When we terminate a large number of nodes at the same time, say 600 nodes, lifecycle-manager can only process 75 node events per minute, so working through all of them takes 600/75 = 8 minutes. If the ASG lifecycle hook's heartbeat timeout is set to 300s, some of the node events will not be processed before the timeout expires. After the 300s timeout, the ASG terminates those nodes directly without a proper drain, which leads to ungraceful pod shutdown.

Fixes/Improvements

  1. Increase client-go QPS from 5 to 600 and Burst from 10 to 2000.
  2. Increase the eventStream channel size from 1 to 100.
  3. Add nodeMetadataMap to cache the node instanceID --> node name mapping (reduces the number of k8s client list calls).
  4. Increase MaxNumberOfMessages per SQS ReceiveMessage call from 1 to 10.
  5. Increase VisibilityTimeout per SQS ReceiveMessage call from 30 to 120 seconds (to avoid duplicate messages).
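The caching idea in item 3 can be illustrated with a minimal sketch. This is not the PR's implementation: the `nodeCache` type, its fields, and the `listNodes` callback are hypothetical stand-ins, and the callback replaces a real Kubernetes node list call so that the example runs with the standard library only.

```go
package main

import (
	"fmt"
	"sync"
)

// nodeCache is a hypothetical stand-in for the nodeMetadataMap idea:
// it remembers the instance ID -> node name mapping between lookups.
type nodeCache struct {
	mu    sync.RWMutex
	names map[string]string // instance ID -> node name

	// listNodes is a hypothetical callback standing in for a Kubernetes
	// node list call. It returns the mapping for all current nodes.
	listNodes func() (map[string]string, error)
}

func newNodeCache(list func() (map[string]string, error)) *nodeCache {
	return &nodeCache{names: map[string]string{}, listNodes: list}
}

// nodeName returns the node name for an instance ID. It serves the answer
// from the cache when possible and only lists nodes on a cache miss.
func (c *nodeCache) nodeName(instanceID string) (string, bool, error) {
	c.mu.RLock()
	name, ok := c.names[instanceID]
	c.mu.RUnlock()
	if ok {
		return name, true, nil
	}

	// Cache miss: refresh the whole mapping with a single list call.
	fresh, err := c.listNodes()
	if err != nil {
		return "", false, err
	}
	c.mu.Lock()
	c.names = fresh
	c.mu.Unlock()

	name, ok = fresh[instanceID]
	return name, ok, nil
}

func main() {
	calls := 0
	cache := newNodeCache(func() (map[string]string, error) {
		calls++ // count how often the (simulated) list call happens
		return map[string]string{"i-0abc": "node-a", "i-0def": "node-b"}, nil
	})

	for _, id := range []string{"i-0abc", "i-0def", "i-0abc"} {
		name, ok, err := cache.nodeName(id)
		fmt.Println(id, name, ok, err)
	}
	fmt.Println("list calls:", calls)
}
```

In this sketch, three lookups trigger only one list call, because the first miss fills the cache for every node. A real cache would also need to handle nodes that join or leave the cluster.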

After the improvements, lifecycle-manager could handle 200 node messages per minute.

codecov[bot] commented 8 months ago

Codecov Report

Attention: 12 lines in your changes are missing coverage. Please review.

Comparison is base (a70d012) 69.78% compared to head (34c9500) 69.89%.

| Files | Patch % | Lines |
|---|---|---|
| pkg/service/server.go | 60.00% | 9 Missing and 3 partials :warning: |

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master     #181      +/-   ##
==========================================
+ Coverage   69.78%   69.89%   +0.10%
==========================================
  Files          12       12
  Lines        1314     1322       +8
==========================================
+ Hits          917      924       +7
- Misses        325      326       +1
  Partials       72       72
```


ZihanJiang96 commented 8 months ago

Closing. I created a simpler version in #181.