keikoproj / lifecycle-manager

Graceful AWS scaling event on Kubernetes using lifecycle hooks
Apache License 2.0

increase qps and burst to 100 #184

Closed ZihanJiang96 closed 8 months ago

ZihanJiang96 commented 8 months ago

Issue

When we terminate a large number of nodes at the same time, say 600, lifecycle-manager can only process 75 node events per minute, so working through all of them takes 600/75 = 8 minutes. If the ASG lifecycle hook's heartbeat timeout is set to 300s, some node events are never processed in time: once the 300s timeout expires, the ASG terminates those nodes directly without a proper drain, which leads to ungraceful pod shutdown.
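The arithmetic above can be sketched in a few lines of Go. This is only a back-of-the-envelope model, assuming all node events arrive at once, every hook has the same 300s timeout, and the processing rate is constant; the function names are hypothetical and not part of lifecycle-manager.

```go
package main

import "fmt"

// minutesToDrain returns how many whole minutes are needed to process
// `nodes` events at `perMinute` events per minute, rounded up.
func minutesToDrain(nodes, perMinute int) int {
	return (nodes + perMinute - 1) / perMinute
}

// processedBeforeTimeout returns how many node events can be handled
// before the lifecycle hook timeout (in seconds) expires.
func processedBeforeTimeout(nodes, perMinute, timeoutSeconds int) int {
	done := perMinute * timeoutSeconds / 60
	if done > nodes {
		return nodes
	}
	return done
}

func main() {
	nodes, timeout := 600, 300
	// 75/min is the rate reported in the issue; 110/min is the rate
	// reported after the change.
	for _, rate := range []int{75, 110} {
		done := processedBeforeTimeout(nodes, rate, timeout)
		fmt.Printf("rate=%d/min: full drain needs %d min, %d handled within %ds, %d missed\n",
			rate, minutesToDrain(nodes, rate), done, timeout, nodes-done)
	}
}
```

Under these assumptions, 75 events per minute handles 375 of the 600 nodes within the 300s window and misses 225.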

Fixes/Improvements

  1. Increase client-go QPS from 5 to 100, Burst from 10 to 100

With this change, we are able to process 110 nodes per minute.

codecov[bot] commented 8 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Comparison is base (a70d012) 69.78% compared to head (e82d0b9) 69.78%.

Additional details and impacted files

```diff
@@           Coverage Diff           @@
##           master     #184   +/-   ##
=======================================
  Coverage   69.78%   69.78%
=======================================
  Files          12       12
  Lines        1314     1314
=======================================
  Hits          917      917
  Misses        325      325
  Partials       72       72
```


shreyas-badiger commented 8 months ago

I think we can keep Burst slightly higher, maybe twice the QPS (100 and 200 in this case)? https://github.com/kubernetes/client-go/blob/5a0a4247921dd9e72d158aaa6c1ee124aba1da80/util/flowcontrol/throttle.go#L61C34-L61C34

It looks like Burst is just the initial allocation of tokens for querying the API server. Once the Burst is exhausted, requests are limited by the QPS.
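To illustrate how Burst and QPS interact, here is a minimal token-bucket simulation in plain Go. It is a simplified stand-in, not client-go's actual rate limiter: it assumes a bucket that starts full at `burst` tokens, refills by `qps` tokens once per second (capped at `burst`), and a caller with unlimited demand. The `bucket` type and `sent` function are hypothetical names chosen for this sketch.

```go
package main

import "fmt"

// bucket is a simplified token bucket with whole-second refills.
type bucket struct {
	qps    int // tokens added per second
	burst  int // maximum number of tokens the bucket can hold
	tokens int // tokens currently available
}

// newBucket returns a bucket that starts full.
func newBucket(qps, burst int) *bucket {
	return &bucket{qps: qps, burst: burst, tokens: burst}
}

// take consumes one token if available and reports whether it succeeded.
func (b *bucket) take() bool {
	if b.tokens == 0 {
		return false
	}
	b.tokens--
	return true
}

// tick advances time by one second, refilling up to the burst cap.
func (b *bucket) tick() {
	b.tokens += b.qps
	if b.tokens > b.burst {
		b.tokens = b.burst
	}
}

// sent counts how many requests get through in `seconds` seconds when the
// caller always has more requests waiting than the limiter allows.
func sent(qps, burst, seconds int) int {
	b := newBucket(qps, burst)
	n := 0
	for s := 0; s < seconds; s++ {
		for b.take() {
			n++
		}
		b.tick()
	}
	return n
}

func main() {
	for _, c := range [][2]int{{5, 10}, {100, 100}, {100, 200}} {
		fmt.Printf("qps=%d burst=%d: %d requests in 60s\n", c[0], c[1], sent(c[0], c[1], 60))
	}
}
```

In this model, raising Burst from 100 to 200 at QPS 100 adds only a one-time head start of 100 requests over the minute; the sustained rate is still set by QPS. A larger Burst mainly helps when the client has been idle long enough for the bucket to refill before a spike of node events arrives.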