Closed ZihanJiang96 closed 8 months ago
Attention: 12 lines in your changes are missing coverage. Please review.
Comparison is base (a70d012) 69.78% compared to head (34c9500) 69.89%.
Files | Patch % | Lines |
---|---|---|
pkg/service/server.go | 60.00% | 9 Missing and 3 partials :warning: |
Closing this; I created a simpler version in #181.
Issue
When we terminate a large number of nodes at the same time, say 600 nodes, lifecycle-manager can only process 75 node events per minute, so working through all of them takes 600/75 = 8 min. If the ASG lifecycle hook's heartbeat timeout is set to 300s, some node events are never processed in time: once the 300s timeout expires, the ASG terminates the node directly without a proper drain, which leads to ungraceful pod shutdown.

Fixes/Improvements
- Increase `QPS` from 5 to 600 and `Burst` from 10 to 2000.
- Increase the `eventStream` channel size from 1 to 100.
- Add `nodeMetadataMap` to cache the node instanceID --> node name mapping (reduces k8s client list calls).
- Increase `MaxNumberOfMessages` from 1 to 10.
- Increase `VisibilityTimeout` from 30 to 120 (avoids duplicate messages).

After the improvements, lifecycle-manager could handle 200 node messages per minute.