keikoproj / lifecycle-manager

Graceful AWS scaling event on Kubernetes using lifecycle hooks
Apache License 2.0
92 stars 28 forks source link

Prevent conflicts when running multiple lifecycle-managers #71

Closed mrak closed 3 years ago

mrak commented 3 years ago

Description

No argument or functionality changes. This is an internal machinery change that annotates Node objects with another annotation (in addition to the original in-progress annotation) that signifies the SQS queue name that the Node's lifecycle is being managed by.

Code changes were done to keep the atomicity of annotation create/delete to avoid inconsistent states when a lifecycle-manager shutdown could otherwise occur between API requests.

This allows us to run two instances of lifecycle-manager as our clusters have more than one underlying Node hardware with different draining and termination SLAs. Previsouly, on startup lifecycle-manager would "claim" all Nodes with the in-progress annotation and start enforcing its drain-timeouts upon them. This was causing unexpected behavior when more than one was running and in the process of draining a node one of the lifecycle-manager Pods got evicted and subsequently took over the Nodes' lifecycles.

As we differentiate our Node types and ASGs using different queues, the queue name seemed like the best item to annotate with so that lifecycle-manager can self-select Nodes on startup that it should be responsible for. In the event that only one lifecycle-manager is running in a cluster, this should behave as expected.

The only caveat I can think of is if you change the queue name you use for Node lifecycles on the ASG but do not update the lifecycle manager before hand.

codecov[bot] commented 3 years ago

Codecov Report

Merging #71 (2c0c2e4) into master (5a0d2e2) will increase coverage by 0.06%. The diff coverage is 80.95%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #71      +/-   ##
==========================================
+ Coverage   69.74%   69.81%   +0.06%     
==========================================
  Files          12       12              
  Lines         899      911      +12     
==========================================
+ Hits          627      636       +9     
- Misses        212      215       +3     
  Partials       60       60              
Impacted Files Coverage Δ
pkg/service/server.go 59.22% <66.66%> (+0.18%) :arrow_up:
pkg/service/nodes.go 83.33% <100.00%> (+0.61%) :arrow_up:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 5a0d2e2...2c0c2e4. Read the comment docs.

mrak commented 3 years ago

Updated to remove the Goto label and to proceed on Nodes missing the queue name annotation

eytan-avisror commented 3 years ago

Thanks! Looks good and a seems like a great feature 🥇 Can you just do a small manual test to make sure we are not causing abandons when we upgrade if there is a node in terminating state? code seems like it will handle it correctly now, but it might be a good test to run.

mrak commented 3 years ago

Roger that. We tested the following:

  1. launched a single lifecycle-manager deployment at 0.4.3 (existing release).
  2. initiated an ASG refresh
  3. Waited for at least one Node SQS event to be received by lifecycle manager and for it to annotate the Node as in-progress.
  4. Launched our fork of lifecycle-manager with this PR code.
  5. Witnessed the new lifecycle-manager pick up ownership of the in-progress Node and take it to drain completion.