Closed mrak closed 3 years ago
Merging #71 (2c0c2e4) into master (5a0d2e2) will increase coverage by 0.06%. The diff coverage is 80.95%.
```diff
@@            Coverage Diff             @@
##           master      #71      +/-   ##
==========================================
+ Coverage   69.74%   69.81%   +0.06%
==========================================
  Files          12       12
  Lines         899      911      +12
==========================================
+ Hits          627      636       +9
- Misses        212      215       +3
  Partials       60       60
```
| Impacted Files | Coverage Δ | |
|---|---|---|
| pkg/service/server.go | 59.22% <66.66%> (+0.18%) | :arrow_up: |
| pkg/service/nodes.go | 83.33% <100.00%> (+0.61%) | :arrow_up: |
Continue to review full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5a0d2e2...2c0c2e4. Read the comment docs.
Updated to remove the goto label and to proceed on Nodes that are missing the queue-name annotation.
Thanks! Looks good and seems like a great feature 🥇 Can you just do a small manual test to make sure we are not causing abandons when we upgrade while a node is in a terminating state? The code seems like it will handle it correctly now, but it might be a good test to run.
Roger that. We tested the following on 0.4.3 (the existing release).
Description
No argument or functionality changes. This is an internal machinery change that annotates `Node` objects with another annotation (in addition to the original in-progress annotation) signifying the SQS queue name that the Node's lifecycle is being managed by. Code changes were made to keep annotation create/delete atomic, avoiding inconsistent states that could otherwise occur if a lifecycle-manager shutdown happened between API requests.
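The atomicity described above can be sketched as a single JSON merge patch that sets, or with nil values clears, both annotations in one API request. This is a minimal, self-contained sketch with hypothetical annotation keys; the actual keys and client-go plumbing in lifecycle-manager may differ:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// buildAnnotationPatch returns a JSON merge patch (RFC 7386, which the
// Kubernetes API supports) that sets both annotations together, or removes
// both when nil values are passed. Because the patch is one API request,
// a shutdown can never leave a Node with only one of the two annotations.
func buildAnnotationPatch(inProgress, queueName *string) ([]byte, error) {
	patch := map[string]interface{}{
		"metadata": map[string]interface{}{
			"annotations": map[string]*string{
				// Hypothetical keys, for illustration only.
				"lifecycle-manager/in-progress": inProgress,
				"lifecycle-manager/queue-name":  queueName,
			},
		},
	}
	return json.Marshal(patch)
}

func main() {
	inProgress, queue := "true", "my-lifecycle-queue"
	add, _ := buildAnnotationPatch(&inProgress, &queue)
	fmt.Println(string(add))

	// nil values marshal to null, which a JSON merge patch treats as
	// "delete this key" — so both annotations are removed atomically too.
	del, _ := buildAnnotationPatch(nil, nil)
	fmt.Println(string(del))
}
```

The resulting bytes would be sent via a single `Patch` call on the Node object, rather than two separate update requests.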
This allows us to run two instances of lifecycle-manager, as our clusters have more than one underlying Node hardware type with different draining and termination SLAs. Previously, on startup lifecycle-manager would "claim" all Nodes with the in-progress annotation and start enforcing its drain timeouts on them. This caused unexpected behavior when more than one instance was running: if a lifecycle-manager Pod was evicted while in the process of draining a node, another instance would subsequently take over the Node's lifecycle.
As we differentiate our Node types and ASGs using different queues, the queue name seemed like the best item to annotate with, so that lifecycle-manager can self-select on startup the Nodes it should be responsible for. In the event that only one lifecycle-manager is running in a cluster, this should behave as expected.
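The startup self-selection described above can be sketched as a filter over Node annotations. The key name and types here are hypothetical stand-ins, not lifecycle-manager's actual code:

```go
package main

import "fmt"

// Hypothetical annotation key; the real key used by lifecycle-manager may differ.
const queueAnnotation = "lifecycle-manager/queue-name"

// node is a minimal stand-in for a Kubernetes Node's metadata.
type node struct {
	Name        string
	Annotations map[string]string
}

// selectOwnedNodes returns only the nodes whose queue-name annotation matches
// this instance's configured queue, so each lifecycle-manager resumes
// enforcing drain timeouts only on the Nodes it was already managing.
func selectOwnedNodes(nodes []node, queue string) []node {
	var owned []node
	for _, n := range nodes {
		if n.Annotations[queueAnnotation] == queue {
			owned = append(owned, n)
		}
	}
	return owned
}

func main() {
	nodes := []node{
		{Name: "gpu-node-1", Annotations: map[string]string{queueAnnotation: "gpu-queue"}},
		{Name: "cpu-node-1", Annotations: map[string]string{queueAnnotation: "cpu-queue"}},
		{Name: "fresh-node", Annotations: map[string]string{}},
	}
	// An instance configured with "gpu-queue" claims only gpu-node-1;
	// the cpu-queue instance and unannotated nodes are left alone.
	for _, n := range selectOwnedNodes(nodes, "gpu-queue") {
		fmt.Println(n.Name)
	}
}
```

With a single instance, every in-progress Node carries that instance's queue name, so the filter selects everything it previously claimed, matching the old behavior.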
The only caveat I can think of is changing the queue name used for Node lifecycles on the ASG without updating the lifecycle-manager beforehand.