Process drain-failure nodes at the end

keikoproj / upgrade-manager

Reliable, extensible rolling-upgrades of Autoscaling groups in Kubernetes

Apache License 2.0

Problem: Currently, when upgrade-manager encounters a node that fails to drain, it gets stuck and does not drain any other nodes until the failed node is repaired. This can cause significant delays if the drain-failed node takes a long time to fix.

Proposal: Upgrade-manager skips drain-failed nodes and returns to them after all other nodes in the InstanceGroup have been rotated.

Changes introduced by this PR:

Marks instances that fail to drain with a failed-drain value
Upgrade-manager skips failed-drain instances while selecting the target and moves on to other instances
Once no healthy instances are left to rotate, it tries to rotate the failed-drain instances again
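
The selection behavior described in the changes above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `Instance` type, its `State` field, the `failedDrain` constant, and the `selectTargets` function are all hypothetical names, and how upgrade-manager actually stores the failed-drain marker is not shown here.

```go
package main

// Instance is a hypothetical, simplified view of an instance awaiting rotation.
// State holds an assumed marker value: empty for a normal instance,
// "failed-drain" for one whose drain attempt failed.
type Instance struct {
	ID    string
	State string
}

// failedDrain is the assumed marker value for instances that failed to drain.
const failedDrain = "failed-drain"

// selectTargets returns up to batchSize instances to rotate next.
// Instances marked failed-drain are skipped while any unmarked instance
// remains; once none remain, the failed-drain instances are retried.
func selectTargets(pending []Instance, batchSize int) []Instance {
	if batchSize < 0 {
		batchSize = 0
	}

	// Partition the pending instances by their marker.
	var healthy, failed []Instance
	for _, inst := range pending {
		if inst.State == failedDrain {
			failed = append(failed, inst)
		} else {
			healthy = append(healthy, inst)
		}
	}

	// Prefer unmarked instances; fall back to failed-drain ones at the end.
	candidates := healthy
	if len(candidates) == 0 {
		candidates = failed
	}
	if batchSize < len(candidates) {
		candidates = candidates[:batchSize]
	}
	return candidates
}
```

With this ordering, a single node that cannot be drained no longer blocks the rest of the group. It is revisited only after every other pending instance has been handled.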

Codecov Report

Merging #394 (aaa0843) into master (7c245ca) will not change coverage. The diff coverage is 66.66%.

@@           Coverage Diff           @@
##           master     #394   +/-   ##
=======================================
  Coverage   39.09%   39.09%           
=======================================
  Files           7        7           
  Lines         931      931           
=======================================
  Hits          364      364           
  Misses        540      540           
  Partials       27       27

Flag	Coverage Δ
unittests	`39.09% <66.66%> (ø)`

Flags with carried forward coverage won't be shown.

Files	Coverage Δ
controllers/cloud.go	`65.38% <ø> (ø)`
controllers/upgrade.go	`46.21% <66.66%> (ø)`
