keikoproj / upgrade-manager

Reliable, extensible rolling-upgrades of Autoscaling groups in Kubernetes
Apache License 2.0

upgrade never completes if ASG size is 0 #347

Closed: cspargo closed this issue 4 months ago

cspargo commented 1 year ago

Is this a BUG REPORT or FEATURE REQUEST?:

BUG REPORT

What happened:

When using cluster-autoscaler (and possibly in other scenarios), the ASG's desired size can be zero while there is still an instance in the ASG in the Standby state. When this happens, the reconcile loop repeatedly passes an empty batch to be rotated, and the rolling upgrade never completes. The remaining in-progress instance in Standby never gets drained or terminated.
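For illustration only, here is a minimal Go sketch (not the controller's actual code; the `Instance` type and `selectBatch` helper are hypothetical) of how a batch selection that only admits InService instances ends up empty when the only drifted instance is parked in Standby:

```go
package main

import "fmt"

// Instance is a simplified stand-in for an ASG instance record.
type Instance struct {
	ID             string
	LifecycleState string // e.g. "InService", "Standby"
	Drifted        bool
}

// selectBatch mimics a rotation-batch selection that only admits
// drifted InService instances, capped by maxUnavailable.
func selectBatch(instances []Instance, maxUnavailable int) []Instance {
	var batch []Instance
	for _, i := range instances {
		if i.Drifted && i.LifecycleState == "InService" && len(batch) < maxUnavailable {
			batch = append(batch, i)
		}
	}
	return batch
}

func main() {
	// Scenario from this report: desired capacity is 0, but one drifted
	// instance is still in Standby.
	instances := []Instance{
		{ID: "i-07c85de278a98ea6", LifecycleState: "Standby", Drifted: true},
	}
	batch := selectBatch(instances, 1)
	// Prints 0: nothing is ever drained or terminated, so reconcile loops forever.
	fmt.Printf("batch size: %d\n", len(batch))
}
```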

What you expected to happen:

The remaining instance gets drained and terminated.

How to reproduce it (as minimally and precisely as possible):

This only happens with an ASG that has a single instance. Trigger a rolling upgrade and wait for the replacement instance to join. At this point the existing instance is in Standby, while the ASG's desired size is still 1, which causes the ASG to launch a new instance.

To simulate the issue that we see, scale down the upgrade-manager controller before it sees the replacement instance as ready. Then wait for cluster-autoscaler to scale down the newly joined instance after its unused-node timeout expires (the default is 10 minutes). At this point the ASG size is 0, because cluster-autoscaler has scaled the node down, but there is still an instance in Standby. Then scale the upgrade-manager controller pod back up.
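To confirm the stuck state, something like the following rough sketch using aws-sdk-go (v1) can be used to check that the desired capacity is 0 while an instance is still in Standby; the ASG name "xxx" is a placeholder:

```go
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

func main() {
	sess := session.Must(session.NewSession())
	client := autoscaling.New(sess)

	out, err := client.DescribeAutoScalingGroups(&autoscaling.DescribeAutoScalingGroupsInput{
		AutoScalingGroupNames: []*string{aws.String("xxx")}, // placeholder ASG name
	})
	if err != nil {
		log.Fatal(err)
	}

	for _, asg := range out.AutoScalingGroups {
		fmt.Printf("desired capacity: %d\n", aws.Int64Value(asg.DesiredCapacity))
		for _, inst := range asg.Instances {
			// In the reported failure this prints one instance in "Standby"
			// even though the desired capacity above is 0.
			fmt.Printf("  %s: %s\n", aws.StringValue(inst.InstanceId), aws.StringValue(inst.LifecycleState))
		}
	}
}
```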

What happens then is that the rolling upgrade never completes. Logs like the following repeat, but nothing is actually terminated:

INFO    controllers.RollingUpgrade      ***Reconciling***
INFO    controllers.RollingUpgrade      operating on existing rolling upgrade   {"scalingGroup": "xxx", "update strategy": {"type":"randomUpdate","mode":"eager","maxUnavailable":1,"drainTimeout":2147483647}, "name": "instance-manager/yyy-20220211060814-14"}
INFO    controllers.RollingUpgrade      scaling group details   {"scalingGroup": "xxx", "desiredInstances": 0, "launchConfig": "", "name": "instance-manager/yyy-20220211060814-14"}
INFO    controllers.RollingUpgrade      checking if rolling upgrade is completed        {"name": "instance-manager/yyy-20220211060814-14"}
INFO    controllers.RollingUpgrade      rolling upgrade configured for forced refresh   {"instance": "i-07c85de278a98ea6", "name": "instance-manager/yyy-20220211060814-14"}
INFO    controllers.RollingUpgrade      drift detected in scaling group {"driftedInstancesCount/DesiredInstancesCount": "(1/0)", "name": "instance-manager/yyy-20220211060814-14"}
INFO    controllers.RollingUpgrade      selecting batch for rotation    {"batch size": 0, "name": "instance-manager/yyy-20220211060814-14"}
INFO    controllers.RollingUpgrade      found in-progress instances     {"instances": ["i-07c85de278a98ea6"]}
INFO    controllers.RollingUpgrade      rolling upgrade configured for forced refresh   {"instance": "i-07c85de278a98ea6", "name": "instance-manager/yyy-20220211060814-14"}
INFO    controllers.RollingUpgrade      rotating batch  {"instances": [], "name": "instance-manager/yyy-20220211060814-14"}
INFO    controllers.RollingUpgrade      no InService instances in the batch     {"batch": [], "instances(InService)": [], "name": "instance-manager/yyy-20220211060814-14"}
INFO    controllers.RollingUpgrade      waiting for desired nodes       {"name": "instance-manager/yyy-20220211060814-14"}
INFO    controllers.RollingUpgrade      desired nodes are ready {"name": "instance-manager/yyy-20220211060814-14"}
INFO    controllers.RollingUpgrade      instances drained successfully, terminating     {"name": "instance-manager/yyy-20220211060814-14"}
INFO    controllers.RollingUpgrade      ***Reconciling***

Environment: