Is this a BUG REPORT or FEATURE REQUEST?:
BUG REPORT
What happened:
When using cluster-autoscaler (and possibly in other scenarios), the ASG's desired size can be zero while an instance still remains in the ASG in Standby state. When this happens, the reconcile loop repeatedly passes an empty batch to be rotated, and the rolling upgrade never completes. The remaining in-progress instance in Standby is never drained or terminated.
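A minimal sketch of what appears to be the failure mode (hypothetical names, not the actual upgrade-manager code): if the rotation batch size is capped by the ASG's desired capacity, a desired capacity of 0 always produces an empty batch, even though a drifted Standby instance still exists.

```go
// Hypothetical sketch of the suspected batch-size calculation; not the
// actual upgrade-manager code. With desiredCapacity == 0 the batch is
// always empty, so the Standby instance is never selected for rotation.
func selectBatchSize(maxUnavailable, desiredCapacity int) int {
	if maxUnavailable > desiredCapacity {
		// Returns 0 when the ASG was scaled to zero, which matches
		// the `"batch size": 0` line in the logs below.
		return desiredCapacity
	}
	return maxUnavailable
}
```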
What you expected to happen:
The remaining instance gets drained and terminated.
How to reproduce it (as minimally and precisely as possible):
This only happens with an ASG that has a single instance. Trigger a rolling upgrade and wait for the replacement instance to join. (At this point the existing instance is in Standby state and the ASG's desired size is still 1, which is what causes the ASG to launch the replacement.)
To simulate the issue we see, scale down the upgrade-manager controller before it sees the replacement instance as ready. Then wait for cluster-autoscaler to scale down the newly joined instance after its unused-node timeout expires (default is 10 minutes). At this point the ASG's desired size is 0, because cluster-autoscaler has scaled down the node, but there is still an instance in Standby (the sketch below can confirm this state). Then scale the upgrade-manager controller pod back up.
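To confirm the inconsistent state, a snippet like the following can be used (a sketch using aws-sdk-go; the ASG name is a placeholder). In the broken state it prints a desired capacity of 0 alongside an instance whose lifecycle state is "Standby".

```go
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

func main() {
	svc := autoscaling.New(session.Must(session.NewSession()))
	out, err := svc.DescribeAutoScalingGroups(&autoscaling.DescribeAutoScalingGroupsInput{
		AutoScalingGroupNames: []*string{aws.String("my-asg")}, // placeholder ASG name
	})
	if err != nil {
		log.Fatal(err)
	}
	for _, g := range out.AutoScalingGroups {
		// In the broken state: DesiredCapacity is 0, yet one instance
		// is still listed with LifecycleState "Standby".
		fmt.Printf("desired capacity: %d\n", aws.Int64Value(g.DesiredCapacity))
		for _, i := range g.Instances {
			fmt.Printf("  %s: %s\n", aws.StringValue(i.InstanceId), aws.StringValue(i.LifecycleState))
		}
	}
}
```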
From here the rolling upgrade never completes. Logs like the following repeat on every reconcile, but nothing is actually drained or terminated:
INFO controllers.RollingUpgrade ***Reconciling***
INFO controllers.RollingUpgrade operating on existing rolling upgrade {"scalingGroup": "xxx", "update strategy": {"type":"randomUpdate","mode":"eager","maxUnavailable":1,"drainTimeout":2147483647}, "name": "instance-manager/yyy-20220211060814-14"}
INFO controllers.RollingUpgrade scaling group details {"scalingGroup": "xxx", "desiredInstances": 0, "launchConfig": "", "name": "instance-manager/yyy-20220211060814-14"}
INFO controllers.RollingUpgrade checking if rolling upgrade is completed {"name": "instance-manager/yyy-20220211060814-14"}
INFO controllers.RollingUpgrade rolling upgrade configured for forced refresh {"instance": "i-07c85de278a98ea6", "name": "instance-manager/yyy-20220211060814-14"}
INFO controllers.RollingUpgrade drift detected in scaling group {"driftedInstancesCount/DesiredInstancesCount": "(1/0)", "name": "instance-manager/yyy-20220211060814-14"}
INFO controllers.RollingUpgrade selecting batch for rotation {"batch size": 0, "name": "instance-manager/yyy-20220211060814-14"}
INFO controllers.RollingUpgrade found in-progress instances {"instances": ["i-07c85de278a98ea6"]}
INFO controllers.RollingUpgrade rolling upgrade configured for forced refresh {"instance": "i-07c85de278a98ea6", "name": "instance-manager/yyy-20220211060814-14"}
INFO controllers.RollingUpgrade rotating batch {"instances": [], "name": "instance-manager/yyy-20220211060814-14"}
INFO controllers.RollingUpgrade no InService instances in the batch {"batch": [], "instances(InService)": [], "name": "instance-manager/yyy-20220211060814-14"}
INFO controllers.RollingUpgrade waiting for desired nodes {"name": "instance-manager/yyy-20220211060814-14"}
INFO controllers.RollingUpgrade desired nodes are ready {"name": "instance-manager/yyy-20220211060814-14"}
INFO controllers.RollingUpgrade instances drained successfully, terminating {"name": "instance-manager/yyy-20220211060814-14"}
INFO controllers.RollingUpgrade ***Reconciling***
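One possible direction for a fix (a sketch only, using aws-sdk-go and a hypothetical helper; not a claim about how the controller should be changed): when the selected batch is empty but an in-progress instance is stuck in Standby, move it back to InService so the normal drain-and-terminate path can pick it up on the next reconcile.

```go
// Hypothetical helper, not actual upgrade-manager code: return a stuck
// Standby instance to service. ExitStandby increments the ASG's desired
// capacity, after which the instance is InService again and can be
// selected, drained, and terminated by the normal rotation batch.
func rescueStandbyInstance(svc *autoscaling.AutoScaling, asgName, instanceID string) error {
	_, err := svc.ExitStandby(&autoscaling.ExitStandbyInput{
		AutoScalingGroupName: aws.String(asgName),
		InstanceIds:          []*string{aws.String(instanceID)},
	})
	return err
}
```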
Environment: