Closed — jhalterman closed this 11 months ago
I'd like to propose an alternative that I think would be a little simpler and wouldn't involve modifying the statefulset annotations. Instead of adding a lock annotation for the prepare-downscale webhook to check, we put roughly the same logic into the prepare-downscale webhook that the controller would use in your plan. In pseudocode, the prepare-downscale webhook would check:
```python
def is_being_rolled_out(statefulsets) -> bool:
    for sts in statefulsets:
        if not (sts.state.replicas == sts.state.updated == sts.state.ready):
            return True
    return False
```
The prepare-downscale webhook would run this whenever it's called and reject the request if there is a rollout happening. This has the advantage of disallowing downscaling as soon as the updates to the statefulsets (via PR + flux) are merged: all pods will no longer be considered "updated".
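A minimal sketch of how that admission decision could look, using a simplified stand-in for the StatefulSet status (the `StsStatus` record and `admit_downscale` helper here are illustrative assumptions, not the operator's actual API):

```python
from dataclasses import dataclass

@dataclass
class StsStatus:
    # Simplified stand-in for a StatefulSet's status; field names are
    # illustrative, not the real Kubernetes API fields.
    replicas: int
    updated: int
    ready: int

def is_being_rolled_out(statefulsets) -> bool:
    # A rollout is in progress if any statefulset in the group has pods
    # that are not yet both updated and ready.
    return any(
        not (sts.replicas == sts.updated == sts.ready)
        for sts in statefulsets
    )

def admit_downscale(statefulsets):
    # Hypothetical webhook decision: reject the downscale request while
    # any rollout is in progress anywhere in the group.
    if is_being_rolled_out(statefulsets):
        return False, "downscale rejected: rollout in progress"
    return True, "downscale allowed"
```

Because the check spans the whole group, a downscale of zone A is rejected even when only zone B or C is mid-rollout.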
I also think this would be a better fit for the way the rollout operator works: there's not really any "state" to a rollout, the rollout operator reads the state of the world and adjusts statefulsets accordingly.
I like that proposal. There is a race though, where if a rollout is in progress and gets back to a steady state where `sts.state.replicas == sts.state.updated == sts.state.ready`, then a downscale could start at the same time. This would cause the next rollout iteration to fail, and we'd be stuck mid-rollout.
One way we could mitigate this is by making sure each rollout iteration doesn't delete all of the pods - it could delete at most `max-unavailable - 1` pods per iteration. Only after the last iteration would all replicas be updated and ready again, at which point a downscale could be allowed.
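That batching rule could be sketched as follows (purely illustrative; the real operator's `max-unavailable` handling may differ, and the discussion below concludes this mitigation isn't actually needed):

```python
def next_batch_size(not_updated: int, max_unavailable: int) -> int:
    # Illustrative batching rule: delete at most max_unavailable - 1 pods
    # per iteration (but always at least 1 so the rollout makes progress),
    # so the group only returns to a fully updated-and-ready steady state
    # once the final iteration completes.
    if not_updated == 0:
        return 0
    return max(1, min(not_updated, max_unavailable - 1))
```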
> I like that proposal. There is a race though, where if a rollout is in progress and gets back to a steady state where `sts.state.replicas == sts.state.updated == sts.state.ready`, then a downscale could start at the same time. This would cause the next rollout iteration to fail, and we'd be stuck mid-rollout.
I'm not clear about how you think it would get back to steady state in the middle of a rollout. Once a statefulset is updated as part of a rollout, all the pods in that statefulset would be considered not "updated".
You're right. I was thinking that nodes were marked as updated in batches for some reason, but of course that's not the case.
At present the rollout operator prevents concurrent rollout operations, but doesn’t prevent concurrent downscale + rollout operations. This can happen, for example, if the controller is currently rolling zone B or C, and then something downscales zone A.
In order to guard against this, I propose using an annotation on the resource being scaled to represent a “lock” across the scaling group. Ex:
**Creating a scaling-lock**
When the rollout operator reconciles, it will first choose a resource, if any, that already has a scaling-lock, allowing reentrancy. If none exists, it will add a scaling-lock annotation to the sts when it begins rolling its pods. A scaling-lock should not be needed when the controller is just adjusting replicas up/down (last-downscaled already provides protection there), only when rolling pods.
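A sketch of that selection logic, assuming statefulsets are represented as simple dicts and using a made-up annotation key (both the key and the `needs_rolling` predicate are assumptions for illustration):

```python
# Hypothetical annotation key; the real name would be chosen when implementing.
LOCK_ANNOTATION = "grafana.com/scaling-lock"

def pick_statefulset_to_reconcile(statefulsets, needs_rolling):
    # Prefer a statefulset that already holds the scaling-lock, making
    # reconciliation reentrant after an interrupted rollout.
    for sts in statefulsets:
        if LOCK_ANNOTATION in sts["annotations"]:
            return sts
    # Otherwise, take the lock on the first statefulset that needs its
    # pods rolled, just before rolling begins.
    for sts in statefulsets:
        if needs_rolling(sts):
            sts["annotations"][LOCK_ANNOTATION] = "true"
            return sts
    return None
```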
**Validating the scaling-lock**
When the scaling-lock is present on any resource in the rollout-group, the downscale webhook will reject downscale attempts for resources that don't have the scaling lock. For example: if the scaling lock is present on the zone B statefulset, zone A's statefulset cannot be downscaled.
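The webhook-side check might look like this (same assumptions as above: dict-shaped statefulsets and a hypothetical annotation key):

```python
# Hypothetical annotation key; the real name would be chosen when implementing.
LOCK_ANNOTATION = "grafana.com/scaling-lock"

def downscale_allowed(target, rollout_group) -> bool:
    # Reject the downscale if any *other* member of the rollout-group
    # holds the scaling-lock; the lock holder itself may still be scaled.
    for sts in rollout_group:
        if LOCK_ANNOTATION in sts["annotations"] and sts is not target:
            return False
    return True
```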
**Removing the scaling-lock**
When the rollout-operator reconciles a resource and finds no non-ready pods and nothing left to do for it, it will clear the scaling-lock for that resource, if one is present.
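The release step is the simplest of the three; a sketch under the same assumptions (dict-shaped statefulsets, hypothetical annotation key, with the readiness and pending-work checks passed in as plain booleans):

```python
# Hypothetical annotation key; the real name would be chosen when implementing.
LOCK_ANNOTATION = "grafana.com/scaling-lock"

def maybe_release_lock(sts, all_pods_ready: bool, has_work: bool) -> None:
    # Clear the scaling-lock only once reconciliation sees no non-ready
    # pods and has nothing left to do for this resource.
    if all_pods_ready and not has_work:
        sts["annotations"].pop(LOCK_ANNOTATION, None)
```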
**Other alternatives**
While this could also be accomplished by extending the last-updated annotation to be updated by the rollout-operator's controller, that would block downscales for some time after rollouts are complete, which is unnecessary: we just want to prevent concurrent scaling operations.