grafana / rollout-operator

Kubernetes Rollout Operator

Guard against a concurrent downscaling and rollout #79

Closed jhalterman closed 11 months ago

jhalterman commented 1 year ago

At present the rollout-operator prevents concurrent rollout operations, but doesn't prevent a downscale from happening concurrently with a rollout. This can happen, for example, if the controller is currently rolling zone B or C and something then downscales zone A.

In order to guard against this, I propose using an annotation on the resource being scaled to represent a “lock” across the scaling group. Ex:

grafana.com/scaling-lock: rollout

Creating a scaling-lock

When the rollout operator reconciles, it will first reconcile a resource, if any, that already holds a scaling lock, allowing reentrancy. If none holds one, it will add a scaling-lock annotation to the sts when it begins rolling its pods. A scaling lock should only be needed when rolling pods, though, not when the controller is just adjusting replicas up or down, since last-downscaled already provides protection there.
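As a rough illustration only, here's what acquiring the lock could look like in Python with the kubernetes client (the operator itself is written in Go; SCALING_LOCK and acquire_scaling_lock are hypothetical names):

from kubernetes import client

SCALING_LOCK = "grafana.com/scaling-lock"  # proposed annotation key

def acquire_scaling_lock(apps: client.AppsV1Api, namespace: str, group, target: str) -> str:
    # Reconcile a StatefulSet that already holds the lock first (reentrancy).
    for sts in group:
        if SCALING_LOCK in (sts.metadata.annotations or {}):
            return sts.metadata.name
    # No lock is held: annotate the StatefulSet whose pods we're about to roll.
    patch = {"metadata": {"annotations": {SCALING_LOCK: "rollout"}}}
    apps.patch_namespaced_stateful_set(target, namespace, patch)
    return target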

Validating the scaling-lock

When the scaling-lock is present on any resource in the rollout-group, the downscale webhook will reject downscale attempts for resources that don't have the scaling lock. For example: if the scaling lock is present on the zone B statefulset, zone A's statefulset cannot be downscaled.
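A sketch of that validation, reusing the SCALING_LOCK constant from the sketch above (allow_downscale is a hypothetical name):

def allow_downscale(group, target: str) -> bool:
    # Reject the downscale if any other member of the rollout group holds
    # the lock, e.g. zone B holds it, so zone A may not downscale.
    for sts in group:
        holds_lock = SCALING_LOCK in (sts.metadata.annotations or {})
        if holds_lock and sts.metadata.name != target:
            return False
    return True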

Removing the scaling-lock

When the rollout-operator reconciles a resource, sees no non-ready pods, and has nothing left to do for that resource, it will clear the scaling-lock on that resource if one is present.
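A sketch of that cleanup, assuming "nothing to do" means every replica is updated and ready; a merge patch with a null value removes the annotation:

def maybe_release_scaling_lock(apps, namespace: str, sts) -> None:
    s = sts.status
    done = s.replicas == s.updated_replicas == s.ready_replicas
    if done and SCALING_LOCK in (sts.metadata.annotations or {}):
        # A null value in the patch deletes the annotation key.
        patch = {"metadata": {"annotations": {SCALING_LOCK: None}}}
        apps.patch_namespaced_stateful_set(sts.metadata.name, namespace, patch)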

Other alternatives

While this could also be accomplished by extending the last-updated annotation to be updated by the rollout-operator’s controller, that would block downscales from taking place for some time after rollouts are complete, which is not necessary. We just want to prevent concurrent scaling from occurring.

56quarters commented 1 year ago

I'd like to propose an alternative that I think would be a little simpler and wouldn't involve modifying the statefulset annotations.

Instead of adding a lock annotation for the prepare-downscale webhook to check, we put roughly the same logic into the prepare-downscale webhook that the controller would use in your plan.

In pseudocode, the prepare-downscale hook would check:

def is_being_rolled_out(statefulsets) -> bool:
    # A StatefulSet is mid-rollout unless every replica is both updated
    # to the current revision and ready.
    for sts in statefulsets:
        s = sts.status
        if not (s.replicas == s.updated_replicas == s.ready_replicas):
            return True
    return False

The prepare-downscale webhook would run this check whenever it's called and reject the request if a rollout is in progress. This has the advantage of disallowing downscaling as soon as the statefulset updates (via PR + flux) are merged: none of the pods will be considered "updated" anymore.
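For concreteness, the webhook side might apply the check above roughly like this, assuming the group members share the operator's rollout-group label (reject_if_rolling_out is a hypothetical name, and the raise stands in for returning an admission denial):

from kubernetes import client, config

def reject_if_rolling_out(namespace: str, rollout_group: str) -> None:
    config.load_incluster_config()
    apps = client.AppsV1Api()
    # Fetch every StatefulSet in the same rollout group as the one being downscaled.
    stss = apps.list_namespaced_stateful_set(
        namespace, label_selector=f"rollout-group={rollout_group}")
    if is_being_rolled_out(stss.items):
        raise RuntimeError("downscale denied: rollout in progress")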

I also think this would be a better fit for the way the rollout operator works: there's not really any "state" to a rollout, the rollout operator reads the state of the world and adjusts statefulsets accordingly.

jhalterman commented 1 year ago

I like that proposal. There is a race, though: if a rollout is in progress and gets back to a steady state where replicas == updated == ready, a downscale could start at the same time. This would cause the next rollout iteration to fail, and we'd be stuck mid-rollout.

One way we could mitigate this is by making sure each rollout iteration doesn't delete all of the pods it's allowed to: it could delete max-unavailable - 1 instead. Only after the last iteration would all replicas be updated and ready again, at which point a downscale could be allowed.
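A sketch of that per-iteration budget (deletion_budget is a hypothetical name; the max() keeps the rollout making progress when max-unavailable is 1):

def deletion_budget(max_unavailable: int, not_updated: int) -> int:
    # Final iteration: delete everything that's left, after which all
    # replicas become updated and ready and downscales are allowed again.
    if not_updated <= max_unavailable:
        return not_updated
    # Earlier iterations: delete one fewer pod than allowed.
    return max(max_unavailable - 1, 1)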

56quarters commented 1 year ago

> I like that proposal. There is a race, though: if a rollout is in progress and gets back to a steady state where replicas == updated == ready, a downscale could start at the same time. This would cause the next rollout iteration to fail, and we'd be stuck mid-rollout.

I'm not clear about how you think it would get back to steady state in the middle of a rollout. Once a statefulset is updated as part of a rollout, all the pods in that statefulset would be considered not "updated".

jhalterman commented 1 year ago

You're right. I was thinking that pods were marked as updated in batches for some reason, but of course that's not the case.