stackabletech / commons-operator

Operator for common objects of the Stackable Data Platform

StatefulSet restarter always restarts replica 0 immediately after initial rollout #111

Open · siegfriedweber opened this issue 2 years ago

siegfriedweber commented 2 years ago

Current behaviour

New restarter-enabled StatefulSets have their replica 0 restarted after the initial rollout is complete.

Expected behaviour

The initial rollout should be completed "normally" with no extra restarts.

Why does this happen?

There is a race condition between Kubernetes' StatefulSet controller creating the first replica Pod and the commons-operator's StatefulSet restart controller adding the restart-trigger annotations. If the restarter loses the race, the first replica is created without that metadata, triggering a restart once it is added.
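To make the race concrete, here is a minimal sketch of the two pieces of metadata involved (the annotation value is taken from the original ticket below; everything else is illustrative, not a verbatim manifest):

```yaml
# Sketch of a restarter-enabled StatefulSet after the restart controller's
# first reconcile; abbreviated, not a complete manifest.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: simple-superset-node-default
  labels:
    restarter.stackable.tech/enabled: "true"  # set by the product operator at creation
spec:
  template:
    metadata:
      annotations:
        # Added afterwards by the restart controller (the referenced ConfigMap's
        # UID/resourceVersion). If replica 0 was already created without this
        # annotation, the pod template changes and replica 0 is restarted.
        configmap.restarter.stackable.tech/simple-superset-node-default: cf60300e-0c45-4ee2-b60c-de53b0084182/21998
```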

What can we do about it?

Add a mutating webhook (see the spike) that adds the relevant metadata. The webhook must not replace the existing controller, since webhook delivery is not reliable.

However, serving a webhook requires a bunch of extra infrastructure that we do not currently have, namely:

  1. We must implement the K8s webhook HTTPS API (https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/)
  2. We must generate a TLS certificate and write it into the MutatingWebhookConfiguration (MWC); a sketch follows below
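As a rough sketch of what such a MutatingWebhookConfiguration could look like (the webhook name, service name, namespace, and path are hypothetical, and the caBundle would come from the certificate generated in step 2):

```yaml
# Hypothetical MutatingWebhookConfiguration sketch; names and paths are
# illustrative, not the operator's actual configuration.
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: restarter.stackable.tech
webhooks:
  - name: statefulset.restarter.stackable.tech
    # Fail open: if the webhook is unreachable, the StatefulSet is admitted
    # unchanged and the existing restart controller catches up (at the cost
    # of the extra restart this issue describes).
    failurePolicy: Ignore
    rules:
      - apiGroups: ["apps"]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["statefulsets"]
    clientConfig:
      service:
        name: commons-operator-webhook  # hypothetical service name
        namespace: stackable
        path: /mutate
      caBundle: <base64-encoded CA certificate from step 2>
    admissionReviewVersions: ["v1"]
    sideEffects: None
```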

Definition of done

Original ticket:

The StatefulSet of a Superset cluster is immediately restarted after its creation. This should not be necessary and should be prevented.

```bash
$ kubectl describe statefulset simple-superset-node-default
...
Events:
  Type    Reason            Age                    From                    Message
  ----    ------            ----                   ----                    -------
  Normal  SuccessfulDelete  3m31s                  statefulset-controller  delete Pod simple-superset-node-default-0 in StatefulSet simple-superset-node-default successful
  Normal  SuccessfulCreate  2m59s (x2 over 3m31s)  statefulset-controller  create Pod simple-superset-node-default-0 in StatefulSet simple-superset-node-default successful
```

After the restart, the Superset pods are annotated as follows:

```yaml
annotations:
  configmap.restarter.stackable.tech/simple-superset-node-default: cf60300e-0c45-4ee2-b60c-de53b0084182/21998
  secret.restarter.stackable.tech/simple-superset-credentials: e0e4b781-46f9-44f4-80c8-a0876b91ed8b/16909
```

This could be an indication that the restart controller of the commons-operator is involved. The commons-operator is busy while the StatefulSet is restarted:

```
2022-09-16T10:02:50.228971Z INFO stackable_operator::logging::controller: Reconciled object controller.name="pod.restarter.commons.stackable.tech" object=Pod.v1./simple-superset-node-default-0.default
2022-09-16T10:02:50.236144Z INFO stackable_operator::logging::controller: Reconciled object controller.name="pod.restarter.commons.stackable.tech" object=Pod.v1./simple-superset-node-default-0.default
2022-09-16T10:02:50.239371Z INFO stackable_operator::logging::controller: Reconciled object controller.name="statefulset.restarter.commons.stackable.tech" object=StatefulSet.v1.apps/simple-superset-node-default.default
2022-09-16T10:02:50.255219Z INFO stackable_operator::logging::controller: Reconciled object controller.name="statefulset.restarter.commons.stackable.tech" object=StatefulSet.v1.apps/simple-superset-node-default.default
2022-09-16T10:02:50.258511Z INFO stackable_operator::logging::controller: Reconciled object controller.name="pod.restarter.commons.stackable.tech" object=Pod.v1./simple-superset-node-default-0.default
2022-09-16T10:02:50.266146Z INFO stackable_operator::logging::controller: Reconciled object controller.name="pod.restarter.commons.stackable.tech" object=Pod.v1./simple-superset-node-default-0.default
2022-09-16T10:02:50.274647Z INFO stackable_operator::logging::controller: Reconciled object controller.name="statefulset.restarter.commons.stackable.tech" object=StatefulSet.v1.apps/simple-superset-node-default.default
2022-09-16T10:02:51.433621Z INFO stackable_operator::logging::controller: Reconciled object controller.name="pod.restarter.commons.stackable.tech" object=Pod.v1./simple-superset-node-default-0.default
2022-09-16T10:02:51.449009Z INFO stackable_operator::logging::controller: Reconciled object controller.name="statefulset.restarter.commons.stackable.tech" object=StatefulSet.v1.apps/simple-superset-node-default.default
```
siegfriedweber commented 2 years ago

Actually this is how the restart controller is supposed to work. The Superset operator creates a stateful set labeled with restarter.stackable.tech/enabled=true. The restart controller of the commons operator watches this labeled stateful set and adds the UIDs and resource versions of the referenced config maps and secrets as annotations to the pod template. This triggers a restart of the stateful set right after its first creation. The commons operator watches these config maps and secrets and updates the annotations whenever these resources change which in turn triggers a restart. This ensures that Superset always runs with the latest configuration.

Even if the immediate restart is confusing, I do not see an elegant solution to prevent this, so I propose to close this issue.

fhennig commented 2 years ago

If the outcome is "this is how it is supposed to work", then it might not be a bug, but we should still go back to the drawing board and look at what we've built, because this immediate restart is really something that shouldn't happen.

Maybe the title of the issue, and also its place in this repo, is no longer correct.

After talking to Sigi, I'm moving it back to "Refinement: In Progress".

vsupalov commented 2 years ago

Taking a step back. Outcomes from the refinement discussion:

Observed Situation

Investigation Worthy Threads

Possible Next Steps

vsupalov commented 2 years ago

@soenkeliebau @lfrancke unclear how to proceed here, some input would be great. Do we leave things as they are here, or do we pursue one of the investigation threads?

This ticket can be closed in favour of these other investigation threads.

nightkr commented 1 year ago

That the commons-operator is reconciling doesn't really say much, that'll happen whenever the STS is modified externally too. If you want to check whether it has triggered a restart then you'll want to check the STS YAMLs for whether the restarter annotations change.

I'd also argue that the restart controller isn't the cause of the problem here, just a symptom. The restart controller will trigger when the app's config changes. Check for why that happens if you want to avoid this.

nightkr commented 1 year ago

(Case in point, if the restart controller is at fault then this will happen for all product operators, not just Superset...)

lfrancke commented 1 year ago

Thanks Teo.

> The restart controller of the commons operator watches this labeled stateful set and adds the UIDs and resource versions of the referenced config maps and secrets as annotations to the pod template. This triggers a restart of the stateful set right after its first creation.

I'm a bit confused now, though. This sounds like the answer that you were looking for, no? I must be missing something still. I thought the restart happens because the restart controller changes the configmaps which in turn triggers a restart of the STS?

nightkr commented 1 year ago

The restart controller just watches the configmaps for changes, and triggers STS restarts based on that. The restart controller is just a messenger for the actual issue (that the CM changed). It never modifies the CMs itself.

lfrancke commented 1 year ago

Understood, thank you. So, the described behavior has to come from somewhere else then.

@siegfriedweber @fhennig @vsupalov - sorry, back to you for now?

siegfriedweber commented 1 year ago

> If you want to check whether it has triggered a restart then you'll want to check the STS YAMLs for whether the restarter annotations change.

The restarter annotations are initially added to the STS YAMLs by the restart controller. This triggers the restart.

> The restart controller will trigger when the app's config changes.

The restart controller also triggers when a new StatefulSet with the label restarter.stackable.tech/enabled=true is created.

> (Case in point, if the restart controller is at fault then this will happen for all product operators, not just Superset...)

It happens for all StatefulSets which are labeled with restarter.stackable.tech/enabled=true. That is, it also happens for the Airflow operator. The other operators do not use this label yet.

I think that I accurately described the issue in my first comment:

> Actually this is how the restart controller is supposed to work. The Superset operator creates a stateful set labeled with restarter.stackable.tech/enabled=true. The restart controller of the commons operator watches this labeled stateful set and adds the UIDs and resource versions of the referenced config maps and secrets as annotations to the pod template. This triggers a restart of the stateful set right after its first creation. The commons operator watches these config maps and secrets and updates the annotations whenever these resources change which in turn triggers a restart. This ensures that Superset always runs with the latest configuration.

Please tell me if I am wrong.

soenkeliebau commented 1 year ago

@siegfriedweber and @teozkr will have a chat about this

nightkr commented 1 year ago

Hm, the label "should" still be added quickly enough that the pods shouldn't have been scheduled yet, but maybe my memory fails me on how much of a window we have; I will have to do some testing tomorrow... If that is indeed the sad case, then a fail-open mutating webhook probably makes sense at some point (since we'd still need the controller as well, the worst that could happen would be falling back to the current extra restart), and it should be completely transparent to the operators.
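In MutatingWebhookConfiguration terms, "fail-open" corresponds to failurePolicy: Ignore (a minimal fragment; the webhook name is hypothetical):

```yaml
# Fragment of a MutatingWebhookConfiguration; "Ignore" means an unreachable
# webhook never blocks StatefulSet creation, so the worst case is falling
# back to the current extra restart.
webhooks:
  - name: statefulset.restarter.stackable.tech
    failurePolicy: Ignore
```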

The downside is that doing a first webhook means doing a bunch of new scaffolding from scratch (certificates, kube API registration, etc)...

nightkr commented 1 year ago

Ok, having tried it out now, it does look like the first replica is restarted, while all other replicas are created late enough that they are up to date. Avoiding that glitch will probably require us to do the webhook...

nightkr commented 1 year ago

I have validated that a mutating webhook is able to avoid the initial restart for a simple STS... https://github.com/stackabletech/commons-operator/commit/0cbf72c90741fd618a9e97266f8fdd66c20ee544

Moving this ticket to commons since this is a bug in the restarter, not in superset-op specifically.

fhennig commented 1 year ago

So, is the mutating webhook the accepted solution to this problem? I see that this ticket is in refinement acceptance; I'd like to know the up- and downsides of this solution. Is there another way? How bad is:

> The downside is that doing a first webhook means doing a bunch of new scaffolding from scratch (certificates, kube API registration, etc)...

I personally wouldn't call this ticket refined, as I wouldn't know what to do. The ticket doesn't have tasks or even acceptance criteria.

lfrancke commented 1 year ago

I agree, this came up during the acceptance. So I'll move it back for now.

lfrancke commented 1 year ago

A few questions:

  1. Do I understand it correctly that this will "only" fix the issue in most cases since delivery of webhooks is not reliable?
  2. Do webhooks require any kind of extra privileges that we currently don't need?
  3. Do webhooks work on OpenShift?
  4. Does this need to be presented to the team again? I assume no because I believe this is the outcome of the last team session, right?
nightkr commented 1 year ago

> Do I understand it correctly that this will "only" fix the issue in most cases since delivery of webhooks is not reliable?

Yes, good old CAP strikes again. There's no way that we could possibly hook into an API in a way that is simultaneously:

  1. Synchronous (happens before the write completes)
  2. Exhaustive (never misses any events)
  3. Reliable (doesn't bring down (parts of) the hooked API when it fails)

We could technically sacrifice any of these, but (to me) 2 and 3 are both far more vital than 1.

> Do webhooks require any kind of extra privileges that we currently don't need?

The operator and installer both need permission to work with MutatingWebhookConfiguration objects. Additionally, webhooks do introduce the risk of bugs and/or misconfiguration wreaking havoc on the larger system.

This design would allow users who are uncomfortable with running our webhook to simply not deploy the MWC, at the cost of still having to deal with this extra restart.
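For reference, the extra permission mentioned above amounts to roughly the following ClusterRole rule (a sketch; the role name is hypothetical):

```yaml
# Hypothetical ClusterRole fragment granting the operator/installer the
# extra privilege discussed above; the role name is illustrative.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: commons-operator-webhook-admin
rules:
  - apiGroups: ["admissionregistration.k8s.io"]
    resources: ["mutatingwebhookconfigurations"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
```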

> Do webhooks work on OpenShift?

I haven't tested the spike against OpenShift specifically, but Google seems to believe yes. I also remember noticing that certain parts of OpenShift itself are implemented using webhooks.

> Does this need to be presented to the team again? I assume no because I believe this is the outcome of the last team session, right?

It might be worth bringing up at tomorrow's arch meeting. I'll add it to the list.

lfrancke commented 1 year ago

> Yes, good old CAP strikes again. There's no way that we could possibly hook into an API in a way that is simultaneously:

That's fine for me, as the worst that can happen is that we get the extra restart, right? Similar to the answer to question two.

Thank you. If the team is fine with this then I am as well.

fhennig commented 1 year ago

I think we didn't really discuss this broadly, I'll put it on the agenda for tomorrow, so we can have a brief look

nightkr commented 1 year ago

> That's fine for me, as the worst that can happen is that we get the extra restart, right?

Yes.

nightkr commented 1 year ago

After some discussion with @sbernauer and @siegfriedweber we decided to push on with the webhook approach.

lfrancke commented 1 year ago

So this means refinement is done?

nightkr commented 1 year ago

I'd say so, yes. There's still the question of how we want to prioritize this, but I guess that's your department... :P

sbernauer commented 1 year ago

Teo did a handover with Siggi