redpanda-data / redpanda-operator


🐛 Operator gets stuck with bad configurations #196

Open c4milo opened 1 month ago

c4milo commented 1 month ago

What happened?

Whenever a configuration change results in a Redpanda pod falling into an unschedulable or crashloop state, it is impossible to correct the situation by only fixing the CR values. The new values are accepted, but they are not reconciled by the operator, and the StatefulSet keeps using the wrong configuration.

See screen recording in: https://redpandadata.slack.com/archives/C01H6JRQX1S/p1723752154395579?thread_ts=1723751900.722069&cid=C01H6JRQX1S

What did you expect to happen?

If we make a mistake configuring container resources and/or limits in the Redpanda Custom Resource (CR), or in any other setting that results in a broker crashlooping, we want to be able to correct it through the Redpanda CR and see the change applied by the operator immediately, without delay.

How can we reproduce it (as minimally and precisely as possible)? Please include values file.

```console
$ helm get values -n --all
# paste output here
```

Anything else we need to know?

No response

Which are the affected charts?

Operator

Chart Version(s)

```console
5.9.0
```

Cloud provider

Azure

JIRA Link: K8S-323

JIRA Link: K8S-324

chrisseto commented 3 weeks ago

So this is a bit nastier than I thought. I was under the impression that we could just set force in the upgrade spec, but the operator itself won't update the helm release if it sees that it's unhealthy, which makes this even harder to get out of.

https://github.com/redpanda-data/redpanda-operator/blob/72ba3d3c4382c556259a49ef291204e65574d6fa/src/go/k8s/internal/controller/redpanda/redpanda_controller.go#L523-L527

For reference, this is how to set force, but it doesn't really do anything given the operator's behavior.

  chartRef:
    upgrade:
      force: true

I'd vote to change the behavior to just always update the helm release regardless of its existing status, since the current behavior prevents users from fixing forward. @RafalKorepta WDYT?
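To make the difference concrete, here is a minimal sketch of the two decision rules being discussed. It is not the operator's actual code: `releaseState`, `shouldUpdateGated`, and `shouldUpdateAlways` are hypothetical names, and the health check is reduced to a single boolean for illustration.

```go
package main

import "fmt"

// releaseState is a hypothetical, simplified view of what a reconciler knows
// about the Helm release. It is not a type from the operator.
type releaseState struct {
	Healthy     bool // last observed release status was ready
	SpecChanged bool // the Redpanda CR differs from what the release was built from
}

// shouldUpdateGated models the behavior described in this thread: an
// unhealthy release is never updated, even after the CR has been corrected.
func shouldUpdateGated(s releaseState) bool {
	return s.Healthy && s.SpecChanged
}

// shouldUpdateAlways models the proposed behavior: any spec change is pushed
// to the Helm release regardless of its current status.
func shouldUpdateAlways(s releaseState) bool {
	return s.SpecChanged
}

func main() {
	// A crashlooping cluster whose CR has just been fixed by the user.
	fixed := releaseState{Healthy: false, SpecChanged: true}
	fmt.Println("gated:", shouldUpdateGated(fixed))
	fmt.Println("always:", shouldUpdateAlways(fixed))
}
```

Under the gated rule the corrected CR is never applied because the release is already unhealthy, which is the deadlock reported above. Under the always-update rule the fix goes through.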

RafalKorepta commented 3 weeks ago

Agree with you @chrisseto