Open chrisseto opened 8 months ago
At the risk of going off on an (already understood) tangent: is the more fundamental issue to solve here not just "Redpanda should be resilient to repeated quick restarts"? As an example, even with this feature we'd still be vulnerable if something external restarted Redpanda many times, right?
I appreciate that's not a conversation for the helm charts repo, but I'm raising it here as I wonder if we should link this issue across to something in the core repo?
Talked with @mattschumpert about this one and I think we're going to go with a bandaid solution for now. We should add a bash script at the start of Redpanda that checks whether the `crash_loop_file` exists. If it does, log a warning explaining what it means and the `kubectl` command users can run to remove the file, then sleep for ~30-60s to give plenty of time to remove said file.
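Roughly something like this (untested sketch; the sentinel file path below is a placeholder, and it assumes such a file only exists once the crash loop limit has been hit):

```bash
#!/usr/bin/env bash
set -euo pipefail

# Placeholder path; the real name/location of the crash loop sentinel is TBD.
CRASH_LOOP_FILE="/var/lib/redpanda/data/crash_loop_sentinel"

if [ -f "${CRASH_LOOP_FILE}" ]; then
  echo "WARNING: ${CRASH_LOOP_FILE} exists; Redpanda appears to have hit its crash loop limit." >&2
  echo "To clear it, run: kubectl exec <pod> -c redpanda -- rm ${CRASH_LOOP_FILE}" >&2
  echo "Sleeping 30s before starting Redpanda to leave a window for intervention..." >&2
  sleep 30
fi

# Hand off to the normal entrypoint; the chart launches Redpanda via rpk.
exec rpk redpanda start "$@"
```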
So I finally got around to giving this a shot and didn't realize until I got to testing it that the `startup_log` file always exists; it holds the metadata required to decide whether the crash loop limit has been hit. To add a sleep delay we'd have to parse that file, which... I don't think we should. It's an internal format.
@mattschumpert @RafalKorepta any thoughts on what we should do with this? Maybe this turns into an ask of the core team to either be resilient to crash looping or add the sleep at the Redpanda level?
I don't think there's a way to hook into Redpanda post-crash, as we'd need to use `exec` to correctly launch Redpanda and `rpk` is doing the same thing.
I think we should be completely OK to parse this internal format, since this is a Redpanda-specific feature and this is a Redpanda-specific helm chart. For the rest, I suggest speaking with @mmaslankaprv about what we can do here.
Our best bet in that case is probably to put something into `rpk`? Otherwise we're going to be parsing a binary format in bash, which I'd really rather not depend on. Even in the `rpk` case, the definition lives in redpanda and seems specific to how C++ marshals structs. Keep me posted on Michał's opinion!
It feels like Redpanda is going too far into managing runtime concerns that are usually left to a process supervisor such as systemd, kubelet, etc.
@chrisseto this needs to go back through @mmaslankaprv for a re-thinking of how this should work in K8s.
We've already proven we NEED crash loop detection on by default (that part is not negotiable).
For the rest, if we want to change the approach around that file (whether it's there, what it looks like, etc.), please work with @mmaslankaprv on what we should do. Maybe worth a live discussion if needed.
FWIW, we had to disable this feature in Azure because it was getting in the way. So I'm not sure it is truly needed, and if it is, it probably points to a design or implementation smell in Redpanda.
Do we have an idea of how many restarts are too many? I know we touched on this @mattschumpert but I don't recall the number. If we could raise the chart default to something that's less likely to trigger once Kubernetes backoffs are taken into account (5 is VERY low), we could move this ticket a bit further into the backlog.
Just as a spitball, what if we went to 50?
@c4milo brand new clusters never need it. You should also speak to @mmaslankaprv, who has the experience to explain why it's needed.
@chrisseto the number was driven by incidents, and there is no sense in making it very high as you just cause more damage and get into unrecoverable situations. I'll stop here and let you speak with the replication team to jointly design the proper solution.
@mattschumpert Just to note that the current implementation also causes significant issues in incidents, where it requires a good deal of extra work to undo the crash loop protection to bring a cluster back online after an issue is resolved. This is true in Cloud, but doubly so in Self-Hosted envs where the same tools may not be available.
@mattschumpert thanks for the additional insight on the default!
There might be an easier answer now that I think about it. The node/broker config is just a configmap mounted into the Pod, and one way to get Redpanda out of a crash loop is to change the node/broker config. In theory, one could just increment the number in the configmap (or even make a simple whitespace change IIUC) to recover a crashing node (rough sketch below).
If we test that and it works as expected, do we think a docs update would be a possible resolution to this issue?
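Something along these lines (hedged sketch; the namespace, configmap name, and pod name are assumptions and will vary with the release name):

```bash
# Assumed names: namespace "redpanda", configmap "redpanda", broker pod "redpanda-0".
# Dump the current broker config from the configmap.
kubectl -n redpanda get configmap redpanda -o yaml > broker-config.yaml

# Edit broker-config.yaml: bump crash_loop_limit (or make a whitespace-only
# change to the broker config key) so the mounted config changes.
"${EDITOR:-vi}" broker-config.yaml

# Apply the edited configmap and restart the crashing broker so it picks it up.
kubectl -n redpanda apply -f broker-config.yaml
kubectl -n redpanda delete pod redpanda-0
```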
In https://github.com/redpanda-data/helm-charts/pull/718 we set the default value of `crash_loop_limit` to 5 to avoid accidentally accumulating many short log segments. While an excellent mitigation, having to wait an entire hour in the case of an accidental misconfiguration is a non-starter. We should either set `crash_loop_limit` to `null` and rely on Kubernetes backoffs, or add a bash shim that waits a few seconds before starting the Redpanda process if the crash limiter file exists. The latter would provide enough of a window of opportunity to A) log what is happening and B) exec in and delete the file.
cc @vuldin for thoughts, as we recently had to suffer due to this feature :), and @JakeSCahill who is writing up documentation related to this.
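For reference, option B would look roughly like this (hedged; the pod/container names and the file path are placeholders — use whatever path Redpanda actually reports for the crash limiter file):

```bash
# During the shim's wait window, exec into the crash-looping broker and delete
# the crash limiter file, then let the process start normally.
kubectl -n redpanda exec redpanda-0 -c redpanda -- \
  rm -f /var/lib/redpanda/data/startup_log
```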
JIRA Link: K8S-110