Bug Description
I was upgrading the operator (0.35.0 -> 0.36.1) when I ran into an issue with the operator taking the entire cluster down. Before the upgrade, terminationGracePeriodSeconds was not set (so it defaulted to 30s). The cluster starts rolling the first node, which unfortunately doesn't have enough time to stop gracefully before it is killed, and it goes into recovery. operationTimeoutMs was set to 2h. The operator eventually times out waiting for the restarted pod to become ready and sets a FatalProblem condition on the Kafka instance:

- lastTransitionTime: "2024-09-26T14:47:54.830455552Z"
  message: Error while waiting for restarted pod main-kafka-1 to become ready
  reason: FatalProblem
  status: "True"
  type: NotReady

As a consequence, the endpoints are retired and the healthy brokers become unreachable, isolating the entire cluster. This is somewhat related to https://github.com/strimzi/strimzi-kafka-operator/issues/5263.
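For reference, a minimal sketch of where the two settings above live, assuming the standard Strimzi APIs (the cluster name is taken from the pod name main-kafka-1; the values are illustrative, not this cluster's actual configuration):

apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: main
spec:
  kafka:
    template:
      pod:
        # Not set before the upgrade, so the Kubernetes default of 30s applied
        # and the broker was killed before it could shut down gracefully.
        terminationGracePeriodSeconds: 120

# The operation timeout is an environment variable on the Cluster Operator
# deployment; 7200000 ms corresponds to the 2h mentioned above.
env:
  - name: STRIMZI_OPERATION_TIMEOUT_MS
    value: "7200000"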
Steps to reproduce
No response
Expected behavior
The operator would not set the Kafka instance to NotReady if a single broker has issues.
Strimzi version
0.36.1
Kubernetes version
1.28
Installation method
HelmRelease via flux v1
Infrastructure
OKD 4.15.0-0.okd-2024-03-10-010116 (OpenShift upstream)
Configuration files and logs
No response
Additional context
No response