strimzi / strimzi-kafka-operator

Apache Kafka® running on Kubernetes
https://strimzi.io/
Apache License 2.0

[Bug]: Operator considers entire cluster failed if it times out on rolling one broker #10647

Closed · aneagoe closed this 1 month ago

aneagoe commented 1 month ago

Bug Description

I was upgrading the operator (0.35.0 -> 0.36.1) when I ran into an issue with the operator taking the entire cluster down. Before the upgrade, terminationGracePeriodSeconds was not set (so the 30s default applied). The operator starts rolling the first broker, which unfortunately doesn't have time to stop gracefully before it is killed, and it enters log recovery on restart. operationTimeoutMs was set to 2h. The operator eventually times out waiting for the pod to become ready and sets a FatalProblem condition on the Kafka instance:

    - lastTransitionTime: "2024-09-26T14:47:54.830455552Z"
      message: Error while waiting for restarted pod main-kafka-1 to become ready
      reason: FatalProblem
      status: "True"
      type: NotReady

As a consequence, the Service endpoints are removed and even the healthy brokers become unreachable, isolating the entire cluster. This is somewhat related to https://github.com/strimzi/strimzi-kafka-operator/issues/5263.
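
For reference, a minimal sketch of the two settings involved, assuming a cluster named main (inferred from the pod name main-kafka-1). Strimzi exposes the pod grace period via the Kafka CR's pod template, and operationTimeoutMs via the STRIMZI_OPERATION_TIMEOUT_MS env var on the Cluster Operator Deployment; the fragments below are illustrative, not a full configuration:

    # Kafka CR fragment: extend the shutdown window so a rolled broker
    # can stop gracefully instead of being killed into log recovery.
    apiVersion: kafka.strimzi.io/v1beta2
    kind: Kafka
    metadata:
      name: main
    spec:
      kafka:
        template:
          pod:
            terminationGracePeriodSeconds: 120  # default is 30s

    # Cluster Operator Deployment fragment: the 2h timeout mentioned
    # above corresponds to 7200000 ms.
    env:
      - name: STRIMZI_OPERATION_TIMEOUT_MS
        value: "7200000"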

Steps to reproduce

No response

Expected behavior

The operator should not set the Kafka instance to NotReady when only a single broker has issues.
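
Expressed as a hypothetical status shape (illustrative only, not current Strimzi behavior): the cluster would stay Ready and the stuck broker would be surfaced separately, e.g. as a Warning condition:

    # Hypothetical Kafka CR status: cluster remains Ready, the single
    # broker's failure is reported without isolating the whole cluster.
    status:
      conditions:
        - type: Ready
          status: "True"
        - type: Warning            # hypothetical per-broker surfacing
          reason: PodNotReady
          message: Restarted pod main-kafka-1 did not become ready in time
          status: "True"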

Strimzi version

0.36.1

Kubernetes version

1.28

Installation method

HelmRelease via Flux v1

Infrastructure

OKD 4.15.0-0.okd-2024-03-10-010116 (OpenShift upstream)

Configuration files and logs

No response

Additional context

No response