strimzi / strimzi-kafka-operator

Apache Kafka® running on Kubernetes
https://strimzi.io/
Apache License 2.0

[Bug]: Operator considers entire cluster failed if it times out on rolling one broker #10647

Closed · aneagoe closed this 1 month ago

aneagoe commented 1 month ago

Bug Description

I was upgrading the operator (0.35.0 -> 0.36.1) when I ran into an issue with the operator taking the entire cluster down. Before the upgrade, terminationGracePeriodSeconds was not set (so the 30s default applied). The operator starts rolling the first broker, which unfortunately doesn't have time to stop gracefully before it is killed, and it enters log recovery on restart. operationTimeoutMs was set to 2h. The operator eventually times out waiting for the pod to become ready and sets a FatalProblem condition on the Kafka instance:

    - lastTransitionTime: "2024-09-26T14:47:54.830455552Z"
      message: Error while waiting for restarted pod main-kafka-1 to become ready
      reason: FatalProblem
      status: "True"
      type: NotReady

As a consequence, the Service endpoints are removed and even the healthy brokers become unreachable, isolating the entire cluster. This is somewhat related to https://github.com/strimzi/strimzi-kafka-operator/issues/5263.
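
For reference, a minimal sketch of the two settings involved, assuming a cluster named main (inferred from the pod name main-kafka-1). Strimzi exposes the pod grace period via the Kafka CR's pod template, and operationTimeoutMs via the STRIMZI_OPERATION_TIMEOUT_MS env var on the Cluster Operator Deployment; the fragments below are illustrative, not a full configuration:

    # Kafka CR fragment: extend the shutdown window so a rolled broker
    # can stop gracefully instead of being killed into log recovery.
    apiVersion: kafka.strimzi.io/v1beta2
    kind: Kafka
    metadata:
      name: main
    spec:
      kafka:
        template:
          pod:
            terminationGracePeriodSeconds: 120  # default is 30s

    # Cluster Operator Deployment fragment: the 2h timeout mentioned
    # above corresponds to 7200000 ms.
    env:
      - name: STRIMZI_OPERATION_TIMEOUT_MS
        value: "7200000"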

Steps to reproduce

No response

Expected behavior

The operator should not set the Kafka instance to NotReady when only a single broker has issues.
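
Expressed as a hypothetical status shape (illustrative only, not current Strimzi behavior): the cluster would stay Ready and the stuck broker would be surfaced separately, e.g. as a Warning condition:

    # Hypothetical Kafka CR status: cluster remains Ready, the single
    # broker's failure is reported without isolating the whole cluster.
    status:
      conditions:
        - type: Ready
          status: "True"
        - type: Warning            # hypothetical per-broker surfacing
          reason: PodNotReady
          message: Restarted pod main-kafka-1 did not become ready in time
          status: "True"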

Strimzi version

0.36.1

Kubernetes version

1.28

Installation method

HelmRelease via Flux v1

Infrastructure

OKD 4.15.0-0.okd-2024-03-10-010116 (OpenShift upstream)

Configuration files and logs

No response

Additional context

No response