strimzi / strimzi-kafka-operator

Apache Kafka® running on Kubernetes
https://strimzi.io/
Apache License 2.0

KafkaRoller may continue to try to roll a kafka's pods long after the kafka is deleted #4869

Open k-wall opened 3 years ago

k-wall commented 3 years ago

Describe the bug

Strimzi (latest)

If a kafka is deleted (kafka CR removed), it is possible for KafkaRoller to continue trying to act on that kafka for many minutes, needlessly and continually failing at each retry. This is wasteful of resources.

In our use case, Strimzi will be managing a large number of kafkas, with the set of kafkas changing relatively quickly. There is a real possibility that useful work is delayed.

In the example I highlight below, Strimzi is still needlessly processing the CR (Reconciliation no. 20), 17 minutes after it was deleted.

2021-04-30 11:04:19 INFO  OperatorWatcher:40 - Reconciliation #20(watch) Kafka(foo-fzz4yx57aem1j0b/foo-fzz4yx57aem1j0b): Kafka foo-fzz4yx57aem1j0b in namespace foo-fzz4yx57aem1j0b was ADDED
..
2021-04-30 11:06:40 INFO  OperatorWatcher:40 - Reconciliation #114(watch) Kafka(foo-fzz4yx57aem1j0b/foo-fzz4yx57aem1j0b): Kafka foo-fzz4yx57aem1j0b in namespace foo-fzz4yx57aem1j0b was DELETED

2021-04-30 11:07:48 INFO  KafkaRoller:296 - Reconciliation #20(watch) Kafka(foo-fzz4yx57aem1j0b/foo-fzz4yx57aem1j0b): Could not roll pod 1 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 250ms
...
2021-04-30 11:07:48 DEBUG KafkaRoller:272 - Reconciliation #20(watch) Kafka(foo-fzz4yx57aem1j0b/foo-fzz4yx57aem1j0b): Considering restart of pod 2 after delay of 0 MILLISECONDS
2021-04-30 11:08:18 INFO  KafkaRoller:296 - Reconciliation #20(watch) Kafka(foo-fzz4yx57aem1j0b/foo-fzz4yx57aem1j0b): Could not roll pod 2 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 250ms
2021-04-30 11:08:18 DEBUG KafkaRoller:272 - Reconciliation #20(watch) Kafka(foo-fzz4yx57aem1j0b/foo-fzz4yx57aem1j0b): Considering restart of pod 0 after delay of 250 MILLISECONDS
2021-04-30 11:08:48 INFO  KafkaRoller:296 - Reconciliation #20(watch) Kafka(foo-fzz4yx57aem1j0b/foo-fzz4yx57aem1j0b): Could not roll pod 0 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 500ms
2021-04-30 11:08:48 DEBUG KafkaRoller:272 - Reconciliation #20(watch) Kafka(foo-fzz4yx57aem1j0b/foo-fzz4yx57aem1j0b): Considering restart of pod 1 after delay of 250 MILLISECONDS
2021-04-30 11:09:18 INFO  KafkaRoller:296 - Reconciliation #20(watch) Kafka(foo-fzz4yx57aem1j0b/foo-fzz4yx57aem1j0b): Could not roll pod 1 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 500ms
....
2021-04-30 11:23:53 INFO  KafkaRoller:289 - Reconciliation #20(watch) Kafka(foo-fzz4yx57aem1j0b/foo-fzz4yx57aem1j0b): Could not roll pod 2, giving up after 10 attempts. Total delay between attempts 127750ms

To Reproduce

https://github.com/k-wall/strzimi-del-problem/blob/main/create_kafkas.sh

Steps to reproduce the behavior:

  1. Install Strimzi using the quick start, configuring STRIMZI_NAMESPACE to watch all namespaces (*) following the docs
  2. Run create_kafkas.sh 50 to create 50 kafkas
  3. Wait until approximately 50% have become ready
  4. Run oc delete k -l kafka=true --all-namespaces
  5. Watch the operator logs

Expected behavior

Efficient handling of the kafka delete case, short-circuiting long-running expensive tasks.


YAML files and logs

Attached

issue_4869.log


tombentley commented 3 years ago

This specific problem should be easily fixed by adding a check for the continued existence of the CR into the loop and exiting when it doesn't exist.

More generally, this probably isn't the only place where this could happen. Zookeeper rolls are another possibility. Adding checks in many places doesn't feel like a great solution, but let's see if this is enough in practice and be prepared to revisit if it's not.
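For illustration, here is a minimal sketch of the kind of existence check described above. It is not taken from the Strimzi codebase: it assumes a Fabric8 KubernetesClient handle and the Strimzi Kafka CRD class are available to the retry loop, and the names client, namespace, clusterName and attemptRoll are hypothetical placeholders.

```java
// Hypothetical sketch only: short-circuit a retry loop once the Kafka CR is gone.
// Assumes the Fabric8 KubernetesClient and the Strimzi Kafka CRD class are on the
// classpath (the Kafka class package varies between Strimzi versions).
import io.fabric8.kubernetes.client.KubernetesClient;
import io.strimzi.api.kafka.model.Kafka;

class CrAwareRetryLoop {

    private final KubernetesClient client;
    private final String namespace;
    private final String clusterName;

    CrAwareRetryLoop(KubernetesClient client, String namespace, String clusterName) {
        this.client = client;
        this.namespace = namespace;
        this.clusterName = clusterName;
    }

    /** True while the Kafka custom resource still exists in the namespace. */
    boolean kafkaCrStillExists() {
        return client.resources(Kafka.class)
                .inNamespace(namespace)
                .withName(clusterName)
                .get() != null;
    }

    /** Retry an action with a doubling backoff, but give up as soon as the CR is deleted. */
    void retryWhileCrExists(Runnable attemptRoll, int maxAttempts) throws InterruptedException {
        long delayMs = 250;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (!kafkaCrStillExists()) {
                // The Kafka CR was deleted: stop retrying instead of sitting
                // through the remaining backoff delays.
                return;
            }
            try {
                attemptRoll.run();
                return;
            } catch (RuntimeException e) {
                Thread.sleep(delayMs);
                delayMs *= 2;
            }
        }
    }
}
```

The same idea could be placed in any of the long-running loops mentioned in this thread (Kafka rolls, ZooKeeper rolls); the open question from the comment above is whether sprinkling such checks in many places is acceptable.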

MikeEdgar commented 3 years ago

@tombentley - I've been brainstorming on this one a bit and I'm thinking about the following approach. Let me know if this sounds reasonable or not.

redsk commented 2 years ago

I have the same problem. If there is a mistake in the configuration of the kafka cluster, KafkaRoller enters a loop and does not recover; it keeps trying to roll the brokers and does not stop, not even when the kafka cluster is deleted. It actually takes a long time before it stops trying to reconcile the cluster.

scholzj commented 2 years ago

Triaged on 21.7.2022: This issue should be used to track improvements to the KafkaRoller's timeouts when it cannot connect to Kafka. Currently, the exponential backoff can make it take quite a long time before the roller actually ends in an error state which is handled. This might be improved with changes and improvements to the KafkaRoller (e.g. a better agent reporting broker states back, etc.). @tombentley is leading the work on this redesign.

#4856 is used to keep track of the whole reconciliation continuing when the CR is deleted, which is a separate problem that also affects parts other than the KafkaRoller.
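To put a rough number on the backoff behaviour mentioned in the triage note: the "giving up after 10 attempts. Total delay between attempts 127750ms" lines in the logs are consistent with a delay that starts at 250ms and doubles before each retry (these backoff parameters are read off the log lines above, not quoted from the KafkaRoller source). A quick check of that arithmetic:

```java
// Sanity-check the "Total delay between attempts 127750ms" figure from the logs,
// assuming the delay starts at 250ms and doubles before each of the 10 attempts.
public class BackoffTotal {
    public static void main(String[] args) {
        long delayMs = 250;
        long totalMs = 0;
        int attempts = 10;
        // 10 attempts means 9 waits between them: 250, 500, 1000, ..., 64000 ms.
        for (int i = 1; i < attempts; i++) {
            totalMs += delayMs;
            delayMs *= 2;
        }
        System.out.println(totalMs + " ms"); // prints: 127750 ms
    }
}
```

On top of those waits come the per-attempt timeouts themselves (roughly 30 seconds between attempts in the logs above), which is how a single deleted cluster can keep the roller busy for many minutes.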

eslam-gomaa commented 2 years ago

Hi @scholzj, any updates on this? I'm facing the same issue.

2022-10-14 19:53:48 INFO  KafkaRoller:306 - Reconciliation #359(timer) Kafka(strimzi-kafka-preprod/strimzi-kafka-cluster-oci-preprod): Could not roll pod {strimzi-kafka-cluster-oci-preprod-kafka-0} due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Pod strimzi-kafka-cluster-oci-preprod-kafka-0 is currently the controller and there are other pods still to roll, retrying after at least 16000ms
2022-10-14 19:54:20 INFO  KafkaRoller:306 - Reconciliation #359(timer) Kafka(strimzi-kafka-preprod/strimzi-kafka-cluster-oci-preprod): Could not roll pod {strimzi-kafka-cluster-oci-preprod-kafka-2} due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 32000ms
2022-10-14 19:54:50 INFO  KafkaRoller:306 - Reconciliation #359(timer) Kafka(strimzi-kafka-preprod/strimzi-kafka-cluster-oci-preprod): Could not roll pod {strimzi-kafka-cluster-oci-preprod-kafka-1} due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 32000ms
2022-10-14 19:54:51 INFO  KafkaRoller:306 - Reconciliation #359(timer) Kafka(strimzi-kafka-preprod/strimzi-kafka-cluster-oci-preprod): Could not roll pod {strimzi-kafka-cluster-oci-preprod-kafka-0} due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Pod strimzi-kafka-cluster-oci-preprod-kafka-0 is currently the controller and there are other pods still to roll, retrying after at least 32000ms
2022-10-14 19:55:22 INFO  KafkaRoller:306 - Reconciliation #359(timer) Kafka(strimzi-kafka-preprod/strimzi-kafka-cluster-oci-preprod): Could not roll pod {strimzi-kafka-cluster-oci-preprod-kafka-2} due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 64000ms
2022-10-14 19:55:53 INFO  KafkaRoller:306 - Reconciliation #359(timer) Kafka(strimzi-kafka-preprod/strimzi-kafka-cluster-oci-preprod): Could not roll pod {strimzi-kafka-cluster-oci-preprod-kafka-1} due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 64000ms
2022-10-14 19:55:54 INFO  KafkaRoller:306 - Reconciliation #359(timer) Kafka(strimzi-kafka-preprod/strimzi-kafka-cluster-oci-preprod): Could not roll pod {strimzi-kafka-cluster-oci-preprod-kafka-0} due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Pod strimzi-kafka-cluster-oci-preprod-kafka-0 is currently the controller and there are other pods still to roll, retrying after at least 64000ms
2022-10-14 19:57:37 INFO  KafkaRoller:299 - Reconciliation #359(timer) Kafka(strimzi-kafka-preprod/strimzi-kafka-cluster-oci-preprod): Could not roll pod {strimzi-kafka-cluster-oci-preprod-kafka-2}, giving up after 10 attempts. Total delay between attempts 127750ms

A bit about my case

Pods are crashing due to a disk issue: "No space left on device".

FileSystemResizePending   True    Mon, 01 Jan 0001 00:00:00 +0000   Fri, 14 Oct 2022 22:02:28 +0200           Waiting for user to (re-)start a pod to finish file system resize of volume on node.
Events:
  Type    Reason                    Age                      From                                              Message
  ----    ------                    ----                     ----                                              -------
  Normal  Resizing                  3m14s (x1299 over 2d1h)  external-resizer blockvolume.csi.oraclecloud.com  External resizer is resizing volume csi-c652ce27-1381-4e5a-acdf-06d14ae65024
  Normal  FileSystemResizeRequired  3m13s (x1297 over 2d1h)  external-resizer blockvolume.csi.oraclecloud.com  Require file system resize of volume on node
scholzj commented 2 years ago

@eslam-gomaa I'm not sure how or if this is related to this issue, so it might be worth discussing it separately in its own discussion. In any case, there are no updates on this issue.

eslam-gomaa commented 2 years ago

Thanks @scholzj for replying. I created an issue for it: https://github.com/strimzi/strimzi-kafka-operator/issues/7484