Open · k-wall opened this issue 3 years ago
This specific problem should be easy to fix by adding a check inside the loop for the continued existence of the CR and exiting when it no longer exists.
More generally, this probably isn't the only place where this could happen. Zookeeper rolls are another possibility. Adding checks in many places doesn't feel like a great solution, but let's see if this is enough in practice and be prepared to revisit if it's not.
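For illustration, a minimal sketch of what such a check could look like; `CrGetter`, `RollingLoop`, and `rollNext` are hypothetical stand-ins for the operator's CR lookup and the roller's per-pod step, not Strimzi's actual types:

```java
// Minimal sketch only, assuming a hypothetical CrGetter lookup and rollNext() step;
// this is not the actual KafkaRoller code, just the shape of the check described above.
import java.util.concurrent.TimeUnit;

interface CrGetter {
    /** Returns true if the Kafka CR still exists in the given namespace. */
    boolean exists(String namespace, String name);
}

class RollingLoop {
    private final CrGetter crGetter;

    RollingLoop(CrGetter crGetter) {
        this.crGetter = crGetter;
    }

    void rollAll(String namespace, String name, int podCount) throws InterruptedException {
        for (int pod = 0; pod < podCount; pod++) {
            // Stop as soon as the CR has gone away instead of retrying against a deleted cluster.
            if (!crGetter.exists(namespace, name)) {
                return;
            }
            rollNext(namespace, name, pod);
            TimeUnit.SECONDS.sleep(1); // placeholder pacing between pods
        }
    }

    private void rollNext(String namespace, String name, int pod) {
        // restart the pod here (omitted)
    }
}
```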
@tombentley - I've been brainstorming on this one a bit and I'm thinking about the following approach. Let me know if this sounds reasonable or not.
* Add two fields to Reconciliation: a final Action action and a boolean cancelled.
* Keep a collection of Reconciliations in the AbstractOperator, including those currently running as well as those waiting for a lock.
* When adding a Reconciliation to the collection (in AbstractOperator#reconcile(Reconciliation)) and the action of the new object is DELETED, find other entries of the same kind, namespace, and name (i.e. getLockName) and set cancelled to true.
* Long-running code checks Reconciliation#cancelled to determine whether looping should continue, etc. In particular, the KafkaRoller for a start.
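A rough sketch of how that could fit together, using simplified stand-in types rather than Strimzi's actual Reconciliation and AbstractOperator (the field and method names here are illustrative):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

enum Action { ADDED, MODIFIED, DELETED }

class Reconciliation {
    final String kind;
    final String namespace;
    final String name;
    final Action action;
    private volatile boolean cancelled = false;

    Reconciliation(String kind, String namespace, String name, Action action) {
        this.kind = kind;
        this.namespace = namespace;
        this.name = name;
        this.action = action;
    }

    // Analogous to getLockName(): one key per kind/namespace/name.
    String lockName() {
        return kind + "::" + namespace + "::" + name;
    }

    void cancel() {
        cancelled = true;
    }

    boolean isCancelled() {
        return cancelled;
    }
}

class ReconciliationRegistry {
    // All Reconciliations that are running or waiting for a lock.
    private final Set<Reconciliation> inFlight = ConcurrentHashMap.newKeySet();

    void register(Reconciliation incoming) {
        // A DELETED event cancels any in-flight work on the same kind/namespace/name.
        if (incoming.action == Action.DELETED) {
            for (Reconciliation existing : inFlight) {
                if (existing.lockName().equals(incoming.lockName())) {
                    existing.cancel();
                }
            }
        }
        inFlight.add(incoming);
    }

    void complete(Reconciliation finished) {
        inFlight.remove(finished);
    }
}
```

Long-running work such as the roller's retry loop would then poll isCancelled() between steps and return early instead of continuing to back off against a deleted cluster.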
I have the same problem. If there's a mistake in the configuration of the kafka cluster, KafkaRoller enters a loop and does not recover; it keeps trying to roll the brokers and does not stop, not even when the kafka cluster is deleted. It actually takes a long time to stop trying to reconcile the cluster.
Triaged on 21.7.2022: This issue should be used to track improvements to the KafkaRoller's timeouts when it cannot connect to Kafka. Currently, the exponential backoff can mean it takes quite a long time before the roller actually ends in an error state which is handled. This might be improved with some changes and improvements to the KafkaRoller (e.g. a better agent reporting broker states back, etc.). @tombentley is leading the work on this redesign.
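For context on how long that backoff can run: the delays in the log further down this thread (16 s, 32 s, 64 s, then "giving up after 10 attempts. Total delay between attempts 127750ms") are consistent with an initial 250 ms delay that doubles before each retry. A back-of-the-envelope sketch of that arithmetic; the 250 ms start and the doubling factor are inferred from the log, not taken from the KafkaRoller source:

```java
// Sums the 9 waits between 10 attempts, assuming a 250 ms initial delay doubled each retry.
public class BackoffTotal {
    public static void main(String[] args) {
        long delayMs = 250;
        long totalMs = 0;
        for (int attempt = 1; attempt < 10; attempt++) {
            totalMs += delayMs;
            System.out.printf("wait before attempt %d: %d ms%n", attempt + 1, delayMs);
            delayMs *= 2;
        }
        System.out.println("total delay: " + totalMs + " ms"); // prints 127750 ms
    }
}
```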
Hi @scholzj, any updates on this? I'm facing the same issue.
2022-10-14 19:53:48 INFO KafkaRoller:306 - Reconciliation #359(timer) Kafka(strimzi-kafka-preprod/strimzi-kafka-cluster-oci-preprod): Could not roll pod {strimzi-kafka-cluster-oci-preprod-kafka-0} due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Pod strimzi-kafka-cluster-oci-preprod-kafka-0 is currently the controller and there are other pods still to roll, retrying after at least 16000ms
2022-10-14 19:54:20 INFO KafkaRoller:306 - Reconciliation #359(timer) Kafka(strimzi-kafka-preprod/strimzi-kafka-cluster-oci-preprod): Could not roll pod {strimzi-kafka-cluster-oci-preprod-kafka-2} due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 32000ms
2022-10-14 19:54:50 INFO KafkaRoller:306 - Reconciliation #359(timer) Kafka(strimzi-kafka-preprod/strimzi-kafka-cluster-oci-preprod): Could not roll pod {strimzi-kafka-cluster-oci-preprod-kafka-1} due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 32000ms
2022-10-14 19:54:51 INFO KafkaRoller:306 - Reconciliation #359(timer) Kafka(strimzi-kafka-preprod/strimzi-kafka-cluster-oci-preprod): Could not roll pod {strimzi-kafka-cluster-oci-preprod-kafka-0} due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Pod strimzi-kafka-cluster-oci-preprod-kafka-0 is currently the controller and there are other pods still to roll, retrying after at least 32000ms
2022-10-14 19:55:22 INFO KafkaRoller:306 - Reconciliation #359(timer) Kafka(strimzi-kafka-preprod/strimzi-kafka-cluster-oci-preprod): Could not roll pod {strimzi-kafka-cluster-oci-preprod-kafka-2} due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 64000ms
2022-10-14 19:55:53 INFO KafkaRoller:306 - Reconciliation #359(timer) Kafka(strimzi-kafka-preprod/strimzi-kafka-cluster-oci-preprod): Could not roll pod {strimzi-kafka-cluster-oci-preprod-kafka-1} due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 64000ms
2022-10-14 19:55:54 INFO KafkaRoller:306 - Reconciliation #359(timer) Kafka(strimzi-kafka-preprod/strimzi-kafka-cluster-oci-preprod): Could not roll pod {strimzi-kafka-cluster-oci-preprod-kafka-0} due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Pod strimzi-kafka-cluster-oci-preprod-kafka-0 is currently the controller and there are other pods still to roll, retrying after at least 64000ms
2022-10-14 19:57:37 INFO KafkaRoller:299 - Reconciliation #359(timer) Kafka(strimzi-kafka-preprod/strimzi-kafka-cluster-oci-preprod): Could not roll pod {strimzi-kafka-cluster-oci-preprod-kafka-2}, giving up after 10 attempts. Total delay between attempts 127750ms
Pods are crashing due to disk issue "No space left on device"
FileSystemResizePending True Mon, 01 Jan 0001 00:00:00 +0000 Fri, 14 Oct 2022 22:02:28 +0200 Waiting for user to (re-)start a pod to finish file system resize of volume on node.
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Resizing 3m14s (x1299 over 2d1h) external-resizer blockvolume.csi.oraclecloud.com External resizer is resizing volume csi-c652ce27-1381-4e5a-acdf-06d14ae65024
Normal FileSystemResizeRequired 3m13s (x1297 over 2d1h) external-resizer blockvolume.csi.oraclecloud.com Require file system resize of volume on node
@eslam-gomaa I'm not sure how/if this is related to this issue, so it might be worth raising it in a separate discussion. In any case, there are no updates on this issue.
Thanks @scholzj for replying. I created an issue for it: https://github.com/strimzi/strimzi-kafka-operator/issues/7484
Describe the bug
Strimzi (latest)
If a kafka is deleted (kafka CR removed), it is possible for KafkaRoller to continue trying to act on the kafka for many minutes, needlessly, continually failing at each retry. This is wasteful of resources.
In our use case, Strimzi will be managing a large number of kafkas, with the set of kafkas changing relatively quickly. There is a real possibility that useful work is delayed.
In the example I highlight below, Strimzi is still needlessly processing the CR (Reconciliation no. 20), 17 minutes after it was deleted.
To Reproduce
https://github.com/k-wall/strzimi-del-problem/blob/main/create_kafkas.sh
Steps to reproduce the behavior:
* Deploy Strimzi with STRIMZI_NAMESPACE set to *, following the docs
* Run create_kafkas.sh 50 to create 50 kafkas
* Run oc delete k -l kafka=true --all-namespaces
Expected behavior
Efficient handling of the kafka delete case, short-circuiting long-running expensive tasks.
Environment (please complete the following information):
YAML files and logs
Attached
issue_4869.log