Hello, I found this from Google when analyzing my problem.
I am caught in a loop that repeats like this:
2021-08-12 14:13:02 INFO KafkaAvailability:121 - strimzi.cruisecontrol.partitionmetricsamples/0 will be underreplicated (|ISR|=1 and min.insync.replicas=1) if broker 1 is restarted.
2021-08-12 14:13:02 INFO KafkaRoller:296 - Reconciliation #25887(timer) Kafka(kafka-devl/cluster-main): Could not roll pod 1 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$UnforceableProblem: Pod cluster-main-kafka-1 is currently not rollable, retrying after at least 32000ms
2021-08-12 14:13:04 INFO KafkaRoller:504 - Reconciliation #25887(timer) Kafka(kafka-devl/cluster-main): Pod 2 needs to be restarted. Reason: [Pod has old generation]
2021-08-12 14:13:05 INFO KafkaRoller:296 - Reconciliation #25887(timer) Kafka(kafka-devl/cluster-main): Could not roll pod 2 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Pod cluster-main-kafka-2 is currently the controller and there are other pods still to roll, retrying after at least 32000ms
There is no progress. I'm running AMQ Streams 1.7 (equivalent to Strimzi 0.22). The mentioned CruiseControl topic has replication factor 2.
To be fair I will admit that I'm trying something "weird": I scaled down my test Kafka cluster from 4 to 3 brokers in the custom resource to see what would happen. Is this behavior expected? I thought maybe the cluster operator would refuse to scale down the cluster, but it did not. Would it be considered my responsibility to move all replicas away from the broker that will be removed before doing so?
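In case it helps anyone hitting the same loop: the ISR state of the topic named in the KafkaAvailability message can be inspected from inside one of the broker pods. This is only a sketch of what I ran; it assumes a plain listener on localhost:9092 and the default Strimzi pod naming, so adjust the namespace, cluster name, and port for your setup.

```
# Show the replicas and ISR of the topic blocking the roll
kubectl exec -n kafka-devl cluster-main-kafka-0 -- \
  bin/kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --topic strimzi.cruisecontrol.partitionmetricsamples

# List all under-replicated partitions in the cluster
kubectl exec -n kafka-devl cluster-main-kafka-0 -- \
  bin/kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --under-replicated-partitions
```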
Before you scale down the Kafka cluster, you have to move the replicas off the last broker(s) which will be removed by the scale-down. Right now, that has to be done manually. A rough sketch with the stock Kafka tooling is below.
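Roughly, using the tools shipped with Kafka, it can look like this (the topic list and target broker IDs are just placeholders; run it from a broker pod or any host that can reach the cluster):

```
# topics.json lists the topics whose partitions currently have replicas on the broker being removed
cat > topics.json <<'EOF'
{"version": 1, "topics": [{"topic": "strimzi.cruisecontrol.partitionmetricsamples"}]}
EOF

# Generate a reassignment plan that only uses the brokers which will remain (here 0, 1 and 2)
bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --topics-to-move-json-file topics.json --broker-list "0,1,2" --generate

# Save the "Proposed partition reassignment configuration" output as reassign.json, then apply it
bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --reassignment-json-file reassign.json --execute

# Check progress until all reassignments have completed
bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --reassignment-json-file reassign.json --verify
```

Only once the reassignment has completed and the broker to be removed holds no replicas should you lower the replica count in the Kafka custom resource.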
I see. Thanks for your reply! Would it be possible to instruct Cruise Control to perform such an "emptying" of a broker?
I think that requires further development and integration.
We're having more or less the same issue. We upgraded from Kafka 2.7 to 2.8 in our Kafka CR. Now the operator is caught in a loop:
2021-10-25 12:45:53 ERROR AbstractOperator:274 - Reconciliation #34326(watch) Kafka(my-namespaces/my-cluster): createOrUpdate failed
io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Pod my-cluster-kafka-0 is currently the controller and there are other pods still to roll
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartIfNecessary(KafkaRoller.java:358) ~[io.strimzi.cluster-operator-0.24.0.jar:0.24.0]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$schedule$6(KafkaRoller.java:277) ~[io.strimzi.cluster-operator-0.24.0.jar:0.24.0]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
After deleting broker-0, the cluster performed a rolling update and all was fine again. Interestingly, we also had a partition that was not in sync before doing this, and rolling updates of the cluster did not fix that partition issue. Now, after going to 2.8.0 AND deleting broker-0 to trigger a rolling restart, this problem is gone as well. I don't think this is intended behaviour.
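For reference, the manual restart of the stuck controller pod was nothing more than the following (pod and namespace names are from our setup; in this Strimzi version the pods are managed by a StatefulSet, so the pod is recreated automatically):

```
# Delete the controller pod; the StatefulSet recreates it, the controller moves
# to another broker, and the operator can continue rolling the remaining pods.
kubectl delete pod my-cluster-kafka-0 -n my-namespaces
```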
When the Kafka pods are not ready or are crash-looping for some reason, the KafkaRoller needs to restart them to fix the issue. Instead, it endlessly tries to connect to them and never restarts them.
Apart from not restarting the pods, it also takes a very long time to time out while claiming it times out only after a short time. After roughly 20 minutes you get the message
where it says it made the attempts in 127 seconds, while in reality it took 20 minutes.
The full log is below. This should be fixed in 0.20.0 - I think this is a regression.