Remove-brokers rebalancing seems to get stuck by race condition

scholzj commented 1 month ago

When non-empty nodes are scaled down, the scale-down is blocked, as the nodes first need to be cleaned up, for example using the remove-brokers feature in Cruise Control. Once the scaled-down nodes are empty, the CO will execute the scale-down and delete them. But it seems that there is room for a race condition between the KafkaAssemblyOperator and the KafkaRebalanceAssemblyOperator:

The remove-brokers rebalance is ongoing; the KafkaRebalanceAssemblyOperator marks the KafkaRebalance resource as Rebalancing and periodically (every 2 minutes) checks the progress.
The KafkaAssemblyOperator sees that the nodes are already empty and proceeds to scale down the broker and roll Cruise Control with the new cluster configuration.

Later (after the CC is rolled) the KafkaRebalanceAssemblyOperator starts another reconciliation round. But it seems that:

Cruise Control does not like the request anymore and throws an exception:

com.linkedin.kafka.cruisecontrol.exception.KafkaCruiseControlException: java.lang.IllegalArgumentException: Broker 14 does not exist.

The KafkaRebalanceAssemblyOperator tries to generate a new proposal and seems to get stuck:

colog | grep "#313(timer)"
2024-09-23 21:13:37 INFO  AbstractOperator:266 - Reconciliation #313(timer) KafkaRebalance(myproject/my-cluster-auto-rebalancing-remove-brokers): KafkaRebalance my-cluster-auto-rebalancing-remove-brokers will be checked for creation or modification
2024-09-23 21:13:37 INFO  KafkaRebalanceAssemblyOperator:317 - Reconciliation #313(timer) KafkaRebalance(myproject/my-cluster-auto-rebalancing-remove-brokers): Rebalance action is performed and KafkaRebalance resource is currently in [Rebalancing] state
2024-09-23 21:13:37 INFO  KafkaRebalanceAssemblyOperator:854 - Reconciliation #313(timer) KafkaRebalance(myproject/my-cluster-auto-rebalancing-remove-brokers): Getting Cruise Control rebalance user task status
2024-09-23 21:13:37 WARN  KafkaRebalanceAssemblyOperator:863 - Reconciliation #313(timer) KafkaRebalance(myproject/my-cluster-auto-rebalancing-remove-brokers): User task 670c1383-aa04-4979-8cc6-41fe9f69efce not found, going to generate a new proposal
2024-09-23 21:13:37 INFO  KafkaRebalanceAssemblyOperator:1113 - Reconciliation #313(timer) KafkaRebalance(myproject/my-cluster-auto-rebalancing-remove-brokers): Requesting Cruise Control rebalance [dryrun=true]
2024-09-23 21:14:37 INFO  AbstractOperator:401 - Reconciliation #313(timer) KafkaRebalance(myproject/my-cluster-auto-rebalancing-remove-brokers): Reconciliation is in progress
2024-09-23 21:15:37 INFO  AbstractOperator:401 - Reconciliation #313(timer) KafkaRebalance(myproject/my-cluster-auto-rebalancing-remove-brokers): Reconciliation is in progress
2024-09-23 21:16:37 INFO  AbstractOperator:401 - Reconciliation #313(timer) KafkaRebalance(myproject/my-cluster-auto-rebalancing-remove-brokers): Reconciliation is in progress
2024-09-23 21:17:37 INFO  AbstractOperator:401 - Reconciliation #313(timer) KafkaRebalance(myproject/my-cluster-auto-rebalancing-remove-brokers): Reconciliation is in progress
2024-09-23 21:18:37 INFO  AbstractOperator:401 - Reconciliation #313(timer) KafkaRebalance(myproject/my-cluster-auto-rebalancing-remove-brokers): Reconciliation is in progress
2024-09-23 21:19:37 INFO  AbstractOperator:401 - Reconciliation #313(timer) KafkaRebalance(myproject/my-cluster-auto-rebalancing-remove-brokers): Reconciliation is in progress
2024-09-23 21:20:37 INFO  AbstractOperator:401 - Reconciliation #313(timer) KafkaRebalance(myproject/my-cluster-auto-rebalancing-remove-brokers): Reconciliation is in progress

ppatierno commented 1 month ago

Triaged on 03.10.2024: it needs to be investigated and fixed.

ppatierno commented 1 month ago

For more information, the logs here contain a failure in the STs related to this issue: https://dev.azure.com/cncf/strimzi/_build/results?buildId=180961&view=artifacts&pathAsName=false&type=publishedArtifacts

ppatierno commented 1 month ago

I investigated this issue, also in relation to the auto-rebalancing logic (where it's mostly failing in the STs above). After CC is rolled, because the brokers are finally scaled down (the rebalancing was done), the KafkaRebalanceAssemblyOperator asks for the task status (in order to update the KafkaRebalance resource to Ready because the rebalancing is done), but the Cruise Control JSON response is empty (CC was restarted and has no memory of the previously running tasks). By default we re-issue a new rebalance proposal request, which doesn’t work in all scenarios (remove_brokers is one example, when the brokers to remove don’t exist anymore).

https://github.com/strimzi/strimzi-kafka-operator/blob/main/cluster-operator/src/main/java/io/strimzi/operator/cluster/operator/assembly/KafkaRebalanceAssemblyOperator.java#L766
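To make the sequence concrete, here is a toy model of it. Everything in it (`FakeCruiseControl`, `checkProgress`, the status strings, the broker IDs other than 14) is hypothetical and not part of Strimzi or Cruise Control. It only models the two facts described above: task state is lost when CC restarts, and a re-issued remove-brokers request fails once the broker is gone. It does not reproduce the hang itself.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model only; none of these names exist in Strimzi or Cruise Control. */
final class RestartRaceSketch {

    /** Stand-in for Cruise Control: user tasks live in memory and vanish on restart. */
    static final class FakeCruiseControl {
        private final Map<String, String> userTasks = new HashMap<>();
        private Set<Integer> brokers;

        FakeCruiseControl(Set<Integer> brokers) {
            this.brokers = new TreeSet<>(brokers);
        }

        /** Accepts a remove-brokers request only for a broker that still exists. */
        void startRemoveBrokers(String taskId, int brokerId) {
            if (!brokers.contains(brokerId)) {
                throw new IllegalArgumentException("Broker " + brokerId + " does not exist.");
            }
            userTasks.put(taskId, "InExecution");
        }

        /** Returns null when the task is unknown, for example after a restart. */
        String taskStatus(String taskId) {
            return userTasks.get(taskId);
        }

        /** Rolling with a new cluster configuration: new broker set, no task history. */
        void restartWith(Set<Integer> newBrokers) {
            userTasks.clear();
            brokers = new TreeSet<>(newBrokers);
        }
    }

    /** One timer-driven check; returns a description of what happened. */
    static String checkProgress(FakeCruiseControl cc, String taskId, int brokerId) {
        String status = cc.taskStatus(taskId);
        if (status != null) {
            return status;
        }
        // Task not found: fall back to requesting a new proposal, as in the log above.
        try {
            cc.startRemoveBrokers("new-task", brokerId);
            return "NewProposalRequested";
        } catch (IllegalArgumentException e) {
            return "Error: " + e.getMessage();
        }
    }
}
```

In this model, a check made before `restartWith(...)` finds the task, while the same check made after the restart with broker 14 removed ends in the "does not exist" error, because the fallback request names a broker that is no longer part of the cluster.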

I think the only way is to handle errors case by case: re-issuing the rebalance proposal request could be OK the first time, but if it returns a clearer error about what’s going wrong (e.g. a broker which does not exist), the KafkaRebalance should be moved to the NotReady state with the error message. We already handle these specific errors in the Cruise Control API implementation class, but that handling is only exercised in tests, not in a real use case like this.

https://github.com/strimzi/strimzi-kafka-operator/blob/main/cluster-operator/src/main/java/io/strimzi/operator/cluster/operator/resource/cruisecontrol/CruiseControlApiImpl.java#L224
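A minimal sketch of what "case by case" could mean here, under stated assumptions: the class, the `State` enum and `onProposalError` are hypothetical names for illustration and are not the Strimzi API. The only input taken from this issue is the error text "Broker 14 does not exist."; which other errors should be treated as permanent is left open.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch, not Strimzi code: classify the error from a re-issued proposal request. */
final class RebalanceErrorSketch {

    /** Illustrative outcomes only; these are not the real KafkaRebalance states. */
    enum State { RETRY, NOT_READY }

    static final class Outcome {
        final State state;
        final String message;

        Outcome(State state, String message) {
            this.state = state;
            this.message = message;
        }
    }

    // Matches the message seen in this issue, e.g. "Broker 14 does not exist."
    private static final Pattern MISSING_BROKER = Pattern.compile("Broker \\d+ does not exist");

    /** A missing broker cannot be fixed by retrying, so surface the error instead of looping. */
    static Outcome onProposalError(String errorMessage) {
        if (errorMessage != null && MISSING_BROKER.matcher(errorMessage).find()) {
            return new Outcome(State.NOT_READY, errorMessage);
        }
        // Assumption: anything unrecognised is treated as transient and retried.
        return new Outcome(State.RETRY, errorMessage);
    }
}
```

The design choice in this sketch is that only errors known to be permanent end the retry loop, and the original message is carried along so it can be shown on the resource.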

What I am not sure about right now is why it's not already updating the KafkaRebalance with the error from the newly issued rebalance request.

For this specific example, the NotReady state could look wrong because in the end the rebalancing did happen; it’s just that CC restarted and lost its memory of it. But it seems to be the only way to go. Also, related to the auto-rebalancing, if the KafkaRebalance ends in the NotReady state, it’s automatically deleted by the reconciler, which is what we want.

So I think that if the KafkaRebalance ends in the NotReady state during a manual rebalancing as well, the user can figure out that the brokers were scaled down: the error in the KafkaRebalance reports that the brokers don’t exist anymore, so they can understand that the rebalancing was done anyway and delete the resource.

@ShubhamRwt I hope the above makes sense and could be helpful to your resolution.

ShubhamRwt commented 1 month ago

Thanks Paolo, yes it makes things more clear.

Remove-brokers rebalancing seems to get stuck by race condition #10631