strimzi / strimzi-kafka-operator

Apache Kafka® running on Kubernetes
https://strimzi.io/
Apache License 2.0
4.86k stars 1.3k forks source link

Remove-brokers rebalancing seems to get stuck by race condition #10631

Closed scholzj closed 1 month ago

scholzj commented 1 month ago

When un-empty nodes are scaled down, the scale-down is blocked ad the nodes need to be first cleaned up for example using the remove-brokers feature in Cruise Control. Once the scaled-down nodes are empty, CO will execute the scale-down and delete them. But it seems that there is a space for a race condition between the KafkaAssemblyOperator and KafkaRebalanceAssemblyOperator:

ppatierno commented 1 month ago

Triaged on 03.10.2024: it needs to be investigated and fixed.

ppatierno commented 1 month ago

For more information, the logs here contains a failure on the STs (linked here) related to this issue: https://dev.azure.com/cncf/strimzi/_build/results?buildId=180961&view=artifacts&pathAsName=false&type=publishedArtifacts

ppatierno commented 1 month ago

I had an investigation on this issue even related to the auto-rebalancing logic (where it's mostly failing in the STs above). After CC is rolled, because brokers are finally scaled down (rebalancing was done), the KafkaRebalanceAssemblyOperator asks for the task status (in order to update the KafkaRebalance resource as Ready because rebalancing is done) but the Cruise Control JSON response is empty (CC was restarted without any memory of the previous running tasks). By default we are going to re-issue a new rebalance proposal request, which doesn’t work in all scenarios (i.e. the remove_brokers is an example when the brokers to remove don’t exist anymore).

https://github.com/strimzi/strimzi-kafka-operator/blob/main/cluster-operator/src/main/java/io/strimzi/operator/cluster/operator/assembly/KafkaRebalanceAssemblyOperator.java#L766

I think the only way is handling errors case by case, so re-issuing the rebalance proposal request could be ok the first time but if it returns a clearer error about what’s going wrong (i.e. a broker which does not exist), the KafkaRebalance should be updated in NotReady state with the error message. We already have the handling of this specific errors in the Cruise Control API implementation class but we are using it just for testing it, not in a real use case like this.

https://github.com/strimzi/strimzi-kafka-operator/blob/main/cluster-operator/src/main/java/io/strimzi/operator/cluster/operator/resource/cruisecontrol/CruiseControlApiImpl.java#L224

What I am not sure right now is why it's not already updating the KafkaRebalance with the error on the new issued rebalance request.

For this specific example, the NotReady state could look as wrong because in the end the rebalancing happened, it’s just CC restarted and losing memory about that. But it seems the only way to go. Also, related to the auto-rebalancing, if the KafkaRebalance ends in NotReady state, it’s automatically deleted by the reconciler which is what we want.

So I think ending in NotReady state even during a manual rebalancing, the user can figure out that brokers were scaled down, the error in the KafkaRebalance reports that brokers don’t exist anymore so they can understand that rebalancing was done anyway and they can delete the resource.

@ShubhamRwt I hope the above makes sense and could be helpful to your resolution.

ShubhamRwt commented 1 month ago

Thanks Paolo, yes it makes things more clear.