Triaged on 03.10.2024: it needs to be investigated and fixed.
For more information, the logs linked here contain a failure in the STs related to this issue: https://dev.azure.com/cncf/strimzi/_build/results?buildId=180961&view=artifacts&pathAsName=false&type=publishedArtifacts
I investigated this issue, also in relation to the auto-rebalancing logic (which is where it's mostly failing in the STs above).
After CC is rolled because the brokers are finally scaled down (the rebalancing was done), the `KafkaRebalanceAssemblyOperator` asks for the task status (in order to update the `KafkaRebalance` resource as `Ready`, because the rebalancing is done), but the Cruise Control JSON response is empty (CC was restarted without any memory of the previously running tasks).
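Purely to illustrate the symptom, here is a minimal hand-rolled sketch of polling the Cruise Control `user_tasks` endpoint, not the operator's own `CruiseControlApi` implementation; it ignores the API authentication/TLS that Strimzi configures, and the host, port and task id are made-up placeholders. After a CC restart the `userTasks` array comes back empty, which is the case described above.

```java
// Sketch only: a hand-rolled poll of the Cruise Control user_tasks endpoint, not the
// operator's own CruiseControlApi implementation. It ignores the API authentication/TLS
// that Strimzi configures; host, port and task id below are made-up placeholders.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class UserTaskPollSketch {

    /**
     * Ask Cruise Control about a previously submitted task. After a CC restart the
     * "userTasks" array comes back empty, which is the situation described above.
     */
    static boolean taskStillKnown(String ccBaseUrl, String userTaskId) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(ccBaseUrl + "/kafkacruisecontrol/user_tasks?json=true&user_task_ids=" + userTaskId))
                .GET()
                .build();

        String body = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString())
                .body();

        // Real code would parse the JSON properly; for this sketch an empty
        // "userTasks" array is enough to detect that CC lost the task.
        return !body.replace(" ", "").contains("\"userTasks\":[]");
    }

    public static void main(String[] args) throws Exception {
        // Assumed in-cluster address of the Cruise Control API (Strimzi exposes it on port 9090).
        boolean known = taskStillKnown("http://my-cluster-cruise-control:9090", "example-task-id");
        System.out.println(known ? "task still tracked" : "task lost, CC was probably restarted");
    }
}
```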
By default we re-issue a new rebalance proposal request, which doesn't work in all scenarios (e.g. `remove_brokers`, when the brokers to remove don't exist anymore).
I think the only way is to handle the errors case by case: re-issuing the rebalance proposal request could be fine the first time, but if it returns a clearer error about what's going wrong (e.g. a broker which does not exist), the `KafkaRebalance` should be moved to the `NotReady` state with the error message. We already have the handling of these specific errors in the Cruise Control API implementation class, but we are only using it for testing, not in a real use case like this one.
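As a rough sketch of what "handling errors case by case" could look like (hypothetical types, not the real Strimzi or Cruise Control API classes): re-issue the proposal once when the task status comes back empty, but move the `KafkaRebalance` to `NotReady` as soon as CC returns a clear error such as a non-existent broker.

```java
// Sketch only: hypothetical types, not the real Strimzi or Cruise Control API classes.
import java.util.Optional;

public class RebalanceErrorHandlingSketch {

    /** Simplified KafkaRebalance states used in this sketch. */
    enum State { REBALANCING, READY, NOT_READY }

    /** Hypothetical view of a Cruise Control answer to a task-status request. */
    record CcResponse(boolean taskFound, Optional<String> errorMessage) { }

    /**
     * Decide the next KafkaRebalance state after polling Cruise Control.
     *
     * @param response        what Cruise Control answered for the task status
     * @param alreadyReissued whether the proposal was already re-issued once after losing the task
     */
    static State nextState(CcResponse response, boolean alreadyReissued) {
        if (response.taskFound()) {
            // CC still knows the task; the real code would check whether it actually
            // completed or is still in progress. Simplified here to "done".
            return State.READY;
        }
        if (response.errorMessage().isPresent()) {
            // CC returned a clear error (e.g. "broker 3 does not exist"): surface it on
            // the KafkaRebalance as NotReady instead of retrying forever.
            return State.NOT_READY;
        }
        // Empty response (CC restarted and lost the task): re-issue the proposal once,
        // otherwise give up and report the problem.
        return alreadyReissued ? State.NOT_READY : State.REBALANCING;
    }

    public static void main(String[] args) {
        // CC restarted during a remove-brokers rebalance and the brokers are already gone.
        CcResponse lostTask = new CcResponse(false, Optional.empty());
        CcResponse brokerGone = new CcResponse(false, Optional.of("Broker 3 does not exist"));

        System.out.println(nextState(lostTask, false)); // REBALANCING: re-issue the proposal
        System.out.println(nextState(brokerGone, true)); // NOT_READY: report the error to the user
    }
}
```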
What I am not sure about right now is why it's not already updating the `KafkaRebalance` with the error from the newly issued rebalance request.
For this specific example, the `NotReady` state could look wrong because in the end the rebalancing happened; it's just that CC restarted and lost its memory of it. But it seems to be the only way to go. Also, related to the auto-rebalancing, if the `KafkaRebalance` ends up in the `NotReady` state, it's automatically deleted by the reconciler, which is what we want.
So I think that by ending up in the `NotReady` state even during a manual rebalancing, the user can figure out that the brokers were scaled down: the error in the `KafkaRebalance` reports that the brokers don't exist anymore, so they can understand that the rebalancing was done anyway and they can delete the resource.
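For illustration only, a minimal sketch of the NotReady clean-up mentioned above, written against the fabric8 client and addressing the `KafkaRebalance` CRD generically; this is not the actual reconciler code, and the namespace and resource name are made-up placeholders.

```java
// Sketch only: not the actual Strimzi reconciler code. It addresses the KafkaRebalance CRD
// generically through the fabric8 client; namespace and resource name are made-up placeholders.
import java.util.List;
import java.util.Map;

import io.fabric8.kubernetes.api.model.GenericKubernetesResource;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;
import io.fabric8.kubernetes.client.dsl.base.ResourceDefinitionContext;

public class NotReadyCleanupSketch {

    // KafkaRebalance CRD coordinates, used so the sketch does not depend on the Strimzi api module.
    private static final ResourceDefinitionContext KAFKA_REBALANCE = new ResourceDefinitionContext.Builder()
            .withGroup("kafka.strimzi.io")
            .withVersion("v1beta2")
            .withKind("KafkaRebalance")
            .withPlural("kafkarebalances")
            .withNamespaced(true)
            .build();

    /** Delete the KafkaRebalance if its status conditions say it ended up NotReady. */
    @SuppressWarnings("unchecked")
    static void deleteIfNotReady(KubernetesClient client, String namespace, String name) {
        GenericKubernetesResource kr = client.genericKubernetesResources(KAFKA_REBALANCE)
                .inNamespace(namespace)
                .withName(name)
                .get();
        if (kr == null) {
            return;
        }

        Map<String, Object> status = (Map<String, Object>) kr.getAdditionalProperties().get("status");
        List<Map<String, Object>> conditions = status == null
                ? List.of()
                : (List<Map<String, Object>>) status.getOrDefault("conditions", List.of());

        boolean notReady = conditions.stream()
                .anyMatch(c -> "NotReady".equals(c.get("type")) && "True".equals(c.get("status")));

        if (notReady) {
            client.genericKubernetesResources(KAFKA_REBALANCE)
                    .inNamespace(namespace)
                    .withName(name)
                    .delete();
        }
    }

    public static void main(String[] args) {
        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            deleteIfNotReady(client, "myproject", "my-rebalance");
        }
    }
}
```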
@ShubhamRwt I hope the above makes sense and is helpful for resolving this.
Thanks Paolo, yes, that makes things clearer.
When non-empty nodes are scaled down, the scale-down is blocked as the nodes first need to be cleaned up, for example using the remove-brokers feature in Cruise Control. Once the scaled-down nodes are empty, the CO will execute the scale-down and delete them. But it seems that there is room for a race condition between the `KafkaAssemblyOperator` and the `KafkaRebalanceAssemblyOperator`:

- The `KafkaRebalanceAssemblyOperator` marks the `KafkaRebalance` resource as `Rebalancing` and periodically (every 2 minutes) checks the progress.
- The `KafkaAssemblyOperator` sees that the nodes are already empty and proceeds to scale down the brokers and roll Cruise Control with the new cluster configuration.
- The `KafkaRebalanceAssemblyOperator` starts another reconciliation round. But it seems that:
  - The `KafkaRebalanceAssemblyOperator` tries to recreate it and seems to get stuck: