linkedin / cruise-control

Cruise-control is the first of its kind to fully automate the dynamic workload rebalance and self-healing of a Kafka cluster. It provides great value to Kafka users by simplifying the operation of Kafka clusters.
https://github.com/linkedin/cruise-control/tags
BSD 2-Clause "Simplified" License

Rebalance failure with "Cannot execute new proposals while there are ongoing partition reassignments initiated by external agent" #1392

Closed: talonx closed this issue 3 years ago

talonx commented 3 years ago

I am running Cruise Control (CC) in a 4-broker Kafka cluster managed by the Strimzi operator on GKE. CC is also managed by the operator in this deployment. Versions:

  - Strimzi operator: 0.18.0
  - Kafka: 2.5.0
  - Cruise Control: 2.0.103 (I had to determine this from the jar name, cruise-control-2.0.103.jar, since there seemed to be no other way)

When I attempt a rebalance through the REST API, calling /kafkacruisecontrol/rebalance?dryrun=false&verbose=true with curl, I get this error message: "Cannot execute new proposals while there are ongoing partition reassignments initiated by external agent"
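For readers who want to reproduce the call, here is a minimal sketch of the same request built in Python. The base URL (`localhost:9090`) and the helper name `build_rebalance_request` are assumptions for illustration and do not come from this thread. The sketch also assumes the rebalance endpoint is called with POST.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: the Cruise Control REST API is reachable at this address
# (for example through a port-forward); adjust it for your deployment.
CC_BASE_URL = "http://localhost:9090"


def build_rebalance_request(base_url: str, dryrun: bool = True, verbose: bool = True) -> Request:
    """Build (but do not send) a POST request to the rebalance endpoint.

    Hypothetical helper: dryrun defaults to True so that nothing is
    executed unless the caller explicitly asks for it.
    """
    query = urlencode({"dryrun": str(dryrun).lower(), "verbose": str(verbose).lower()})
    url = f"{base_url.rstrip('/')}/kafkacruisecontrol/rebalance?{query}"
    return Request(url, method="POST")


request = build_rebalance_request(CC_BASE_URL, dryrun=False)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it. The sketch stops short of that so it can be run without a live cluster.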

However, no one else had initiated a rebalance (including the Strimzi-supplied configuration, which also supports goals). I verified this by invoking /state on CC.
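A check like this can be scripted. The sketch below builds a /state URL restricted to the executor and inspects a JSON response. The `substates` and `json` query parameters, the `ExecutorState` field, and the `NO_TASK_IN_PROGRESS` value are assumptions about the Cruise Control response format and may differ between versions. The sample payload is hand-written for illustration, not output captured from this cluster.

```python
import json
from urllib.parse import urlencode


def build_state_url(base_url: str, substates=("EXECUTOR",)) -> str:
    """Build the URL for a JSON /state query limited to the given substates."""
    query = urlencode({"substates": ",".join(substates), "json": "true"})
    return f"{base_url.rstrip('/')}/kafkacruisecontrol/state?{query}"


def executor_is_idle(state_json: str) -> bool:
    """Return True if the executor reports no task in progress.

    Assumption: the response contains {"ExecutorState": {"state": ...}}.
    """
    state = json.loads(state_json)
    return state.get("ExecutorState", {}).get("state") == "NO_TASK_IN_PROGRESS"


# Hand-written sample payload, used only to exercise the parser.
sample = '{"ExecutorState": {"state": "NO_TASK_IN_PROGRESS"}}'
print(build_state_url("http://localhost:9090"))
print(executor_is_idle(sample))
```

An idle executor only shows that Cruise Control itself is not running a task. The error in this issue refers to reassignments that Cruise Control attributes to an external agent, so an idle executor does not rule it out.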

Is there anything I am missing here?

Lincong commented 3 years ago

@talonx The likely reason is that you are using a version of CC that is inconsistent with the Kafka cluster it manages: this version of CC creates a znode (admin/reassign_partitions) that is not supported after Kafka 2.4. To recover:

  1. Kill the running CC instance.
  2. Delete the admin/reassign_partitions znode.
  3. Delete the controller ZK node, which triggers a controller bounce (this cleans up the controller's internal state).
  4. Re-deploy the correct version of CC (please refer to the environment requirements).
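
Steps 2 and 3 above could be scripted roughly as follows. This is a sketch under stated assumptions, not a procedure confirmed in this thread. It assumes the znodes live at the absolute paths `/admin/reassign_partitions` and `/controller` with no ZooKeeper chroot, and it uses the third-party `kazoo` client for the connection. The function names are hypothetical. A Strimzi-managed ZooKeeper may require TLS settings that are not shown here.

```python
from typing import Iterable, List

# Paths from the steps above, written as absolute znode paths.
# Assumption: Kafka uses no ZooKeeper chroot; prepend it otherwise.
STALE_ZNODES = ("/admin/reassign_partitions", "/controller")


def delete_znodes(zk, paths: Iterable[str] = STALE_ZNODES) -> List[str]:
    """Delete each znode that exists and return the paths that were removed.

    `zk` is any client exposing exists(path) and delete(path),
    such as a started kazoo.client.KazooClient.
    """
    removed = []
    for path in paths:
        if zk.exists(path):
            zk.delete(path)
            removed.append(path)
    return removed


def run_against_zookeeper(hosts: str) -> List[str]:
    """Connect with kazoo (third-party: pip install kazoo) and clean up."""
    from kazoo.client import KazooClient  # imported lazily; optional dependency

    zk = KazooClient(hosts=hosts)
    zk.start()
    try:
        return delete_znodes(zk)
    finally:
        zk.stop()
```

Calling `run_against_zookeeper("zookeeper-host:2181")` (a placeholder address) only after CC has been stopped keeps the order of the steps intact. Deleting `/controller` forces a controller re-election, so this should be done deliberately.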

talonx commented 3 years ago

@Lincong Thanks for the response. The Strimzi operator deploys a supported version of CC along with Kafka, so it cannot be an inconsistent version. However, I upgraded the operator to 0.20.0, which ships a newer version of CC (with the same Kafka version), and the problem went away. So it looks like it might have been a bug in the older version of CC.