pmbuko closed this issue 4 years ago
@pmbuko Thanks for reporting this issue!
This issue can happen if the broker failure detector detects a failed broker when it starts -- i.e. the Kafka cluster already has a failed broker when Cruise Control (CC) starts.
If this is the case, a race condition between (1) the servlet initialization thread (see code) and (2) the broker failure detector's self-healing thread (see code) can mean that the executor has not yet learned about the user task manager when self-healing starts. When that happens, the self-healing action can cause the ProposalExecutionRunnable thread to fail (see code) in a bad state (i.e. with _hasOngoingExecution=true), leading CC to believe that there is an ongoing execution when there is not.
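To make the failure mode concrete, here is a minimal Java sketch of how a flag like _hasOngoingExecution can get stuck. It is not the Cruise Control code: the class and method names (StuckExecutionFlagSketch, Executor, UserTaskManager, executeUnguarded, executeGuarded) are hypothetical stand-ins, and the race is simulated deterministically by starting an execution before the user task manager has been set. The guarded variant shows one possible mitigation, not necessarily the fix that was applied in CC.

```java
import java.util.concurrent.atomic.AtomicBoolean;

class StuckExecutionFlagSketch {

    /** Hypothetical stand-in for the user task manager. */
    static class UserTaskManager {
        void markExecutionBegan(String taskId) {
            // No-op in this sketch.
        }
    }

    /** Hypothetical, heavily simplified executor (assumption, not CC's class). */
    static class Executor {
        // Set later by the (simulated) servlet initialization thread.
        private volatile UserTaskManager userTaskManager;
        private final AtomicBoolean hasOngoingExecution = new AtomicBoolean(false);

        void setUserTaskManager(UserTaskManager manager) {
            this.userTaskManager = manager;
        }

        boolean hasOngoingExecution() {
            return hasOngoingExecution.get();
        }

        /** Failure mode: the flag is set, then the call on a null manager throws. */
        void executeUnguarded(String taskId) {
            hasOngoingExecution.set(true);
            userTaskManager.markExecutionBegan(taskId); // NullPointerException if not set yet
            hasOngoingExecution.set(false);             // never reached on failure
        }

        /** Possible mitigation: reject early and always reset the flag. */
        void executeGuarded(String taskId) {
            UserTaskManager manager = userTaskManager;
            if (manager == null) {
                throw new IllegalStateException("User task manager is not set yet");
            }
            hasOngoingExecution.set(true);
            try {
                manager.markExecutionBegan(taskId);
            } finally {
                hasOngoingExecution.set(false);
            }
        }
    }

    public static void main(String[] args) {
        Executor executor = new Executor();
        try {
            // Self-healing fires before servlet initialization has set the manager.
            executor.executeUnguarded("self-healing");
        } catch (NullPointerException e) {
            System.out.println("execution failed; flag still set: "
                    + executor.hasOngoingExecution());
        }
    }
}
```

In the unguarded path the flag stays true after the failure, so any later check would report an ongoing execution that does not exist, which matches the symptom described above. The guarded path either refuses to start or resets the flag in a finally block.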
While testing Cruise Control 2.0.69 on a Kafka cluster running 2.3.1, I killed a broker and CC took no action. These are the relevant logs:
ExecutorState: {state: NO_TASK_IN_PROGRESS}
No reassignment znode exists in zookeeper.