linkedin / cruise-control

Cruise-control is the first of its kind to fully automate dynamic workload rebalancing and self-healing of a Kafka cluster. It provides great value to Kafka users by simplifying the operation of Kafka clusters.
https://github.com/linkedin/cruise-control/tags
BSD 2-Clause "Simplified" License

Cruise control believes there is an ongoing execution when there is not #969

Closed pmbuko closed 4 years ago

pmbuko commented 5 years ago

Testing Cruise Control 2.0.69 on a Kafka cluster running 2.3.1, I killed a broker and CC took no action. These are the relevant logs:

[2019-10-04 12:23:24,272] INFO Skipping goal violation detection because there are dead brokers/disks in the cluster, flawed brokers: [30] (com.linkedin.kafka.cruisecontrol.detector.GoalViolationDetector)
[2019-10-04 12:23:33,815] WARN BROKER_FAILURE detected {
    Broker 30 failed at 04/10/2019 11:44:42
}. Self healing start time 04/10/2019 12:14:42. (com.linkedin.kafka.cruisecontrol.detector.notifier.SelfHealingNotifier)
[2019-10-04 12:23:33,815] WARN BROKER_FAILURE detected {
    Broker 30 failed at 04/10/2019 11:44:42
}. Self healing start time 04/10/2019 12:14:42. (com.linkedin.kafka.cruisecontrol.detector.notifier.SelfHealingNotifier)
[2019-10-04 12:23:33,815] WARN BROKER_FAILURE detected {
    Broker 30 failed at 04/10/2019 11:44:42
}. Self healing start time 04/10/2019 12:14:42. (com.linkedin.kafka.cruisecontrol.detector.notifier.SelfHealingNotifier)
[2019-10-04 12:23:33,815] WARN Self-healing has been triggered. (com.linkedin.kafka.cruisecontrol.detector.notifier.SelfHealingNotifier)
[2019-10-04 12:23:33,815] WARN Self-healing has been triggered. (com.linkedin.kafka.cruisecontrol.detector.notifier.SelfHealingNotifier)
[2019-10-04 12:23:33,815] WARN Self-healing has been triggered. (com.linkedin.kafka.cruisecontrol.detector.notifier.SelfHealingNotifier)
[2019-10-04 12:23:33,833] INFO Fixing anomaly {
    Broker 30 failed at 04/10/2019 11:44:42
} (com.linkedin.kafka.cruisecontrol.detector.AnomalyDetector)
[2019-10-04 12:23:33,833] INFO Fixing anomaly {
    Broker 30 failed at 04/10/2019 11:44:42
} (com.linkedin.kafka.cruisecontrol.detector.AnomalyDetector)
[2019-10-04 12:23:33,833] INFO Fixing anomaly {
    Broker 30 failed at 04/10/2019 11:44:42
} (com.linkedin.kafka.cruisecontrol.detector.AnomalyDetector)
[2019-10-04 12:23:33,833] WARN [BROKER_FAILURE-e3d-a747-76fad8a5970a] Self-healing failed to start. (com.linkedin.kafka.cruisecontrol.detector.AnomalyDetector)
[2019-10-04 12:23:33,833] WARN [BROKER_FAILURE-e3d-a747-76fad8a5970a] Self-healing failed to start. (com.linkedin.kafka.cruisecontrol.detector.AnomalyDetector)
[2019-10-04 12:23:33,833] WARN [BROKER_FAILURE-e3d-a747-76fad8a5970a] Self-healing failed to start. (com.linkedin.kafka.cruisecontrol.detector.AnomalyDetector)
[2019-10-04 12:23:33,833] ERROR Uncaught exception in anomaly handler. (com.linkedin.kafka.cruisecontrol.detector.AnomalyDetector)
java.lang.IllegalStateException: Cannot execute new proposals while there is an ongoing execution.
    at com.linkedin.kafka.cruisecontrol.KafkaCruiseControl.sanityCheckDryRun(KafkaCruiseControl.java:190)
    at com.linkedin.kafka.cruisecontrol.servlet.handler.async.runnable.RemoveBrokersRunnable.removeBrokers(RemoveBrokersRunnable.java:127)
    at com.linkedin.kafka.cruisecontrol.detector.BrokerFailures.fix(BrokerFailures.java:56)
    at com.linkedin.kafka.cruisecontrol.detector.AnomalyDetector$AnomalyHandlerTask.fixAnomalyInProgress(AnomalyDetector.java:432)
    at com.linkedin.kafka.cruisecontrol.detector.AnomalyDetector$AnomalyHandlerTask.processAnomalyInProgress(AnomalyDetector.java:308)
    at com.linkedin.kafka.cruisecontrol.detector.AnomalyDetector$AnomalyHandlerTask.handleAnomalyInProgress(AnomalyDetector.java:291)
    at com.linkedin.kafka.cruisecontrol.detector.AnomalyDetector$AnomalyHandlerTask.run(AnomalyDetector.java:262)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

ExecutorState: {state: NO_TASK_IN_PROGRESS}

No reassignment znode exists in zookeeper.

efeg commented 4 years ago

@pmbuko Thanks for reporting this issue!

This issue can happen if the broker failure detector detects a failed broker when it starts -- i.e. the Kafka cluster already has a failed broker when Cruise Control (CC) starts.

If this is the case, due to a race condition between (1) the servlet initialization thread (see code) and (2) the broker failure detector's self-healing thread (see code), it is possible that the executor has not yet learned about the user task manager. When this happens, the self-healing action can cause the ProposalExecutionRunnable thread to fail (see code) in a bad state (i.e. with _hasOngoingExecution=true), leading CC to believe that there is an ongoing execution when there is not.
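The failure mode above can be illustrated with a minimal, self-contained sketch (hypothetical class and field names; this is not the actual Cruise Control code). The key point is a check-then-act sequence: the executor sets its "ongoing execution" flag before touching state that the servlet thread has not initialized yet, and the exception path never resets the flag, so every later attempt is rejected as if an execution were in progress:

```java
// Hypothetical reproduction of the stuck-flag race described above.
// Assumed names (not from the real codebase): Executor, hasOngoingExecution,
// userTaskManager, startExecution.
public class StuckExecutionSketch {

    static class Executor {
        private volatile boolean hasOngoingExecution = false;
        // In the real system this is set by the servlet initialization thread.
        private volatile Object userTaskManager = null;

        void setUserTaskManager(Object utm) {
            userTaskManager = utm;
        }

        void startExecution() {
            // Check-then-act: the flag is flipped before validating dependencies.
            if (hasOngoingExecution) {
                throw new IllegalStateException(
                    "Cannot execute new proposals while there is an ongoing execution.");
            }
            hasOngoingExecution = true;
            if (userTaskManager == null) {
                // The execution thread dies here and the flag is never reset,
                // leaving the executor in a permanently "busy" state.
                throw new IllegalStateException("User task manager is not set yet.");
            }
            // ... proposal execution would proceed here ...
        }
    }

    public static void main(String[] args) {
        Executor executor = new Executor();

        // (2) The self-healing thread fires first, because a broker was
        // already dead when CC started.
        try {
            executor.startExecution();
        } catch (IllegalStateException e) {
            System.out.println("first attempt: " + e.getMessage());
        }

        // (1) Servlet initialization finishes, but too late.
        executor.setUserTaskManager(new Object());

        // All subsequent attempts now fail with the error from the issue,
        // even though no execution is actually in progress.
        try {
            executor.startExecution();
        } catch (IllegalStateException e) {
            System.out.println("second attempt: " + e.getMessage());
        }
    }
}
```

A straightforward fix shape for this kind of bug is to reset the flag in a finally block (or only set it after all preconditions pass), so a failed start does not leave the executor looking busy.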