linkedin / cruise-control

Cruise-control is the first of its kind to fully automate the dynamic workload rebalance and self-healing of a Kafka cluster. It provides great value to Kafka users by simplifying the operation of Kafka clusters.
https://github.com/linkedin/cruise-control/tags
BSD 2-Clause "Simplified" License

Infinite loop after BROKER_FAILURE - Broker does not exist #402

Closed jocelyndrean closed 5 years ago

jocelyndrean commented 5 years ago

Hi guys! After a double BROKER_FAILURE, CC seems to be stuck in an infinite loop. I lost 2 brokers in a cluster of 3 brokers: broker 1063 at 09:06:41 and broker 1068 at 09:08:19. Since then, CC has been trying to fix the anomaly but gets the exception "java.lang.IllegalArgumentException: Broker [1063, 1068] does not exist."

Does anyone have a clue about this? :)

[2018-11-13 10:08:54,828] WARN BROKER_FAILURE detected {
    Broker 1068 failed at 13/11/2018 09:08:19
    Broker 1063 failed at 13/11/2018 09:06:41
}. Self healing start time 13/11/2018 09:36:41. (com.linkedin.kafka.cruisecontrol.detector.notifier.SelfHealingNotifier)

[2018-11-13 10:08:54,828] WARN Self-healing has been triggered. (com.linkedin.kafka.cruisecontrol.detector.notifier.SelfHealingNotifier)

[2018-11-13 10:08:54,833] INFO Fixing anomaly {
    Broker 1068 failed at 13/11/2018 09:08:19
    Broker 1063 failed at 13/11/2018 09:06:41
} (com.linkedin.kafka.cruisecontrol.detector.AnomalyDetector)

[2018-11-13 10:08:54,839] WARN Anomaly handler received exception when try to fix the anomaly {
    Broker 1068 failed at 13/11/2018 09:08:19
    Broker 1063 failed at 13/11/2018 09:06:41
}. (com.linkedin.kafka.cruisecontrol.detector.AnomalyDetector)
com.linkedin.kafka.cruisecontrol.exception.KafkaCruiseControlException: java.lang.IllegalArgumentException: Broker [1063, 1068] does not exist.
    at com.linkedin.kafka.cruisecontrol.KafkaCruiseControl.decommissionBrokers(KafkaCruiseControl.java:184)
    at com.linkedin.kafka.cruisecontrol.detector.BrokerFailures.fix(BrokerFailures.java:44)
    at com.linkedin.kafka.cruisecontrol.detector.AnomalyDetector$AnomalyHandlerTask.fixAnomaly(AnomalyDetector.java:268)
    at com.linkedin.kafka.cruisecontrol.detector.AnomalyDetector$AnomalyHandlerTask.run(AnomalyDetector.java:203)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException: Broker [1063, 1068] does not exist.
    at com.linkedin.kafka.cruisecontrol.KafkaCruiseControl.sanityCheckBrokerPresence(KafkaCruiseControl.java:779)
    at com.linkedin.kafka.cruisecontrol.KafkaCruiseControl.decommissionBrokers(KafkaCruiseControl.java:166)
    ... 10 more

[2018-11-13 10:08:54,839] WARN BROKER_FAILURE detected {
    Broker 1068 failed at 13/11/2018 09:08:19
    Broker 1063 failed at 13/11/2018 09:06:41
}. Self healing start time 13/11/2018 09:36:41. (com.linkedin.kafka.cruisecontrol.detector.notifier.SelfHealingNotifier)

[2018-11-13 10:08:54,839] WARN Self-healing has been triggered. (com.linkedin.kafka.cruisecontrol.detector.notifier.SelfHealingNotifier)

[2018-11-13 10:08:54,844] INFO Fixing anomaly {
    Broker 1068 failed at 13/11/2018 09:08:19
    Broker 1063 failed at 13/11/2018 09:06:41
} (com.linkedin.kafka.cruisecontrol.detector.AnomalyDetector)

[2018-11-13 10:08:54,849] WARN Anomaly handler received exception when try to fix the anomaly {
    Broker 1068 failed at 13/11/2018 09:08:19
    Broker 1063 failed at 13/11/2018 09:06:41
}. (com.linkedin.kafka.cruisecontrol.detector.AnomalyDetector)
com.linkedin.kafka.cruisecontrol.exception.KafkaCruiseControlException: java.lang.IllegalArgumentException: Broker [1063, 1068] does not exist.
    at com.linkedin.kafka.cruisecontrol.KafkaCruiseControl.decommissionBrokers(KafkaCruiseControl.java:184)
    at com.linkedin.kafka.cruisecontrol.detector.BrokerFailures.fix(BrokerFailures.java:44)
    at com.linkedin.kafka.cruisecontrol.detector.AnomalyDetector$AnomalyHandlerTask.fixAnomaly(AnomalyDetector.java:268)
    at com.linkedin.kafka.cruisecontrol.detector.AnomalyDetector$AnomalyHandlerTask.run(AnomalyDetector.java:203)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException: Broker [1063, 1068] does not exist.
    at com.linkedin.kafka.cruisecontrol.KafkaCruiseControl.sanityCheckBrokerPresence(KafkaCruiseControl.java:779)
    at com.linkedin.kafka.cruisecontrol.KafkaCruiseControl.decommissionBrokers(KafkaCruiseControl.java:166)
    ... 10 more

jocelyndrean commented 5 years ago

I had to restart Cruise Control and do a rolling restart of Kafka to fix this issue.

efeg commented 5 years ago

Hi @jocelyndrean, this is a known issue that has been fixed in PR https://github.com/linkedin/cruise-control/pull/352. Could you try a Cruise Control version that includes this fix (i.e. 0.1.10 or later) and let us know if the issue persists?
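
For readers following along, the log above suggests this failure mode: self-healing asks to decommission the failed brokers, a presence check rejects them because they are no longer in the cluster, and the anomaly is picked up again with the same result. The Java sketch below only illustrates that reading of the stack trace. Every class, method, and flag in it is hypothetical, the surviving broker ID is made up, and `strictCheck` is just a stand-in for "some way of not rejecting failed brokers"; it is not a description of what PR 352 actually changes.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch; none of these names are Cruise Control's real API.
final class BrokerFailureSketch {

    // Rejects any requested broker that is absent from the set of brokers
    // currently visible in the cluster, using the message seen in the log.
    static void sanityCheckPresence(Set<Integer> requested, Set<Integer> alive) {
        Set<Integer> missing = new TreeSet<>(requested); // sorted for a stable message
        missing.removeAll(alive);
        if (!missing.isEmpty()) {
            throw new IllegalArgumentException("Broker " + missing + " does not exist.");
        }
    }

    // One self-healing attempt. With strictCheck enabled, failed brokers can
    // never pass the presence check, so every attempt fails the same way.
    static boolean tryFix(Set<Integer> failed, Set<Integer> alive, boolean strictCheck) {
        try {
            if (strictCheck) {
                sanityCheckPresence(failed, alive);
            }
            return true; // a real fix would move replicas off the failed brokers here
        } catch (IllegalArgumentException e) {
            return false; // the anomaly stays unresolved and is handled again
        }
    }

    public static void main(String[] args) {
        Set<Integer> failed = new HashSet<>(Arrays.asList(1063, 1068));
        Set<Integer> alive = new HashSet<>(Arrays.asList(1)); // hypothetical surviving broker ID

        // Bounded here; in the report the retries never stopped.
        for (int attempt = 1; attempt <= 3; attempt++) {
            System.out.println("attempt " + attempt + " fixed=" + tryFix(failed, alive, true));
        }
        System.out.println("without strict check fixed=" + tryFix(failed, alive, false));
    }
}
```

The point of the sketch is that a broker being absent is exactly what defines a broker failure, so a check that requires presence cannot succeed on retry until the broker comes back or the process is restarted.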

jocelyndrean commented 5 years ago

Hello @efeg! Thanks for your feedback. I'm running version 2.0.8.

jocelyndrean commented 5 years ago

I was able to reproduce it today with a cluster of 3 brokers (IDs: 1066, 1067, 1069):

efeg commented 5 years ago

@jocelyndrean Ah, I assumed that you were using a released version from the master branch (i.e. 0.1.*), not the version of CC that supports Kafka 2.0 (i.e. CC versions 2.*). It turns out that the patch I referred to above (https://github.com/linkedin/cruise-control/pull/352) was never cherry-picked into the migrate_to_kafka_2_0 branch, so version 2.0.8 was missing that fix.

I just created version 2.0.9 with the fix (see https://github.com/linkedin/cruise-control/releases/tag/2.0.9), which should resolve the issue. Sorry for the confusion, and thanks for reporting this. Hope it helps!
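
To summarize the version information in this thread: the fix is in 0.1.10 or later on the 0.1.* line and in 2.0.9 on the 2.* line. A minimal Java sketch of that check follows, assuming plain `major.minor.patch` version strings. The class and method names are hypothetical, and release lines the thread does not mention are rejected rather than guessed at.

```java
// Hypothetical helper; encodes only the versions named in this thread.
final class FixVersionSketch {

    // Returns true if the given version is at or past the first release
    // on its line that the thread says contains the fix.
    static boolean hasFix(String version) {
        String[] parts = version.trim().split("\\.");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected major.minor.patch, got: " + version);
        }
        int major = Integer.parseInt(parts[0]);
        int minor = Integer.parseInt(parts[1]);
        int patch = Integer.parseInt(parts[2]);

        if (major == 0 && minor == 1) {
            return patch >= 10; // "0.1.10 or later"
        }
        if (major == 2 && minor == 0) {
            return patch >= 9;  // 2.0.8 lacked the fix, 2.0.9 has it
        }
        throw new IllegalArgumentException("Release line not covered by this thread: " + version);
    }

    public static void main(String[] args) {
        for (String v : new String[] {"2.0.8", "2.0.9", "0.1.9", "0.1.10"}) {
            System.out.println(v + " has fix: " + hasFix(v));
        }
    }
}
```

The components are compared as integers because a plain string comparison would rank 0.1.9 above 0.1.10.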

jocelyndrean commented 5 years ago

Thanks for this release 2.0.9 :)