linkedin / cruise-control

Cruise-control is the first of its kind to fully automate the dynamic workload rebalance and self-healing of a Kafka cluster. It provides great value to Kafka users by simplifying the operation of Kafka clusters.
https://github.com/linkedin/cruise-control/tags
BSD 2-Clause "Simplified" License
2.74k stars 587 forks source link

self healing for one broker failure ran two times #1516

Closed jiao-zhangS closed 3 years ago

jiao-zhangS commented 3 years ago

Hi, we are using 2.4.36 and enabled 'self.healing.broker.failure.enabled'. We observed when one broker failed, self healing was executed after 'broker.failure.self.healing.threshold.ms'(we set it as 5 mins). And self healing was triggered again(out of expectation) after first time execution is finished. We are suspecting the same anomaly was put back to queue again as here's code shows because 'skipReportingIfNotUpdated' set as 'false' here. Is my suspect correct? any clues why two times execution occurred for one broker's failure. Another suspicious point is why the same log are logged twice almost at the same timing..

Following are some logs which can help describe what happened.

At '2021-04-15 20:00:36,773' the second timed execution for self healing is started.

[2021-04-15 20:00:36,773] INFO Fixing the anomaly {Fixable broker failures detected: {Broker 4 failed at 15/04/2021 18:08:13}}. (com.linkedin.kafka.cruisecontrol.detector.AnomalyDetectorManager)
[2021-04-15 20:00:36,773] INFO Fixing the anomaly {Fixable broker failures detected: {Broker 4 failed at 15/04/2021 18:08:13}}. (com.linkedin.kafka.cruisecontrol.detector.AnomalyDetectorManager)
[2021-04-15 20:00:36,774] INFO [d7cd135a-7dc7-4e23-b18d-03e5f4e3a877] Self-healing started successfully. (com.linkedin.kafka.cruisecontrol.detector.AnomalyDetectorManager)
[2021-04-15 20:00:36,774] INFO [d7cd135a-7dc7-4e23-b18d-03e5f4e3a877] Self-healing started successfully. (com.linkedin.kafka.cruisecontrol.detector.AnomalyDetectorManager)
jiao-zhangS commented 3 years ago

sorry, I figured out this. It was due to our cluster has some one-replica topics, so after first time's self healing, failed broker is still detected(because it's out of clusters and has assigned replicas). Then following up self-healing was triggered but did nothing. Sorry for bothering, okay to close this.