linkedin / cruise-control

Cruise-control is the first of its kind to fully automate the dynamic workload rebalance and self-healing of a Kafka cluster. It provides great value to Kafka users by simplifying the operation of Kafka clusters.
https://github.com/linkedin/cruise-control/tags
BSD 2-Clause "Simplified" License

Upgraded CC (2.5 branch): dead-broker auto-healing errors out #1799

Closed (bpux closed 1 year ago)

bpux commented 2 years ago

Hi, we run Cruise Control (built from the '[migrate_to_kafka_2_5]' branch, updated to [Update README regarding Kafka 3.0 and 3.1 support]), with 'auto-heal' enabled for BROKER_FAILURE.

After upgrading CC, we found that when a broker died, CC detected the failure, but the executor errored out. From the code and the log, the error seems to come from trying to set the replication throttle on the dead broker.

I enabled debug logging and found these errors in the log:

[2022-03-07 15:26:45,128] ERROR error when trying to get entity config for :ConfigResource(type=BROKER, name='1453201732')  (com.linkedin.kafka.cruisecontrol.executor.ReplicationThrottleHelper)
java.util.concurrent.TimeoutException: null
        at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1886) ~[?:?]
        at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021) ~[?:?]
        at org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:180) ~[kafka-clients-3.1.0.jar:?]
        at com.linkedin.kafka.cruisecontrol.executor.ReplicationThrottleHelper.getEntityConfigs(ReplicationThrottleHelper.java:205) [cruise-control-2.5.86-SNAPSHOT.jar:?]       

Our Kafka version is 2.7.1. I also reproduced it with Kafka 2.6.2 by

Auto-healing ran fine in the previous version, before we updated CC. I'm not sure what change caused this (CC or the kafka-clients API), or whether something is wrong in our configuration, which we have not changed. For now, I've patched our version (I can submit a PR), but I would like to know more about the issue. Thanks!

HerveRiviere commented 2 years ago

We also hit this issue with a 2.5.86 CC version and Kafka 2.7.1.

Not during self-healing, but during a regular 'add_broker' operation while broker_id 41 was dead.

CC was crashing with the timeout exception above.

I initially tried to patch the code by increasing the timeout from 30 seconds to 3 minutes, without success.
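One plausible reading of why a longer timeout did not help: if the config lookup for the dead broker can never be answered, the future behind it never completes, so any finite wait ends the same way. The sketch below shows only that general `CompletableFuture` behavior with plain JDK classes. The class and method names (`TimeoutSketch`, `timesOut`) are hypothetical and are not taken from the Cruise Control code, and the millisecond values are chosen just to keep the demo short.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical demo class; not part of Cruise Control.
final class TimeoutSketch {

    /** Returns true if waiting on the future for timeoutMs ends in a TimeoutException. */
    static boolean timesOut(CompletableFuture<?> future, long timeoutMs) {
        try {
            future.get(timeoutMs, TimeUnit.MILLISECONDS);
            return false; // the future completed within the timeout
        } catch (TimeoutException e) {
            return true; // nothing completed the future before the deadline
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for a request to a broker that will never answer.
        CompletableFuture<String> neverCompletes = new CompletableFuture<>();
        // A short wait and a longer wait both end in a timeout.
        System.out.println(timesOut(neverCompletes, 50));
        System.out.println(timesOut(neverCompletes, 200));
    }
}
```

If this is what happens in the executor, raising the timeout only delays the failure; avoiding the request to the dead broker is the change that matters.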

Then I applied this hack:

        if (broker != 41) { // 41 is the stopped broker, so skip throttling it
          setThrottledRateIfUnset(broker);
        }

And the rebalance ran smoothly.

@bpux Is your patch available somewhere on GitHub, or can you submit a PR? It would save us some time. Thanks!
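A less hard-coded version of the same idea is to drop every known-dead broker from the set of brokers that get a throttle, instead of comparing against a single id. This is only a minimal sketch under assumptions: `ThrottleTargetSketch`, `brokersToThrottle`, and the `deadBrokers` parameter are hypothetical names, how the set of dead brokers is obtained is not shown, and this is not the patch mentioned in this thread. Broker ids 40 and 42 are made up for the demo.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper; not part of Cruise Control.
final class ThrottleTargetSketch {

    /** Returns the participating brokers minus the dead ones, sorted by broker id. */
    static Set<Integer> brokersToThrottle(Set<Integer> participatingBrokers,
                                          Set<Integer> deadBrokers) {
        Set<Integer> targets = new TreeSet<>(participatingBrokers); // copy, inputs stay untouched
        targets.removeAll(deadBrokers);
        return targets;
    }

    public static void main(String[] args) {
        Set<Integer> participating = new TreeSet<>(Arrays.asList(40, 41, 42));
        Set<Integer> dead = Collections.singleton(41); // 41 is the stopped broker
        for (int broker : brokersToThrottle(participating, dead)) {
            // In the real executor this is where the throttle would be applied.
            System.out.println("would set throttle on broker " + broker);
        }
    }
}
```

Filtering up front keeps the throttling loop itself unchanged and handles more than one dead broker at a time.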