linkedin / cruise-control

Cruise-control is the first of its kind to fully automate the dynamic workload rebalance and self-healing of a Kafka cluster. It provides great value to Kafka users by simplifying the operation of Kafka clusters.
https://github.com/linkedin/cruise-control/tags
BSD 2-Clause "Simplified" License

CC from branch migrate_to_kafka_2_5 works with errors when setting default.replication.throttle in properties #1865

Open khodyrevyurii opened 2 years ago

khodyrevyurii commented 2 years ago

Hi.

We hit an error in Cruise Control when default.replication.throttle is set and we try to remove a dead broker from the cluster.

Error:

[2022-07-13 17:51:32,153] ERROR Executor got exception during execution (com.linkedin.kafka.cruisecontrol.executor.Executor)
java.util.concurrent.TimeoutException: null
        at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1886) ~[?:?]
        at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021) ~[?:?]
        at org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:180) ~[kafka-clients-3.1.0.jar:?]
        at com.linkedin.kafka.cruisecontrol.executor.ReplicationThrottleHelper.getEntityConfigs(ReplicationThrottleHelper.java:203) ~[cruise-control-2.5.95-SNAPSHOT.jar:?]
        at com.linkedin.kafka.cruisecontrol.executor.ReplicationThrottleHelper.getBrokerConfigs(ReplicationThrottleHelper.java:198) ~[cruise-control-2.5.95-SNAPSHOT.jar:?]
        at com.linkedin.kafka.cruisecontrol.executor.ReplicationThrottleHelper.setThrottledRateIfUnset(ReplicationThrottleHelper.java:169) ~[cruise-control-2.5.95-SNAPSHOT.jar:?]
        at com.linkedin.kafka.cruisecontrol.executor.ReplicationThrottleHelper.setThrottles(ReplicationThrottleHelper.java:68) ~[cruise-control-2.5.95-SNAPSHOT.jar:?]
        at com.linkedin.kafka.cruisecontrol.executor.Executor$ProposalExecutionRunnable.interBrokerMoveReplicas(Executor.java:1345) ~[cruise-control-2.5.95-SNAPSHOT.jar:?]
        at com.linkedin.kafka.cruisecontrol.executor.Executor$ProposalExecutionRunnable.execute(Executor.java:1177) ~[cruise-control-2.5.95-SNAPSHOT.jar:?]
        at com.linkedin.kafka.cruisecontrol.executor.Executor$ProposalExecutionRunnable.run(Executor.java:1103) ~[cruise-control-2.5.95-SNAPSHOT.jar:?]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
        at java.lang.Thread.run(Thread.java:829) ~[?:?]
[2022-07-13 17:51:32,154] WARN Task [ac0187f8-de9c-48b7-abe4-801c25a0269e] userPOST /kafkacruisecontrol/remove_broker execution is interrupted with exception null. (operationLogger)
[2022-07-13 17:51:32,155] INFO Execution finished. (com.linkedin.kafka.cruisecontrol.executor.Executor)
[2022-07-13 17:51:32,155] INFO Execution finished. (com.linkedin.kafka.cruisecontrol.executor.Executor)

Environment:

Steps to reproduce:

  1. Set default.replication.throttle in cruisecontrol.properties (in my example, default.replication.throttle=10000000).
  2. Configure the cluster with rf=3 and min.isr=2, create a couple of topics with 10 partitions, and generate some test data.
  3. Shut down one broker in the cluster.
  4. After 15 minutes, try to remove the broker via the CC REST API (/remove_broker?brokerid={{broker_id}}&dryrun=false).
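
For anyone reproducing step 4 from code rather than a REST client, the request can be sketched roughly as follows. This is only an illustration: the class and method names are made up, and the base URL http://localhost:9090 and broker id 3 are assumptions that depend on your deployment. The path and the POST method are taken from the log line above.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class RemoveBrokerRequest {

    // Hypothetical helper (not part of Cruise Control): builds the POST request
    // for the remove_broker endpoint. baseUrl is an assumed deployment address.
    static HttpRequest build(String baseUrl, int brokerId, boolean dryRun) {
        URI uri = URI.create(baseUrl
                + "/kafkacruisecontrol/remove_broker?brokerid=" + brokerId
                + "&dryrun=" + dryRun);
        return HttpRequest.newBuilder(uri)
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
    }

    public static void main(String[] args) {
        // Only builds and prints the request; sending it would need
        // java.net.http.HttpClient and a reachable Cruise Control instance.
        HttpRequest request = build("http://localhost:9090", 3, false);
        System.out.println(request.method() + " " + request.uri());
    }
}
```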

Unfortunately, I have no development experience, so I can only guess, but the problem seems to occur in this block: https://github.com/linkedin/cruise-control/blob/migrate_to_kafka_2_5/cruise-control/src/main/java/com/linkedin/kafka/cruisecontrol/executor/ReplicationThrottleHelper.java#L63

On our cluster, we temporarily worked around the problem by reverting the changes from this PR: https://github.com/linkedin/cruise-control/pull/1781

HerveRiviere commented 2 years ago

Looks like a duplicate of this issue https://github.com/linkedin/cruise-control/issues/1799 (upgraded CC(2.5 branch) and dead-broker auto healing function error out)

CCisGG commented 1 year ago

It's probably because the timeout config is set too low by default:

https://github.com/linkedin/cruise-control/blob/e3d43d71526dbf70365607039f0bb0938f619373/cruise-control/src/main/java/com/linkedin/kafka/cruisecontrol/executor/ReplicationThrottleHelper.java#L203

kooli89 commented 1 year ago

The timeout is 30 seconds which should be enough to get one broker configuration via the admin client. But in the case of a dead broker the client always reaches the timeout. It doesn't make much sense to apply throttling to a dead broker, right?
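
A minimal sketch of why the log shows "TimeoutException: null", assuming the config lookup for a dead broker simply never completes. The class and method names are hypothetical, and a never-completed CompletableFuture stands in for the admin client call; this is not the Cruise Control code itself.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class DeadBrokerTimeout {

    // Hypothetical stand-in for waiting on a broker config lookup.
    // Returns the config, or a description of the timeout if none arrives.
    static String waitForConfig(CompletableFuture<String> pending, long timeoutMs) {
        try {
            return pending.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // The exception carries no message, which matches the "null" in the log.
            return "timed out, message=" + e.getMessage();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // A future that is never completed models a broker that never answers.
        CompletableFuture<String> deadBrokerReply = new CompletableFuture<>();
        System.out.println(waitForConfig(deadBrokerReply, 100)); // prints "timed out, message=null"
    }
}
```

If that assumption holds, raising the timeout would only delay the failure, since a dead broker never answers.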

CCisGG commented 1 year ago

@kooli89 You made a very good point. I think we need to skip setting up the throttling config for dead brokers.
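
One way to picture the idea, as a rough sketch only: before applying throttles, intersect the brokers participating in the move with the brokers currently alive. The class name, method name, and broker ids below are hypothetical and not taken from ReplicationThrottleHelper; how the set of alive brokers is obtained is left out.

```java
import java.util.Set;
import java.util.TreeSet;

public class ThrottleTargets {

    // Hypothetical helper: keeps only the participating brokers that are alive,
    // so no throttle config is read from or written to a dead broker.
    static Set<Integer> brokersToThrottle(Set<Integer> participatingBrokers,
                                          Set<Integer> aliveBrokers) {
        Set<Integer> targets = new TreeSet<>(participatingBrokers);
        targets.retainAll(aliveBrokers);
        return targets;
    }

    public static void main(String[] args) {
        Set<Integer> participating = Set.of(1, 2, 3); // broker 3 is the dead one
        Set<Integer> alive = Set.of(1, 2);
        System.out.println(brokersToThrottle(participating, alive)); // prints [1, 2]
    }
}
```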

khodyrevyurii commented 1 year ago

> The timeout is 30 seconds which should be enough to get one broker configuration via the admin client. But in the case of a dead broker the client always reaches the timeout. It doesn't make much sense to apply throttling to a dead broker, right?

Hi. Yes, that was exactly the problem. Sorry for the late reply, I was out of touch.