linkedin / cruise-control

Cruise-control is the first of its kind to fully automate the dynamic workload rebalance and self-healing of a Kafka cluster. It provides great value to Kafka users by simplifying the operation of Kafka clusters.
https://github.com/linkedin/cruise-control/tags
BSD 2-Clause "Simplified" License
2.68k stars 574 forks source link

Slow task execution due to long retry backoff in ReplicationThrottleHelper #2147

Open eazama opened 2 months ago

eazama commented 2 months ago

I've observed extremely long gaps in executions of task batches. Based on the logs, these gaps appear to be the result of ReplicationThrottleHelper submitting requests to add or remove the replication throttle rate configurations and then waiting for the change to be reflected on the broker. Specifically, there is a 10 second gap between each broker's configuration change. This can lead to minutes of time doing nothing, even on small 6 or 12 node clusters.

The specific mechanism that ReplicationThrottleHelper is using to wait for changes is to call CruiseControlMetricsUtils.retry, specifically, the overload that uses the default backoff configurations of scale=5 seconds and base=2.

I assume that the first describe request doesn't return the expected configurations for some reason, resulting in the retry loop triggering the first 10 second backoff.

10 seconds seems like an excessive amount of time to wait for the first retry, so it would be nice if the retry loop started with a much smaller scale, somewhere on the order of a few milliseconds. Because the exponential backoff has no cap, starting with a small scale is somewhat necessary to prevent the backoff from becoming unreasonably large after only one or two retries.