linkedin / cruise-control

Cruise-control is the first of its kind to fully automate the dynamic workload rebalance and self-healing of a Kafka cluster. It provides great value to Kafka users by simplifying the operation of Kafka clusters.
https://github.com/linkedin/cruise-control/tags
BSD 2-Clause "Simplified" License
2.74k stars 585 forks source link

Cruise control rebalance hangs when setting static broker configurations along with `default.replication.throttle` #2078

Closed rosakng closed 10 months ago

rosakng commented 10 months ago

Today the way Cruise Control removes broker throttle rate configurations during a rebalance is to delete the current configurations for leader and follower replication throttled rate, then expect those configurations to be null after removing them and before continuing on with the rebalance. Though this is true in the case of removing dynamic broker configurations, however, if there is a static broker configuration set for leader.replication.throttled.rate or follower.replication.throttled.rate, the configurations will never equal in the check, leaving the rebalance to hang and exhaust all of its retries without continuing.

The state of cruise control then appears to have paused in either moving more data or fully completing the process. Cruise control should be able to run rebalances even if a static configuration is set for leader.replication.throttled.rate or follower.replication.throttled.rate along with default.replication.throttle.

rosakng commented 10 months ago

A proposed solution when clearing throttles is to remove the throttle rate configuration if the current throttle rate is of a dynamic broker config source type, such that only dynamic broker configurations are being removed. In order to support static configurations being returned after a previous dynamic configuration is removed in waitForConfigs is to set the expected configuration value to the current static configuration value, only if the current configuration is set static-ly.

👋 @CCisGG , would you happen to know if this is a known issue or something I can contribute for a fix?

CCisGG commented 10 months ago

Hi @rosakng , that's a good finding. Thanks for taking investigation on this!

From my understand, when there is any static config for throttling on Kafka side, CC does not have a way to remove the throttling. In that case, if user ask CC to remove throttling during rebalance, wouldn't it make more sense for CC to set the dynamic config value to 0? In that way, the value 0 should override the static config and the throttling should still work. The way you are proposal has a potential issue: when user set throttling to 0 for rebalance, they will assume CC rebalances with no throttling, while the rebalance is actually throttled by broker side config. Does that make sense?

rosakng commented 10 months ago

HI @CCisGG Thanks for your reply! My concern with setting default.replication.throttle to 0 is that based off of the config specifications for follower and leader throttled rates, the upper bound on inbound/outbound replication traffic would be 0 bytes/second, causing there to be no partition movements.

I tried testing this out and saw that no movements were being made:

image

In addition, CC would not use the default.replication.throttle config if an existing static/dynamic broker configuration is already in place because of the setIfUnset logic.

I propose that if a user asks CC to rebalance using default.replication.throttle config, the value should overwrite any static/dynamic broker config that may already exist on the Kafka side. What do you think?

CCisGG commented 10 months ago

My concern with setting default.replication.throttle to 0 is that based off of the config specifications for follower and leader throttled rates, the upper bound on inbound/outbound replication traffic would be 0 bytes/second, causing there to be no partition movements.

@rosakng Just realize it, and you are absolutely right. I think override the config makes more senses. Do we need to remove the override after the rebalance? I'm not super sure about it but might be something to consider. Please feel free to work on that and I'll review it as needed.

rosakng commented 10 months ago

@CCisGG Yup, it seems to make sense to remove the override after the rebalance just like how the throttled rates are removed when it is cleared. Will have a PR up soon!