Closed mhaseebmlk closed 4 years ago
Hi @mhaseebmlk
to be sure that auto-rebalancing is working, I scaled down the cluster and then added brokers back to the cluster.
Would you provide more details on how you scaled down the cluster? If you ran a remove_broker
command to scale down, CC adds the removed brokers to a recently removed brokers
set and refrains from assigning new replicas to them for a configured period of time.
You can verify this by following these steps:
Check the following configs in crusiecontrol.properties
and adjust them if needed (here are the current open source configs):
# True if recently removed brokers are excluded from optimizations during goal violation self healing, false otherwise
goal.violation.exclude.recently.removed.brokers=true
...
# The maximum time in milliseconds to retain the removal history of brokers.
removal.history.retention.time.ms=1209600000
Check the executor substate of Cruise Control. If there are recently removed or demoted brokers, you should see them in the response.
@mhaseebmlk To add on https://github.com/linkedin/cruise-control/issues/1270#issuecomment-659609862, note that for REST API requests, you can explicitly set exclude_recently_removed_brokers=false
to avoid excluding recently removed brokers. The current default value (i.e. value used if missing) for this parameter is true
(see code).
Yes, I use the remove_broker
endpoint to remove brokers and scale down the cluster.
I verified that the configs you mentioned are set in my configs file too. It also seems like the two brokers that were removed while scaling down are indeed added to the recently removed brokers
set: ExecutorState: {state: NO_TASK_IN_PROGRESS, recentlyRemovedBrokers: [4, 5]}
I will try again by adjusting the removal.history.retention.time.ms
time to something smaller and verifying that auto rebalancing is working.
@efeg that seems to have fixed the problem. Thanks for your help!
@mhaseebmlk Happy to hear that!
Cruise Control version: v0.1.10
I am testing scaling up Kafka clusters running on kubernetes using cruise control. As part of scaling up, I am creating new Kafka broker pods and then relying on cruise-control to trigger a cluster rebalance to assign partitions to the newly added brokers as per this suggestion by @efeg: https://github.com/linkedin/cruise-control/issues/344#issuecomment-426094872.
I also found a similar issue (https://github.com/linkedin/cruise-control/issues/906) to this one where the
ReplicaDistributionGoal
was missing in theanomaly.detection.goals
. Based on this, I have added theReplicaDistributionGoal
to myanomaly.detection.goals
config and I am relying on cluster rebalancing to be triggered because of this goal failing when new brokers are added.However, I noticed that cruise control only triggered a cluster rebalance and assigned partitions to the newly created brokers ONCE. In order to be sure that auto-rebalancing is working, I scaled down the cluster and then added brokers back to the cluster. But this time cruise control did not trigger a cluster rebalance, partitions were not assigned to the new brokers and the new brokers remained idle until I manually triggered a cluster rebalance.
I am attaching some screenshots from the cc-ui for reference.
Brokers 4 and 5 were the new brokers that were launched. It seems like only newly created topics were assigned to these brokers and the existing data was not rebalanced and existing partitions were not assigned to these brokers.
Cruise Control Monitor State:
Cruise Control Analyzer State: Note that the
ReplicaDistributionGoal
seems to be ready to be analyzed.Anomaly Detector State: Notice that the
ReplicaDistributionGoal
violation was detected only ONCE - this was when I added a broker earlier. But when I added more again, this was never detected and self-healing did not get triggered.Cruise Control Proposals: It seems like cruise control thinks that it has already fixed the
ReplicaDistributionGoal
violation? However, as the first screenshot shows, self-healing did not get triggered and the cluster did not auto-rebalance. The cruise control logs also did not show any log line for triggering self-healing.Configurations
These are the relevant configurations that I have setup cruise control with:
self.healing.enabled=true
default.goals=com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkInboundCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkOutboundCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.CpuCapacityGoal
goals=com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkInboundCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkOutboundCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.CpuCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.PotentialNwOutGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskUsageDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkInboundUsageDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkOutboundUsageDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.CpuUsageDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.TopicReplicaDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.LeaderReplicaDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.LeaderBytesInDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.kafkaassigner.KafkaAssignerDiskUsageDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.kafkaassigner.KafkaAssignerEvenRackAwareGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.PreferredLeaderElectionGoal
hard.goals=com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkInboundCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkOutboundCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.CpuCapacityGoal
anomaly.detection.goals=com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkInboundCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkOutboundCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.CpuCapacityGoal