linkedin / cruise-control

Cruise-control is the first of its kind to fully automate the dynamic workload rebalance and self-healing of a Kafka cluster. It provides great value to Kafka users by simplifying the operation of Kafka clusters.
https://github.com/linkedin/cruise-control/tags
BSD 2-Clause "Simplified" License

Cruise control not auto rebalancing after first rebalance #1270

Closed mhaseebmlk closed 4 years ago

mhaseebmlk commented 4 years ago

Cruise Control version: v0.1.10

I am testing scaling up Kafka clusters running on Kubernetes using cruise control. As part of scaling up, I create new Kafka broker pods and then rely on cruise-control to trigger a cluster rebalance that assigns partitions to the newly added brokers, as per this suggestion by @efeg: https://github.com/linkedin/cruise-control/issues/344#issuecomment-426094872.

I also found a similar issue (https://github.com/linkedin/cruise-control/issues/906) to this one where the ReplicaDistributionGoal was missing in the anomaly.detection.goals. Based on this, I have added the ReplicaDistributionGoal to my anomaly.detection.goals config and I am relying on cluster rebalancing to be triggered because of this goal failing when new brokers are added.

However, I noticed that cruise control only triggered a cluster rebalance and assigned partitions to the newly created brokers ONCE. In order to be sure that auto-rebalancing is working, I scaled down the cluster and then added brokers back to the cluster. But this time cruise control did not trigger a cluster rebalance, partitions were not assigned to the new brokers and the new brokers remained idle until I manually triggered a cluster rebalance.

I am attaching some screenshots from the cc-ui for reference.

Brokers 4 and 5 were the new brokers that were launched. It seems like only newly created topics were assigned to these brokers and the existing data was not rebalanced and existing partitions were not assigned to these brokers.

Screen Shot 2020-07-16 at 8 12 40 AM

Cruise Control Monitor State:

Screen Shot 2020-07-16 at 8 12 53 AM

Cruise Control Analyzer State: Note that the ReplicaDistributionGoal seems to be ready to be analyzed.

Screen Shot 2020-07-16 at 8 13 01 AM

Anomaly Detector State: Notice that the ReplicaDistributionGoal violation was detected only ONCE - this was when I added a broker earlier. But when I added more again, this was never detected and self-healing did not get triggered.

Screen Shot 2020-07-16 at 8 13 32 AM

Cruise Control Proposals: It seems like cruise control thinks that it has already fixed the ReplicaDistributionGoal violation? However, as the first screenshot shows, self-healing did not get triggered and the cluster did not auto-rebalance. The cruise control logs also did not show any log line for triggering self-healing.

Screen Shot 2020-07-16 at 8 13 48 AM

Configurations

These are the relevant configurations that I have set up cruise control with:

self.healing.enabled=true

default.goals=com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkInboundCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkOutboundCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.CpuCapacityGoal

goals=com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkInboundCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkOutboundCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.CpuCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.PotentialNwOutGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskUsageDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkInboundUsageDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkOutboundUsageDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.CpuUsageDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.TopicReplicaDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.LeaderReplicaDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.LeaderBytesInDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.kafkaassigner.KafkaAssignerDiskUsageDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.kafkaassigner.KafkaAssignerEvenRackAwareGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.PreferredLeaderElectionGoal

hard.goals=com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkInboundCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkOutboundCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.CpuCapacityGoal

anomaly.detection.goals=com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkInboundCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkOutboundCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.CpuCapacityGoal

efeg commented 4 years ago

Hi @mhaseebmlk

"to be sure that auto-rebalancing is working, I scaled down the cluster and then added brokers back to the cluster."

Would you provide more details on how you scaled down the cluster? If you ran a remove_broker command to scale down, CC adds the removed brokers to a recently removed brokers set and refrains from assigning new replicas to them for a configured period of time.

You can verify this by following these steps:

  1. Check the following configs in cruisecontrol.properties and adjust them if needed (here are the current open source configs):

    # True if recently removed brokers are excluded from optimizations during goal violation self healing, false otherwise
    goal.violation.exclude.recently.removed.brokers=true
    ...
    # The maximum time in milliseconds to retain the removal history of brokers.
    removal.history.retention.time.ms=1209600000
  2. Check the executor substate of Cruise Control. If there are recently removed or demoted brokers, you should see them in the response.
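Step 2 can be scripted. The sketch below assumes the state endpoint is queried as `GET /kafkacruisecontrol/state?substates=executor&json=true` and that the JSON response carries an `ExecutorState` object with a `recentlyRemovedBrokers` list. Both the exact response shape and the helper name `recently_removed_brokers` are assumptions made for illustration; check them against the Cruise Control version you run.

```python
import json


def recently_removed_brokers(state_json: str) -> list:
    """Return the sorted broker IDs found under ExecutorState.recentlyRemovedBrokers.

    Assumes the response shape described above; a missing key yields an empty list.
    """
    state = json.loads(state_json)
    executor = state.get("ExecutorState", {})
    return sorted(executor.get("recentlyRemovedBrokers", []))


# Hypothetical response body, used only to exercise the helper.
sample = (
    '{"ExecutorState": {"state": "NO_TASK_IN_PROGRESS", '
    '"recentlyRemovedBrokers": [4, 5]}}'
)
print(recently_removed_brokers(sample))
```

A non-empty result means those brokers are still in the removal history and will be skipped by goal-violation self-healing while `goal.violation.exclude.recently.removed.brokers=true`.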

efeg commented 4 years ago

@mhaseebmlk To add to https://github.com/linkedin/cruise-control/issues/1270#issuecomment-659609862, note that for REST API requests, you can explicitly set exclude_recently_removed_brokers=false to avoid excluding recently removed brokers. The current default value (i.e. the value used when the parameter is omitted) is true (see code).
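As a rough illustration, a manual rebalance request carrying that parameter could be built like this. The host, port, and the helper name `rebalance_url` are placeholders, and the `/kafkacruisecontrol/rebalance` path with a `dryrun` parameter is assumed from the usual Cruise Control REST layout rather than taken from this thread.

```python
from urllib.parse import urlencode


def rebalance_url(base_url: str, dryrun: bool = True) -> str:
    """Build a rebalance URL that does not exclude recently removed brokers."""
    params = {
        # Keep dryrun=true until the proposal has been reviewed.
        "dryrun": str(dryrun).lower(),
        "exclude_recently_removed_brokers": "false",
    }
    return f"{base_url.rstrip('/')}/kafkacruisecontrol/rebalance?{urlencode(params)}"


# Placeholder address; the request itself would be sent as an HTTP POST.
print(rebalance_url("http://localhost:9090"))
```

This only affects the single request it is attached to; automatic self-healing still follows the settings in the properties file.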

mhaseebmlk commented 4 years ago

Yes, I use the remove_broker endpoint to remove brokers and scale down the cluster.

I verified that the configs you mentioned are set in my config file too. It also seems like the two brokers that were removed while scaling down are indeed added to the recently removed brokers set:

ExecutorState: {state: NO_TASK_IN_PROGRESS, recentlyRemovedBrokers: [4, 5]}

I will try again after lowering removal.history.retention.time.ms to something smaller, and then verify that auto-rebalancing works.
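For a sense of scale, the value quoted earlier in the thread (1209600000 ms) works out to 14 days, which is why brokers 4 and 5 were still excluded when they were added back. The sketch below does the conversion both ways. The one-hour figure is an arbitrary illustration, not the value actually used here, and the helper names are hypothetical.

```python
from datetime import timedelta

# Value of removal.history.retention.time.ms quoted earlier in the thread.
DEFAULT_RETENTION_MS = 1_209_600_000


def retention_as_timedelta(ms: int) -> timedelta:
    """Convert a millisecond config value into a timedelta."""
    return timedelta(milliseconds=ms)


def to_ms(delta: timedelta) -> int:
    """Convert a timedelta into whole milliseconds for the properties file."""
    return delta // timedelta(milliseconds=1)


print(retention_as_timedelta(DEFAULT_RETENTION_MS))  # 14 days, 0:00:00
print(to_ms(timedelta(hours=1)))  # 3600000
```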

mhaseebmlk commented 4 years ago

@efeg that seems to have fixed the problem. Thanks for your help!

efeg commented 4 years ago

@mhaseebmlk Happy to hear that!