linkedin / cruise-control

Cruise-control is the first of its kind to fully automate the dynamic workload rebalance and self-healing of a Kafka cluster. It provides great value to Kafka users by simplifying the operation of Kafka clusters.
https://github.com/linkedin/cruise-control/tags
BSD 2-Clause "Simplified" License
2.74k stars 587 forks source link

Cruise control doesn't auto rebalance even with appropriate goals #906

Closed aravindvs closed 5 years ago

aravindvs commented 5 years ago

Cruise control version: v2.0.59

I am testing out cruise control and noticed that cruise control doesn't auto-rebalance when we scale up the brokers.

These are the steps I did:

  1. Have a 3 node Kafka cluster with constant traffic of ~600msgs/second (just 8 topics overall)
  2. Have cruise control running as well with self.healing turned on and anomaly.notifier.class=com.linkedin.kafka.cruisecontrol.detector.notifier.SelfHealingNotifier
  3. Increase the number of brokers to 6

Expectation:

Reality:

Optimization has 131 inter-broker replica(6689 MB) moves, 0 intra-broker replica(0 MB) moves and 36 leadership moves with a cluster model of 1 recent windows and 100.000% of the partitions covered.

Stats for ReplicaDistributionGoal(FIXED):
AVG:{cpu:       2.246 networkInbound:      25.123 networkOutbound:      25.108 disk:    2325.525 potentialNwOut:      75.306 replicas:94.33333333333333 leaderReplicas:35.0 topicReplicas:9.433333333333334}
MAX:{cpu:       3.100 networkInbound:      29.466 networkOutbound:      62.520 disk:    4651.177 potentialNwOut:      88.265 replicas:103 leaderReplicas:43 topicReplicas:33}
MIN:{cpu:       1.640 networkInbound:      20.783 networkOutbound:       4.437 disk:       0.000 potentialNwOut:      62.349 replicas:85 leaderReplicas:28 topicReplicas:0}
STD:{cpu:       0.486 networkInbound:       4.341 networkOutbound:      20.123 disk:    2325.522 potentialNwOut:      12.957 replicas:8.673074554171793 leaderReplicas:6.4807406984078595 topicReplicas:6.869221668645902

But the executor does nothing:

ExecutorState: {state: NO_TASK_IN_PROGRESS}
AnalyzerState: {isProposalReady: true, readyGoals: [NetworkInboundUsageDistributionGoal, CpuUsageDistributionGoal, PotentialNwOutGoal, NetworkInboundCapacityGoal, LeaderBytesInDistributionGoal, DiskCapacityGoal, ReplicaDistributionGoal, RackAwareGoal, TopicReplicaDistributionGoal, NetworkOutboundCapacityGoal, CpuCapacityGoal, DiskUsageDistributionGoal, NetworkOutboundUsageDistributionGoal, ReplicaCapacityGoal]}
AnomalyDetectorState: {selfHealingEnabled:[BROKER_FAILURE, DISK_FAILURE, GOAL_VIOLATION, METRIC_ANOMALY], selfHealingDisabled:[], selfHealingEnabledRatio:{BROKER_FAILURE=1.0, DISK_FAILURE=1.0, GOAL_VIOLATION=1.0, METRIC_ANOMALY=1.0}, recentGoalViolations:[], recentBrokerFailures:[], recentMetricAnomalies:[], recentDiskFailures:[], metrics:{meanTimeBetweenAnomalies:{GOAL_VIOLATION:0.00 milliseconds, BROKER_FAILURE:0.00 milliseconds, METRIC_ANOMALY:0.00 milliseconds}, meanTimeToStartFix:0.00 milliseconds, numSelfHealingStarted:0, ongoingAnomalyDuration=0.00 milliseconds}, ongoingSelfHealingAnomaly:None}

Questions:

kidkun commented 5 years ago

@aravindvs

  1. CC should auto-rebalance cluster with self-healing turned on. Looks like it is already turned on in your case(from selfHealingEnabled:[BROKER_FAILURE, DISK_FAILURE, GOAL_VIOLATION, METRIC_ANOMALY]), but recentGoalViolations field is empty, it means no goal violation detected recently.
  2. self-healing is enabled/disabled per anomaly type, looks like you already turns on for all 4 type. 3.Can you check which goals you set for anomaly.detection.goals config in config/cruisecontrol.properties. I suspect it is because you does to include ReplicaDistribution there, so GoalViolationDetector does not detect for violation of this goal.
aravindvs commented 5 years ago

This is the goals I have in the config:

    anomaly.detection.goals=com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkInboundCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkOutboundCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.CpuCapacityGoal

Looks like ReplicaDistribution is indeed not there.. Let me try that and see if it works now..

aravindvs commented 5 years ago

@kidkun - thanks.. This does work now after I updated the goals.. btw the configurations wiki needs to be updated (https://github.com/linkedin/cruise-control/wiki/Configurations).. looks like we need a self.healing.goals to be set properly, if not cruise control was crashing with:

Attempt to configure anomaly detection goals as a superset of self healing goals.

kidkun commented 5 years ago

@aravindvs good to know! I will update the doc this week.

mhaseebmlk commented 4 years ago

@aravindvs were you able to get cruise control to trigger auto rebalance every time brokers are added while scaling up? I found this issue and also added ReplicaDistributionGoal to self.anomaly.detector.goals and noticed the cruise control triggered a cluster rebalance once. I then scaled down the cluster and scaled it back up again, however, this time, cruise control did not trigger an auto rebalance and the two new brokers did not get any partitions assigned to those.

I am wondering if this is an issue with cruise control or something wrong with my configuration. We would expect the auto rebalancing to get triggered every time the ReplicaDistributionGoal is violated but it looks like cruise control is not considering it being violated in this case.

efeg commented 4 years ago

@mhaseebmlk This sounds like a configuration issue. Please see https://github.com/linkedin/cruise-control/issues/1270#issuecomment-659609862.