Cruise control doesn't auto rebalance even with appropriate goals

aravindvs commented 5 years ago

Cruise control version: v2.0.59

I am testing out cruise control and noticed that cruise control doesn't auto-rebalance when we scale up the brokers.

These are the steps I did:

Have a 3 node Kafka cluster with constant traffic of ~600msgs/second (just 8 topics overall)
Have cruise control running as well with self.healing turned on and anomaly.notifier.class=com.linkedin.kafka.cruisecontrol.detector.notifier.SelfHealingNotifier
Increase the number of brokers to 6

Expectation:

Cruise control detects 'ReplicaDistribution' goal is violated, proposes replica movement and executes rebalance and things are automatically distributed among all the brokers (including new).

Reality:

I do see proposals from cruise control but unless I manually run rebalance, the new brokers don't get any partitions.

Optimization has 131 inter-broker replica(6689 MB) moves, 0 intra-broker replica(0 MB) moves and 36 leadership moves with a cluster model of 1 recent windows and 100.000% of the partitions covered.

Stats for ReplicaDistributionGoal(FIXED):
AVG:{cpu:       2.246 networkInbound:      25.123 networkOutbound:      25.108 disk:    2325.525 potentialNwOut:      75.306 replicas:94.33333333333333 leaderReplicas:35.0 topicReplicas:9.433333333333334}
MAX:{cpu:       3.100 networkInbound:      29.466 networkOutbound:      62.520 disk:    4651.177 potentialNwOut:      88.265 replicas:103 leaderReplicas:43 topicReplicas:33}
MIN:{cpu:       1.640 networkInbound:      20.783 networkOutbound:       4.437 disk:       0.000 potentialNwOut:      62.349 replicas:85 leaderReplicas:28 topicReplicas:0}
STD:{cpu:       0.486 networkInbound:       4.341 networkOutbound:      20.123 disk:    2325.522 potentialNwOut:      12.957 replicas:8.673074554171793 leaderReplicas:6.4807406984078595 topicReplicas:6.869221668645902

But the executor does nothing:

ExecutorState: {state: NO_TASK_IN_PROGRESS}
AnalyzerState: {isProposalReady: true, readyGoals: [NetworkInboundUsageDistributionGoal, CpuUsageDistributionGoal, PotentialNwOutGoal, NetworkInboundCapacityGoal, LeaderBytesInDistributionGoal, DiskCapacityGoal, ReplicaDistributionGoal, RackAwareGoal, TopicReplicaDistributionGoal, NetworkOutboundCapacityGoal, CpuCapacityGoal, DiskUsageDistributionGoal, NetworkOutboundUsageDistributionGoal, ReplicaCapacityGoal]}
AnomalyDetectorState: {selfHealingEnabled:[BROKER_FAILURE, DISK_FAILURE, GOAL_VIOLATION, METRIC_ANOMALY], selfHealingDisabled:[], selfHealingEnabledRatio:{BROKER_FAILURE=1.0, DISK_FAILURE=1.0, GOAL_VIOLATION=1.0, METRIC_ANOMALY=1.0}, recentGoalViolations:[], recentBrokerFailures:[], recentMetricAnomalies:[], recentDiskFailures:[], metrics:{meanTimeBetweenAnomalies:{GOAL_VIOLATION:0.00 milliseconds, BROKER_FAILURE:0.00 milliseconds, METRIC_ANOMALY:0.00 milliseconds}, meanTimeToStartFix:0.00 milliseconds, numSelfHealingStarted:0, ongoingAnomalyDuration=0.00 milliseconds}, ongoingSelfHealingAnomaly:None}

Questions:

Is this expected? Is cruise control expected to auto-rebalance with self-healing turned on?
Will self-healing turn on only when the brokers fail?
Is there any other config knob that is missing which will help cluster auto-scale?

kidkun commented 5 years ago

@aravindvs

CC should auto-rebalance cluster with self-healing turned on. Looks like it is already turned on in your case(from selfHealingEnabled:[BROKER_FAILURE, DISK_FAILURE, GOAL_VIOLATION, METRIC_ANOMALY]), but recentGoalViolations field is empty, it means no goal violation detected recently.
self-healing is enabled/disabled per anomaly type, looks like you already turns on for all 4 type. 3.Can you check which goals you set for anomaly.detection.goals config in config/cruisecontrol.properties. I suspect it is because you does to include ReplicaDistribution there, so GoalViolationDetector does not detect for violation of this goal.

aravindvs commented 5 years ago

This is the goals I have in the config:

    anomaly.detection.goals=com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkInboundCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkOutboundCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.CpuCapacityGoal

Looks like ReplicaDistribution is indeed not there.. Let me try that and see if it works now..

aravindvs commented 5 years ago

@kidkun - thanks.. This does work now after I updated the goals.. btw the configurations wiki needs to be updated (https://github.com/linkedin/cruise-control/wiki/Configurations).. looks like we need a self.healing.goals to be set properly, if not cruise control was crashing with:

Attempt to configure anomaly detection goals as a superset of self healing goals.

kidkun commented 5 years ago

@aravindvs good to know! I will update the doc this week.

mhaseebmlk commented 4 years ago

@aravindvs were you able to get cruise control to trigger auto rebalance every time brokers are added while scaling up? I found this issue and also added ReplicaDistributionGoal to self.anomaly.detector.goals and noticed the cruise control triggered a cluster rebalance once. I then scaled down the cluster and scaled it back up again, however, this time, cruise control did not trigger an auto rebalance and the two new brokers did not get any partitions assigned to those.

I am wondering if this is an issue with cruise control or something wrong with my configuration. We would expect the auto rebalancing to get triggered every time the ReplicaDistributionGoal is violated but it looks like cruise control is not considering it being violated in this case.

efeg commented 4 years ago

@mhaseebmlk This sounds like a configuration issue. Please see https://github.com/linkedin/cruise-control/issues/1270#issuecomment-659609862.

linkedin / cruise-control

Cruise control doesn't auto rebalance even with appropriate goals #906

Questions: