linkedin / cruise-control

Cruise-control is the first of its kind to fully automate the dynamic workload rebalance and self-healing of a Kafka cluster. It provides great value to Kafka users by simplifying the operation of Kafka clusters.
https://github.com/linkedin/cruise-control/tags
BSD 2-Clause "Simplified" License

trained stopped at 20% #373

Closed by craynic 5 years ago

craynic commented 5 years ago

I have set up a Kafka cluster with two brokers on two VMs. Cruise Control has been running for hours, and I keep checking /kafkacruisecontrol/state. It looked fine at first: the trained percentage kept rising for the first few hours, but then it stopped at 20.000%.

MonitorState: {state: RUNNING(20.000% trained), NumValidWindows: (1/1) (100.000%), NumValidPartitions: 150/150 (100.000%), flawedPartitions: 0}
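To watch these numbers over time without reading the line by eye, the state string can be parsed. This is a minimal sketch, not part of Cruise Control: the function name `parse_monitor_state` is hypothetical, and the regular expression assumes the exact text layout of the single line shown above, which may differ between Cruise Control versions.

```python
import re

# Pattern written against the one MonitorState line quoted in this thread.
# The layout is an assumption; other versions may print it differently.
_MONITOR_STATE_RE = re.compile(
    r"state:\s*(?P<state>\w+)\((?P<trained>[\d.]+)% trained\),\s*"
    r"NumValidWindows:\s*\((?P<valid_windows>\d+)/(?P<total_windows>\d+)\).*?"
    r"NumValidPartitions:\s*(?P<valid_partitions>\d+)/(?P<total_partitions>\d+)"
)


def parse_monitor_state(line: str) -> dict:
    """Extract state, trained percentage, and window/partition counts."""
    match = _MONITOR_STATE_RE.search(line)
    if match is None:
        raise ValueError("unrecognized MonitorState line")
    return {
        "state": match.group("state"),
        "trained_percent": float(match.group("trained")),
        "valid_windows": int(match.group("valid_windows")),
        "total_windows": int(match.group("total_windows")),
        "valid_partitions": int(match.group("valid_partitions")),
        "total_partitions": int(match.group("total_partitions")),
    }


if __name__ == "__main__":
    sample = (
        "MonitorState: {state: RUNNING(20.000% trained), "
        "NumValidWindows: (1/1) (100.000%), "
        "NumValidPartitions: 150/150 (100.000%), flawedPartitions: 0}"
    )
    print(parse_monitor_state(sample))
```

Calling this periodically on the text returned by /kafkacruisecontrol/state would show whether the trained percentage is still moving or has plateaued.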

Cruise Control writes the following logs:

[2018-10-23 18:06:29,909] WARN Skip generating broker metric sample for broker 0 because there are not enough topic metrics to generate broker metrics. (com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor)
[2018-10-23 18:06:29,909] INFO Generated 150 partition metric samples and 1 broker metric samples for timestamp 1540289170232 (com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor)
[2018-10-23 18:06:29,909] INFO PARTITION Aggregator rolled out 1 new windows, reset 1 windows, current window range [1540289100000, 1540289400000], abandon 300 samples. (com.linkedin.cruisecontrol.monitor.sampling.aggregator.MetricSampleAggregator)
[2018-10-23 18:06:29,909] INFO Collected 150 partition metric samples for 150 partitions. Total partition assigned: 150. (com.linkedin.kafka.cruisecontrol.monitor.sampling.SamplingFetcher)
[2018-10-23 18:06:29,910] INFO BROKER Aggregator rolled out 1 new windows, reset 1 windows, current window range [1540283400000, 1540289400000], abandon 3 samples. (com.linkedin.cruisecontrol.monitor.sampling.aggregator.MetricSampleAggregator)
[2018-10-23 18:06:29,910] INFO Collected 1 broker metric samples for 1 brokers. (com.linkedin.kafka.cruisecontrol.monitor.sampling.SamplingFetcher)
[2018-10-23 18:06:29,933] INFO Finished sampling in 37 ms. (com.linkedin.kafka.cruisecontrol.monitor.sampling.MetricFetcherManager)
[2018-10-23 18:06:41,345] WARN Utilization for broker ids:[0] is above the balance limit for:disk after rebalance. (com.linkedin.kafka.cruisecontrol.analyzer.goals.ResourceDistributionGoal)
[2018-10-23 18:06:41,345] WARN Utilization for broker ids:[1] is under the balance limit for:disk after rebalance. (com.linkedin.kafka.cruisecontrol.analyzer.goals.ResourceDistributionGoal)
[2018-10-23 18:06:41,348] WARN Utilization for broker ids:[0] is above the balance limit for:networkInbound after rebalance. (com.linkedin.kafka.cruisecontrol.analyzer.goals.ResourceDistributionGoal)
[2018-10-23 18:06:41,349] WARN Utilization for broker ids:[1] is under the balance limit for:networkInbound after rebalance. (com.linkedin.kafka.cruisecontrol.analyzer.goals.ResourceDistributionGoal)
[2018-10-23 18:06:41,354] WARN Utilization for broker ids:[0] is above the balance limit for:networkOutbound after rebalance. (com.linkedin.kafka.cruisecontrol.analyzer.goals.ResourceDistributionGoal)
[2018-10-23 18:06:41,354] WARN Utilization for broker ids:[1] is under the balance limit for:networkOutbound after rebalance. (com.linkedin.kafka.cruisecontrol.analyzer.goals.ResourceDistributionGoal)
[2018-10-23 18:06:41,359] WARN Utilization for broker ids:[0] is above the balance limit for:cpu after rebalance. (com.linkedin.kafka.cruisecontrol.analyzer.goals.ResourceDistributionGoal)
[2018-10-23 18:06:41,360] WARN Utilization for broker ids:[1] is under the balance limit for:cpu after rebalance. (com.linkedin.kafka.cruisecontrol.analyzer.goals.ResourceDistributionGoal)
[2018-10-23 18:06:41,360] WARN There were still 1 brokers over the limit. (com.linkedin.kafka.cruisecontrol.analyzer.goals.LeaderBytesInDistributionGoal)
[2018-10-23 18:06:41,361] INFO Finished the precomputation proposal candidates in 32 ms (com.linkedin.kafka.cruisecontrol.analyzer.GoalOptimizer)
[2018-10-23 18:08:29,894] INFO Kicking off partition metric sampling for time range [1540289189896, 1540289309894], duration 119998 ms using 1 fetchers with timeout 120000 ms. (com.linkedin.kafka.cruisecontrol.monitor.sampling.MetricFetcherManager)
[2018-10-23 18:08:29,904] INFO Finished sampling for topic partitions [__CruiseControlMetrics-0] in time range [1540289189896,1540289309894]. Collected 666 metrics. (com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsReporterSampler)

Cruise Control seems to have noticed that there are two brokers and that they are not balanced. Could someone explain what the trained percentage means?

craynic commented 5 years ago

I can also see the topic __KafkaCruiseControlModelTrainingSamples growing slowly.

craynic commented 5 years ago

This seems very similar to #318.

craynic commented 5 years ago

I deleted the topic __confluent.support.metrics and set confluent.support.metrics.enable=false in the Kafka server.properties. Sampling now looks good. This topic behaves strangely: it cannot be listed and it has no metrics. It is used for Confluent metrics. If you see a warning like

[2018-10-23 14:00:29,919] WARN Broker 0 has 1/9 missing topics metrics and 1/134 missing partition metrics. Missing topics: [__confluent.support.metrics]. (com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor)

you could give this a try.
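For anyone checking their own logs for the same symptom, the warning can be picked out mechanically. This is a minimal sketch: `find_missing_topics` is a hypothetical helper, and the pattern is written against the single warning line quoted above, so it assumes that exact wording.

```python
import re

# Pattern based on the one warning line quoted in this thread; the exact
# wording is an assumption and may differ in other Cruise Control versions.
_MISSING_RE = re.compile(
    r"Broker (?P<broker>\d+) has \d+/\d+ missing topics metrics "
    r"and \d+/\d+ missing partition metrics\. "
    r"Missing topics: \[(?P<topics>[^\]]*)\]"
)


def find_missing_topics(log_lines):
    """Return {broker_id: set of topic names reported as missing metrics}."""
    missing = {}
    for line in log_lines:
        match = _MISSING_RE.search(line)
        if match is None:
            continue
        topics = [t.strip() for t in match.group("topics").split(",") if t.strip()]
        missing.setdefault(int(match.group("broker")), set()).update(topics)
    return missing


if __name__ == "__main__":
    sample = [
        "[2018-10-23 14:00:29,919] WARN Broker 0 has 1/9 missing topics metrics "
        "and 1/134 missing partition metrics. Missing topics: "
        "[__confluent.support.metrics]. "
        "(com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor)"
    ]
    print(find_missing_topics(sample))
```

If the same topic keeps appearing for a broker across sampling rounds, that broker is a likely candidate for the skipped broker samples described in this thread.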

efeg commented 5 years ago

@craynic Do you have any client traffic in or out of the topics in your cluster, i.e. any consumers reading from topics or any producers sending data to them? It looks like there is no activity in your cluster (e.g. maybe this is a quick cluster that you started to test things, but it gets no real traffic); hence, your metrics reporter has little to report. As a result, the metric sampler of Cruise Control gets little to no data from __CruiseControlMetrics. Could you verify that?

You can ignore the value of trained. It has no impact on the current readiness of the goals (we should probably move it to the verbose response of the monitor substate).

craynic commented 5 years ago

@efeg Thanks. I have fixed this problem. The topic __confluent.support.metrics had no metric data for CC, so CC ignored all the data on that broker, while all my traffic was on that broker.