linkedin / cruise-control

Cruise-control is the first of its kind to fully automate the dynamic workload rebalance and self-healing of a Kafka cluster. It provides great value to Kafka users by simplifying the operation of Kafka clusters.
https://github.com/linkedin/cruise-control/tags
BSD 2-Clause "Simplified" License

Missing metrics in topics #236

Closed: jlisam closed this issue 6 years ago

jlisam commented 6 years ago

After starting up Cruise Control, I get the following errors after every metric fetch:

[2018-05-24 21:02:12,184] ERROR Error building partition metric sample for __KafkaCruiseControlPartitionMetricSamples-20 (com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor)
java.lang.IllegalArgumentException: Broker metric ALL_TOPIC_REPLICATION_BYTES_OUT does not exist.
    at com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor$BrokerLoad.brokerMetric(CruiseControlMetricsProcessor.java:374)
    at com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor$BrokerLoad.access$300(CruiseControlMetricsProcessor.java:296)
    at com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor.buildPartitionMetricSample(CruiseControlMetricsProcessor.java:255)
    at com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor.addPartitionMetricSamples(CruiseControlMetricsProcessor.java:126)
    at com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor.process(CruiseControlMetricsProcessor.java:87)
    at com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsReporterSampler.getSamples(CruiseControlMetricsReporterSampler.java:125)
    at com.linkedin.kafka.cruisecontrol.monitor.sampling.SamplingFetcher.fetchSamples(SamplingFetcher.java:105)
    at com.linkedin.kafka.cruisecontrol.monitor.sampling.SamplingFetcher.fetchMetricsForAssignedPartitions(SamplingFetcher.java:85)
    at com.linkedin.kafka.cruisecontrol.monitor.sampling.MetricFetcher.call(MetricFetcher.java:24)
    at com.linkedin.kafka.cruisecontrol.monitor.sampling.MetricFetcher.call(MetricFetcher.java:16)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
[2018-05-24 21:02:12,184] ERROR Error building partition metric sample for __consumer_offsets-2 (com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor)
java.lang.IllegalArgumentException: Broker metric ALL_TOPIC_REPLICATION_BYTES_OUT does not exist.
    at com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor$BrokerLoad.brokerMetric(CruiseControlMetricsProcessor.java:374)
    at com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor$BrokerLoad.access$300(CruiseControlMetricsProcessor.java:296)
    at com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor.buildPartitionMetricSample(CruiseControlMetricsProcessor.java:255)
    at com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor.addPartitionMetricSamples(CruiseControlMetricsProcessor.java:126)
    at com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor.process(CruiseControlMetricsProcessor.java:87)
    at com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsReporterSampler.getSamples(CruiseControlMetricsReporterSampler.java:125)
    at com.linkedin.kafka.cruisecontrol.monitor.sampling.SamplingFetcher.fetchSamples(SamplingFetcher.java:105)
    at com.linkedin.kafka.cruisecontrol.monitor.sampling.SamplingFetcher.fetchMetricsForAssignedPartitions(SamplingFetcher.java:85)
    at com.linkedin.kafka.cruisecontrol.monitor.sampling.MetricFetcher.call(MetricFetcher.java:24)
    at com.linkedin.kafka.cruisecontrol.monitor.sampling.MetricFetcher.call(MetricFetcher.java:16)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

[2018-05-24 21:02:12,185] WARN Skip generating broker metric sample for broker 1001 because the following metrics are missing [ALL_TOPIC_REPLICATION_BYTES_IN, ALL_TOPIC_REPLICATION_BYTES_OUT] (com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor)
[2018-05-24 21:02:12,185] WARN Skip generating broker metric sample for broker 1002 because the following metrics are missing [ALL_TOPIC_REPLICATION_BYTES_IN, ALL_TOPIC_REPLICATION_BYTES_OUT] (com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor)
[2018-05-24 21:02:12,185] WARN Skip generating broker metric sample for broker 1003 because the following metrics are missing [ALL_TOPIC_REPLICATION_BYTES_IN, ALL_TOPIC_REPLICATION_BYTES_OUT] (com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor)
[2018-05-24 21:02:12,185] INFO Generated 0(122 skipped) partition metric samples and 0(3 skipped) broker metric samples for timestamp 1527195696799 (com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor)
[2018-05-24 21:02:12,185] INFO Collected 0 partition metric samples for 0 partitions. Total partition assigned: 122. (com.linkedin.kafka.cruisecontrol.monitor.sampling.SamplingFetcher)
[2018-05-24 21:02:12,185] INFO Collected 0 broker metric samples for 0 brokers. (com.linkedin.kafka.cruisecontrol.monitor.sampling.SamplingFetcher)
[2018-05-24 21:02:12,185] INFO Finished sampling in 264 ms. (com.linkedin.kafka.cruisecontrol.monitor.sampling.MetricFetcherManager)
[2018-05-24 21:02:12,536] INFO Skipping best proposal precomputing because load monitor does not have enough snapshots. (com.linkedin.kafka.cruisecontrol.analyzer.GoalOptimizer)
[2018-05-24 21:02:42,537] INFO Skipping best proposal precomputing because load monitor does not have enough snapshots. (com.linkedin.kafka.cruisecontrol.analyzer.GoalOptimizer)
[2018-05-24 21:03:12,537] INFO Skipping best proposal precomputing because load monitor does not have enough snapshots. (com.linkedin.kafka.cruisecontrol.analyzer.GoalOptimizer)
[2018-05-24 21:03:42,537] INFO Skipping best proposal precomputing because load monitor does not have enough snapshots. (com.linkedin.kafka.cruisecontrol.analyzer.GoalOptimizer)
[2018-05-24 21:04:11,921] INFO Kicking off sampling for time range [1527195731921, 1527195851921], duration 120000 ms using 1 fetchers with timeout 120000 ms. (com.linkedin.kafka.cruisecontrol.monitor.sampling.MetricFetcherManager)
[2018-05-24 21:04:12,176] INFO Finished sampling for time range [1527195731921,1527195851921]. Collected 1012 metrics. (com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsReporterSampler)
[2018-05-24 21:04:12,176] WARN Broker 1001 has [s****] missing topics metrics and 1 missing partition metrics. (com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor)
[2018-05-24 21:04:12,176] WARN Broker 1002 has [s****] missing topics metrics and 2 missing partition metrics. (com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor)
[2018-05-24 21:04:12,176] WARN Broker 1003 has [s****] missing topics metrics and 1 missing partition metrics. (com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor)
[2018-05-24 21:04:12,177] ERROR Error building partition metric sample for __consumer_offsets-13 (com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor)

I am not entirely sure what's going on (I probably need to read the code for a more in-depth understanding). These errors appear for every existing topic in the cluster (3 brokers). Happy to provide more info if needed. Thank you so much!
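When triaging a log like the one above, it can help to first reduce the repeated WARN lines to a per-broker summary of which metrics are reported missing. The following is a minimal sketch, assuming the log lines follow exactly the "Skip generating broker metric sample" wording shown above; the function name `missing_metrics_by_broker` and the `sample_lines` list are hypothetical and are not part of Cruise Control.

```python
import re
from collections import defaultdict

# Matches the WARN lines of the form seen in the log above:
# "Skip generating broker metric sample for broker <id> because the
#  following metrics are missing [<metric>, <metric>, ...]"
SKIP_RE = re.compile(
    r"Skip generating broker metric sample for broker (\d+) "
    r"because the following metrics are missing \[([^\]]*)\]"
)


def missing_metrics_by_broker(log_lines):
    """Return a dict mapping broker id -> set of metric names reported missing."""
    missing = defaultdict(set)
    for line in log_lines:
        match = SKIP_RE.search(line)
        if match is None:
            continue  # not a "skip broker sample" line
        broker_id = int(match.group(1))
        names = [name.strip() for name in match.group(2).split(",") if name.strip()]
        missing[broker_id].update(names)
    return dict(missing)


# Hypothetical input: two lines in the format of the log above, shortened.
sample_lines = [
    "[2018-05-24 21:02:12,185] WARN Skip generating broker metric sample for broker 1001 "
    "because the following metrics are missing "
    "[ALL_TOPIC_REPLICATION_BYTES_IN, ALL_TOPIC_REPLICATION_BYTES_OUT] (...)",
    "[2018-05-24 21:02:12,185] INFO Finished sampling in 264 ms. (...)",
]

summary = missing_metrics_by_broker(sample_lines)
for broker_id, metrics in sorted(summary.items()):
    print(broker_id, sorted(metrics))
```

If every broker reports the same missing metrics, as in the log above, the cause is more likely cluster-wide (configuration or topic setup) than a problem with a single broker.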

jlisam commented 6 years ago

I think I found out why: it's the compact cleanup policy (cleanup.policy=compact). I had old topics that were configured that way and were never excluded. I haven't spent much time investigating, but let me know if this makes sense.
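To check which topics fall into this category, one option is to collect each topic's configuration and filter on `cleanup.policy`. The sketch below assumes the configs have already been fetched into a plain dict (for example via an admin client or the `kafka-configs` tool); the helper `compacted_topics` and the topic names other than `__consumer_offsets` are hypothetical. It treats a missing `cleanup.policy` as `delete`, which is Kafka's default, and counts the combined `compact,delete` policy as compacted.

```python
def compacted_topics(topic_configs):
    """Return sorted names of topics whose cleanup.policy includes 'compact'.

    topic_configs maps topic name -> dict of config key -> value.
    """
    result = []
    for topic, config in topic_configs.items():
        # cleanup.policy may be a comma-separated list, e.g. "compact,delete".
        raw_policy = config.get("cleanup.policy", "delete")
        policies = {part.strip() for part in raw_policy.split(",")}
        if "compact" in policies:
            result.append(topic)
    return sorted(result)


# Hypothetical configs for illustration only.
example_configs = {
    "__consumer_offsets": {"cleanup.policy": "compact"},
    "orders-changelog": {"cleanup.policy": "compact"},
    "audit-log": {"cleanup.policy": "compact,delete"},
    "clickstream": {"cleanup.policy": "delete"},
    "metrics-raw": {},  # no explicit policy, falls back to "delete"
}

print(compacted_topics(example_configs))
```

The resulting list is a starting point for deciding which topics to exclude or reconfigure; whether exclusion resolves the missing metrics in a given setup is not confirmed by this thread beyond the author's own observation.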

jlisam commented 6 years ago

Closing this. Turns out it's because of the compacted topics.