linkedin / cruise-control

Cruise-control is the first of its kind to fully automate the dynamic workload rebalance and self-healing of a Kafka cluster. It provides great value to Kafka users by simplifying the operation of Kafka clusters.
https://github.com/linkedin/cruise-control/tags
BSD 2-Clause "Simplified" License

Weird errors in Cruise Control logs #255

Closed jmarkan closed 5 years ago

jmarkan commented 6 years ago

Hello @TheyDroppedMe / @efeg / @becketqin, I have another issue to report. While going through the Cruise Control logs this morning, I found these entries:

` [2018-06-14 13:18:43,219] WARN Broker 6001 has [bunch of comma-separated topic names here] missing topics metrics and 6 missing partition metrics. (com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor)

<Similar warnings for all remaining 5 brokers>

[2018-06-14 13:18:43,226] WARN Skip generating broker metric sample for broker 6004 because the following metrics are missing [] (com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor)
[2018-06-14 13:18:43,227] WARN Skip generating broker metric sample for broker 6006 because the following metrics are missing [] (com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor)
[2018-06-14 13:18:43,227] WARN Skip generating broker metric sample for broker 6005 because the following metrics are missing [] (com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor)
[2018-06-14 13:18:43,227] WARN Skip generating broker metric sample for broker 6002 because the following metrics are missing [] (com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor)
[2018-06-14 13:18:43,227] WARN Skip generating broker metric sample for broker 6003 because the following metrics are missing [] (com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor)
[2018-06-14 13:18:43,227] WARN Skip generating broker metric sample for broker 6001 because the following metrics are missing [] (com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor)
[2018-06-14 13:18:43,227] INFO Generated 958 partition metric samples and 0(6 skipped) broker metric samples for timestamp 1528982295034 (com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor)
[2018-06-14 13:18:43,230] INFO Collected 958 partition metric samples for 958 partitions. Total partition assigned: 958. (com.linkedin.kafka.cruisecontrol.monitor.sampling.SamplingFetcher)
[2018-06-14 13:18:43,230] INFO Collected 0 broker metric samples for 0 brokers. (com.linkedin.kafka.cruisecontrol.monitor.sampling.SamplingFetcher) `

I checked whether messages are reaching the CC topics, and they are, so I'm not sure what is causing these warnings.

I was planning to test removing a broker with a POST to the API, but I'm reconsidering now.

Could you please suggest any remediation steps to get rid of the above warnings?

Thanks a lot in advance.

jmarkan commented 6 years ago

After looking at https://github.com/linkedin/cruise-control/issues/236, I tried removing "compact" from the cleanup policy on some of the topics, but that didn't make these warnings disappear.

kidkun commented 6 years ago

The warning

[2018-06-14 13:18:43,219] WARN Broker 6001 has [bunch of comma-separated topic names here] missing topics metrics and 6 missing partition metrics.

means that 6 topics ([bunch of comma-separated topic names here]) each have at least one partition whose leader replica is on broker 6001, but none of these partitions reported metrics in the sampling time range (there is a message before this warning like "Finished sampling for time range [{},{}]. Collected {} metrics."). It could be either a reporter issue, where no metric messages are written to the reporter topic ("__CruiseControlMetrics"), or Cruise Control being unable to consume from that topic.

The following warning

[2018-06-14 13:18:43,227] WARN Skip generating broker metric sample for broker 6006 because the following metrics are missing []

is misleading. This message is generated because we do not have enough partition metrics to calculate an accurate broker metric (the issue above), but we do have some raw numbers from the "__CruiseControlMetrics" topic, so the set of missing metrics is empty. We need to distinguish inaccurate values from missing values in the message; there will be a separate patch for this.
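
To see at a glance which brokers are affected, the "Skip generating broker metric sample" warnings can be pulled out of the Cruise Control log with standard text tools. This is only a minimal sketch that assumes the log format shown in this thread; the function name `skipped_brokers` and the log path in the usage line are illustrative and not part of Cruise Control.

```shell
#!/bin/sh
# Print the sorted, de-duplicated IDs of brokers whose broker metric sample
# was skipped. Reads Cruise Control log text on stdin.
# Assumption: the warning wording matches the log lines quoted in this thread.
skipped_brokers() {
  grep -o 'Skip generating broker metric sample for broker [0-9][0-9]*' \
    | awk '{ print $NF }' \
    | sort -n -u
}
```

Usage (the path is a placeholder): `skipped_brokers < /path/to/cruise-control.log`. Because `grep -o` prints each match on its own line, this also works when several log entries have been run together on one line.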

Can you check two things first?

  1. The reporter is properly producing messages to the "__CruiseControlMetrics" topic. You can check the size of the topic's message log (it should still be growing) and its timestamps, or use the kafka-topics.sh script.
  2. Cruise Control is able to consume from the "__CruiseControlMetrics" topic. Please check your ACL settings.
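
Check 1 can be made less subjective by comparing the topic's end offsets at two points in time: if the total grows, the reporter is still producing. The sketch below assumes the `topic:partition:offset` output format of Kafka's `GetOffsetShell` tool. The helper name `sum_end_offsets`, the `KAFKA_ENDPOINT` placeholder, and the 60-second wait are assumptions for illustration, and the exact tool name and flags depend on the Kafka version and distribution.

```shell
#!/bin/sh
# Sum the end offsets found in GetOffsetShell-style output read from stdin.
# Each input line is expected to look like "topic:partition:offset".
sum_end_offsets() {
  awk -F: 'NF >= 3 { total += $NF } END { printf "%.0f\n", total }'
}

# Example against a live cluster (commented out; adjust to your installation):
#   get_offsets() {
#     kafka-run-class kafka.tools.GetOffsetShell --broker-list KAFKA_ENDPOINT \
#       --topic __CruiseControlMetrics --time -1
#   }
#   before=$(get_offsets | sum_end_offsets)
#   sleep 60
#   after=$(get_offsets | sum_end_offsets)
#   [ "$after" -gt "$before" ] && echo "reporter is still producing"
```

A growing total only shows that something is being written. It does not show that every expected metric type is present, and it says nothing about whether Cruise Control itself can consume the topic (check 2).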
jmarkan commented 6 years ago

Hi @kidkun Thanks a lot for your response.

  1. Yes, I checked, and the log's size is growing normally.
  2. I checked via kafka-console-consumer that the __CruiseControlMetrics topic is being written to. As for ACLs, we're not using them in our environment.
jmarkan commented 6 years ago

@kidkun @efeg or @becketqin any suggestions? We keep getting these errors and warnings in the CC logs.

efeg commented 6 years ago

Hi @jmarkan Did you update your copy of cruise-control-metrics-reporter.jar under core/build/dependant-libs-SCALA_VERSION/ after Cruise Control Metrics Reporter started reporting new metrics? (the relevant commit on Mar 16, 2018 -> https://github.com/linkedin/cruise-control/commit/23fc79cb25b2f068b2d480ddddcb4a49962dd66e)

If not, please ensure that you are using the latest cruise-control-metrics-reporter.jar and bounce your broker to ensure that the metrics reporter is producing these metrics to __CruiseControlMetrics.

jmarkan commented 6 years ago

Hi @efeg We're using Confluent Platform's distribution of Kafka, so the jar files are not in the location you mentioned. They're under /usr/share/java/kafka. I validated that the updated jar file is at this location; however, I remember the file name used to be cruise-control-metrics-reporter.jar. In our case, the file name is cruise-control-metrics-reporter-0.1.0-SNAPSHOT.jar. Does this ring a bell?

I did validate via kafka-console-consumer --bootstrap-server KAFKA_ENDPOINT --topic __CruiseControlMetrics that messages are being written to this topic; however, CC does not seem to be able to see them. I also set cleanup.policy for this topic to delete, because the topic doesn't have this config when it gets created.

efeg commented 6 years ago

@jmarkan The location of the file and the fact that messages are being produced to this topic are both fine. I just suspect that if the metrics reporter jar file is outdated, it might not be reporting a subset of metrics that were added in later versions of the metrics reporter.

To ensure that you are using the latest version of the metrics reporter, could you verify that this file, com/linkedin/kafka/cruisecontrol/metricsreporter/metric/RawMetricType.class, exists in the results of the following command:

jar tf /usr/share/java/kafka/cruise-control-metrics-reporter-0.1.0-SNAPSHOT.jar  | less
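
Instead of paging through the listing with `less`, the same check can be scripted so that it gives a yes/no answer on each broker. This is a small sketch: the function name `jar_has_entry` is made up for this example, and it reads a `jar tf` listing on stdin.

```shell
#!/bin/sh
# Succeed (exit status 0) if the jar listing on stdin contains a line that is
# exactly the given entry path; fail (non-zero) otherwise.
jar_has_entry() {
  grep -qxF -- "$1"
}

# Example (commented out; uses the jar path mentioned in this thread):
#   jar tf /usr/share/java/kafka/cruise-control-metrics-reporter-0.1.0-SNAPSHOT.jar \
#     | jar_has_entry com/linkedin/kafka/cruisecontrol/metricsreporter/metric/RawMetricType.class \
#     && echo "RawMetricType.class present" || echo "RawMetricType.class missing"
```

The `-x` and `-F` options make `grep` match the whole line as a fixed string, so a nested class such as `RawMetricType$MetricScope.class` does not count as a match for `RawMetricType.class`.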
jmarkan commented 6 years ago

Hey @efeg Yes it does exist in the result of the jar tf command. Here is the full output from one of our brokers:

META-INF/
META-INF/MANIFEST.MF
com/
com/linkedin/
com/linkedin/kafka/
com/linkedin/kafka/cruisecontrol/
com/linkedin/kafka/cruisecontrol/metricsreporter/
com/linkedin/kafka/cruisecontrol/metricsreporter/CruiseControlMetricsReporter$1.class
com/linkedin/kafka/cruisecontrol/metricsreporter/CruiseControlMetricsReporterConfig.class
com/linkedin/kafka/cruisecontrol/metricsreporter/CruiseControlMetricsReporter.class
com/linkedin/kafka/cruisecontrol/metricsreporter/metric/
com/linkedin/kafka/cruisecontrol/metricsreporter/metric/PartitionMetric.class
**com/linkedin/kafka/cruisecontrol/metricsreporter/metric/RawMetricType.class**
com/linkedin/kafka/cruisecontrol/metricsreporter/metric/YammerMetricProcessor$Context.class
com/linkedin/kafka/cruisecontrol/metricsreporter/metric/YammerMetricProcessor.class
com/linkedin/kafka/cruisecontrol/metricsreporter/metric/CruiseControlMetric.class
com/linkedin/kafka/cruisecontrol/metricsreporter/metric/BrokerMetric.class
com/linkedin/kafka/cruisecontrol/metricsreporter/metric/MetricSerde.class
com/linkedin/kafka/cruisecontrol/metricsreporter/metric/MetricSerde$1.class
com/linkedin/kafka/cruisecontrol/metricsreporter/metric/MetricsUtils.class
com/linkedin/kafka/cruisecontrol/metricsreporter/metric/TopicMetric.class
com/linkedin/kafka/cruisecontrol/metricsreporter/metric/RawMetricType$MetricScope.class
com/linkedin/kafka/cruisecontrol/metricsreporter/metric/CruiseControlMetric$MetricClassId.class
com/linkedin/kafka/cruisecontrol/metricsreporter/exception/
com/linkedin/kafka/cruisecontrol/metricsreporter/exception/UnknownVersionException.class
com/linkedin/kafka/cruisecontrol/metricsreporter/exception/CruiseControlMetricsReporterException.class
LICENSE
NOTICE

Strangely though, I also see these logs on the brokers, which seems to contradict the fact that the CC topic is being populated with messages:

[2018-06-21 18:33:51,816] WARN Got error produce response with correlation id 54 on topic-partition __CruiseControlMetrics-7, retrying (4 attempts left). Error: NOT_LEADER_FOR_PARTITION (org.apache.kafka.clients.producer.internals.Sender)