linkedin / cruise-control

Cruise-control is the first of its kind to fully automate the dynamic workload rebalance and self-healing of a Kafka cluster. It provides great value to Kafka users by simplifying the operation of Kafka clusters.

Cruise control reporter not recording Kafka metrics #2167

Open BaudoinWR opened 2 weeks ago

BaudoinWR commented 2 weeks ago

I'm running Kafka 3.7.0 with the Cruise Control metrics reporter 2.5.137, and no metrics other than BROKER_CPU_UTIL are reaching the __CruiseControlMetrics topic.

At Kafka startup, the reporter's trace logs show it being initialised and handed a number of KafkaMetrics, but none of them gets registered:

2024-06-20 18:52:27 [2024-06-20 17:52:27,109] TRACE Checking Kafka metric MetricName [name=io-wait-ratio, group=registration-metrics, description=*Deprecated* The fraction of time the I/O thread spent waiting, tags={BrokerId=1}] (com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter)
2024-06-20 18:52:27 [2024-06-20 17:52:27,112] TRACE Checking Kafka metric MetricName [name=successful-authentication-total, group=heartbeat-metrics, description=The total number of connections with successful authentication, tags={BrokerId=1}] (com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter)
2024-06-20 18:52:27 [2024-06-20 17:52:27,112] TRACE Checking Kafka metric MetricName [name=failed-authentication-total, group=heartbeat-metrics, description=The total number of connections with failed authentication, tags={BrokerId=1}] (com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter)
2024-06-20 18:52:27 [2024-06-20 17:52:27,112] TRACE Checking Kafka metric MetricName [name=connection-count, group=alter-partition-metrics, description=The current number of active connections., tags={BrokerId=1}] (com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter)
2024-06-20 18:52:27 [2024-06-20 17:52:27,113] TRACE Checking Kafka metric MetricName [name=failed-authentication-rate, group=forwarding-metrics, description=The number of connections with failed authentication per second, tags={BrokerId=1}] (com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter)
2024-06-20 18:52:27 [2024-06-20 17:52:27,113] TRACE Checking Kafka metric MetricName [name=successful-reauthentication-rate, group=forwarding-metrics, description=The number of successful re-authentication of connections per second, tags={BrokerId=1}] (com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter)
2024-06-20 18:52:27 [2024-06-20 17:52:27,113] TRACE Checking Kafka metric MetricName [name=reauthentication-latency-avg, group=forwarding-metrics, description=The average latency observed due to re-authentication, tags={BrokerId=1}] (com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter)
2024-06-20 18:52:27 [2024-06-20 17:52:27,113] TRACE Checking Kafka metric MetricName [name=incoming-byte-rate, group=registration-metrics, description=The number of bytes read off all sockets per second, tags={BrokerId=1}] (com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter)
2024-06-20 18:52:27 [2024-06-20 17:52:27,113] TRACE Checking Kafka metric MetricName [name=failed-reauthentication-total, group=heartbeat-metrics, description=The total number of failed re-authentication of connections, tags={BrokerId=1}] (com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter)
2024-06-20 18:52:27 [2024-06-20 17:52:27,114] TRACE Checking Kafka metric MetricName [name=failed-reauthentication-rate, group=registration-metrics, description=The number of failed re-authentication of connections per second, tags={BrokerId=1}] (com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter)
2024-06-20 18:52:27 [2024-06-20 17:52:27,114] TRACE Checking Kafka metric MetricName [name=network-io-rate, group=alter-partition-metrics, description=The number of network operations (reads or writes) on all connections per second, tags={BrokerId=1}] (com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter)
2024-06-20 18:52:27 [2024-06-20 17:52:27,114] TRACE Checking Kafka metric MetricName [name=network-io-total, group=socket-server-metrics, description=The total number of network operations (reads or writes) on all connections, tags={listener=PLAINTEXT, networkProcessor=0}] (com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter)
2024-06-20 18:52:27 [2024-06-20 17:52:27,114] TRACE Checking Kafka metric MetricName [name=iotime-total, group=socket-server-metrics, description=*Deprecated* The total time the I/O thread spent doing I/O, tags={listener=PLAINTEXT, networkProcessor=0}] (com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter)
2024-06-20 18:52:27 [2024-06-20 17:52:27,114] TRACE Checking Kafka metric MetricName [name=network-io-total, group=socket-server-metrics, description=The total number of network operations (reads or writes) on all connections, tags={listener=PLAINTEXT, networkProcessor=2}] (com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter)
.............
2024-06-20 18:52:27 [2024-06-20 17:52:27,238] INFO Added 0 Kafka metrics for Cruise Control metrics during initialization. (com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter)
2024-06-20 18:52:27 [2024-06-20 17:52:27,242] INFO KafkaYammerMetrics not found. Metrics will be used. (com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter)
2024-06-20 18:52:27 [2024-06-20 17:52:27,242] INFO Starting Cruise Control metrics reporter with reporting interval of 60000 ms. (com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter)
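
To get an overview of what the reporter actually inspected, the trace output above can be tallied by metric group. The following is a minimal sketch, assuming the log format stays exactly as in this excerpt; the function name summarize_reporter_log and the regular expressions are my own and not part of Cruise Control, and the sample lines are abbreviated from the log above.

```python
import re
from collections import Counter

# Patterns derived from the log excerpt in this issue (an assumption, not a
# documented format): one for the TRACE "Checking" lines, one for the INFO
# summary line printed after initialization.
CHECK_RE = re.compile(
    r"Checking Kafka metric MetricName \[name=([^,\]]+), group=([^,\]]+)"
)
ADDED_RE = re.compile(r"Added (\d+) Kafka metrics for Cruise Control metrics")


def summarize_reporter_log(lines):
    """Return (Counter of metric groups checked, metrics added or None)."""
    groups = Counter()
    added = None
    for line in lines:
        match = CHECK_RE.search(line)
        if match:
            groups[match.group(2)] += 1
            continue
        match = ADDED_RE.search(line)
        if match:
            added = int(match.group(1))
    return groups, added


# Sample lines abbreviated from the excerpt above (trailing logger name dropped).
sample = [
    "[2024-06-20 17:52:27,109] TRACE Checking Kafka metric MetricName "
    "[name=io-wait-ratio, group=registration-metrics, description=*Deprecated* "
    "The fraction of time the I/O thread spent waiting, tags={BrokerId=1}]",
    "[2024-06-20 17:52:27,114] TRACE Checking Kafka metric MetricName "
    "[name=network-io-total, group=socket-server-metrics, description=The total "
    "number of network operations (reads or writes) on all connections, "
    "tags={listener=PLAINTEXT, networkProcessor=0}]",
    "[2024-06-20 17:52:27,238] INFO Added 0 Kafka metrics for Cruise Control "
    "metrics during initialization.",
]

groups, added = summarize_reporter_log(sample)
for group, count in sorted(groups.items()):
    print(f"{group}: {count} checked")
print(f"added during initialization: {added}")
```

Running this over the full broker log instead of the sample list would show which metric groups were offered to the reporter while none were added.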

While the broker is running, more trace log lines containing "Checking Kafka metric" show up.

The reporter never picks up any new metrics to report and only ends up sending the CPU utilization metric:

2024-06-21 13:39:41 [2024-06-21 12:39:41,625] DEBUG Reporting metrics for time 1718973581625. (com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter)
2024-06-21 13:39:41 [2024-06-21 12:39:41,626] DEBUG Reporting yammer metrics. (com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter)
2024-06-21 13:39:41 [2024-06-21 12:39:41,626] DEBUG Finished reporting yammer metrics. (com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter)
2024-06-21 13:39:41 [2024-06-21 12:39:41,626] DEBUG Reporting KafkaMetrics. [] (com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter)
2024-06-21 13:39:41 [2024-06-21 12:39:41,626] DEBUG Finished reporting KafkaMetrics. (com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter)
2024-06-21 13:39:41 [2024-06-21 12:39:41,626] DEBUG Reporting CPU util. (com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter)
2024-06-21 13:39:41 [2024-06-21 12:39:41,627] DEBUG Sending Cruise Control metric [BROKER_METRIC,BROKER_CPU_UTIL,time=1718973581625,brokerId=1,value=0.004]. (com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter)
2024-06-21 13:39:41 [2024-06-21 12:39:41,628] DEBUG Finished reporting CPU util. (com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter)
2024-06-21 13:39:41 [2024-06-21 12:39:41,633] DEBUG Reporting finished for time 1718973581625 in 8 ms. Next reporting time 1718973641625 (com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter)

Reading from the __CruiseControlMetrics topic shows that only CPU utilization gets reported:

$ /opt/kafka/bin/kafka-console-consumer.sh --topic __CruiseControlMetrics --from-beginning --bootstrap-server localhost:9092 --value-deserializer "com.linkedin.kafka.cruisecontrol.metricsreporter.metric.MetricSerde"
[BROKER_METRIC,BROKER_CPU_UTIL,time=1718955277007,brokerId=1,value=0.003]
[BROKER_METRIC,BROKER_CPU_UTIL,time=1718955337012,brokerId=1,value=0.003]
[BROKER_METRIC,BROKER_CPU_UTIL,time=1718955397019,brokerId=1,value=0.003]
[BROKER_METRIC,BROKER_CPU_UTIL,time=1718955457023,brokerId=1,value=0.003]
[BROKER_METRIC,BROKER_CPU_UTIL,time=1718955517029,brokerId=1,value=0.003]
[BROKER_METRIC,BROKER_CPU_UTIL,time=1718955577033,brokerId=1,value=0.003]
[BROKER_METRIC,BROKER_CPU_UTIL,time=1718955637036,brokerId=1,value=0.003]
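
The same check can be done programmatically by counting metric types in the consumer output. This is a rough sketch that parses only the printed text form shown above, not the serialized records; parse_metric_line and count_metric_types are hypothetical helper names, and the dictionary keys "class" and "type" are labels I chose for the first two fields.

```python
from collections import Counter


def parse_metric_line(line):
    """Parse '[CLASS,TYPE,key=value,...]' into a dict, or return None."""
    line = line.strip()
    if not (line.startswith("[") and line.endswith("]")):
        return None
    parts = line[1:-1].split(",")
    if len(parts) < 2:
        return None
    # "class" and "type" are illustrative names for the two leading fields.
    record = {"class": parts[0], "type": parts[1]}
    for part in parts[2:]:
        key, sep, value = part.partition("=")
        if sep:
            record[key] = value
    return record


def count_metric_types(lines):
    """Count how often each metric type appears in the consumer output."""
    counts = Counter()
    for line in lines:
        record = parse_metric_line(line)
        if record is not None:
            counts[record["type"]] += 1
    return counts


# Two lines copied from the consumer output above.
sample = [
    "[BROKER_METRIC,BROKER_CPU_UTIL,time=1718955277007,brokerId=1,value=0.003]",
    "[BROKER_METRIC,BROKER_CPU_UTIL,time=1718955337012,brokerId=1,value=0.003]",
]

counts = count_metric_types(sample)
for metric_type, count in counts.items():
    print(f"{metric_type}: {count}")
```

If any other metric type ever reached the topic, it would appear as an additional key in the counts.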

This is with the Kafka 3.7.0 Docker image, where the only modifications are adding the Cruise Control reporter library and the following config line in server.properties:

metric.reporters=com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter
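
To rule out a typo in that line, the properties file can be checked with a few lines of code. This is only a sketch: it handles plain key=value lines and comments, not the full Java properties syntax, and parse_properties and reporter_enabled are hypothetical helper names. The example text stands in for the real server.properties.

```python
REPORTER = (
    "com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter"
)


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and comment lines."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or line.startswith("!"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def reporter_enabled(props, class_name=REPORTER):
    """Return True if class_name is listed in metric.reporters."""
    # metric.reporters is a comma-separated list of class names.
    entries = [entry.strip() for entry in props.get("metric.reporters", "").split(",")]
    return class_name in entries


# Illustrative fragment; in practice, read the broker's server.properties.
example = """
# reporter added for Cruise Control
metric.reporters=com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter
"""

props = parse_properties(example)
enabled = reporter_enabled(props)
print(f"reporter configured: {enabled}")
```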

Can anyone confirm whether additional configuration is needed on the Kafka side to get the metrics to show up in the __CruiseControlMetrics topic?