linkedin / cruise-control

Cruise-control is the first of its kind to fully automate the dynamic workload rebalance and self-healing of a Kafka cluster. It provides great value to Kafka users by simplifying the operation of Kafka clusters.
https://github.com/linkedin/cruise-control/tags
BSD 2-Clause "Simplified" License
2.74k stars 587 forks source link

Prometheus metrics sampler not pickup up on topics with dots #1468

Closed dhuang closed 3 years ago

dhuang commented 3 years ago

I've been testing out the Prometheus metrics sampler and so far it seems to work as expected, but noticed it's failing to pick up a subset of topics. I suspect it has to do with use of dots (.) in the topic names, thus probably related to https://github.com/linkedin/cruise-control/pull/443. Sounds like the default sampler may be doing some implicit conversion of these dots that requires this logic, but I don't think that's happening in the Prometheus sampler, so there's a mismatch later on?

This is the warning I was seeing that clued me in, since all the topics listed actually had dots instead of underscores.

WARN Broker 2 is missing 2/9 topics metrics and 2/53 leader partition metrics. Missing leader topics: [a_topic, another_topic]. (com.linkedin.kafka.cruisecontrol.monitor.sampling.holder.BrokerLoad)

I may also be able to help with fixing or testing, but at least wanted to make sure I'm on the right track here before taking a deeper look.

efeg commented 3 years ago

@dhuang Thanks for reporting the issue and sorry for the delayed response!

This indeed seems likely to be an issue with conversion of topic names with dots in the Prometheus metric sampler. In particular, I suspect that in PrometheusMetricSampler#addPartitionMetrics, before calling addMetricForProcessing, the topic should be converted to a dotHandledTopic, which is missing -- this needs testing.

Please feel free to create a PR with the relevant tests -- we would appreciate your contribution.

baganokodo commented 3 years ago

I experienced the same issue on an AWS MSK cluster. MSK seems only support Prometheus metric sampler according to their docs.

hd-rk commented 3 years ago

Encountered this as well in our deployment. Raised https://github.com/linkedin/cruise-control/pull/1495 with tests for the fix. @efeg can you help take a look? Thanks.