strimzi / metrics-reporter

Prometheus Metrics Reporter for Apache Kafka server and client components
Apache License 2.0
5 stars 10 forks source link

Apply allowlist filter on metricnames with labels #5

Open mimaison opened 10 months ago

mimaison commented 10 months ago

At the moment, the allowlist is applied on the metric names without the labels. We should allow filtering metrics with specific labels.

mimaison commented 5 months ago

Currently the allowlist configuration is just applied on the metric names and ignores labels. I think it's important to allow filtering on labels too.

The main use cases where this is useful are:

The allowlist configuration accepts a list of regex patterns that are matched against the metrics in the Prometheus format. In the Prometheus format, labels are appended after the metric name between curly brackets, and separated by commas. For example:

kafka_network_requestmetrics_localtimems{request="CreateDelegationToken",quantile="0.999",}

This brings some challenges if we want to also filter labels:

With these issues, I see 2 potential implementations that solve all issues:

  1. Treat curly bracket in a special way and exclude them from the regex logic. For example, if an entry in the allowlist is some_metric_name{mylabel="my-value"}, the reporter will first extract and parse the text in between curly brackets. It will then only apply some_metric_name as a regex to filter metrics, and finally only keep metrics that have the labels provided.

    • Pros: Configuration stays inline. This will simplify making it dynamically reconfigurable at runtime in the future.
    • Cons: Users are not able to use {} for regex matching.
  2. Change the way users provide their metrics filter. One option is to do it via a configuration file. For example in YAML:

    - some_metric_name
      labels:
        - mylabel: "my-value"
    • Pros: Users can use complex regex including {}. YAML allows pretty much any characters if a key is surrounded by quotes.
    • Cons: Need to provide an additional file.

@k-wall @OwenCorrigan76 @scholzj @tombentley

antonio-pedro99 commented 5 months ago

@mimaison I think [2] would be enough to handle also other label-matching operators. Therefore, [2] would be the way to go. But few things to take into consideration if we are extracting and then parsing the text inside the {}:

mimaison commented 5 months ago

The goal is not to re-implement PromQL. We just need a basic filtering mechanism that supports labels too. I think we should really keep it simple as I see the extreme configurability of jmx_exporter one of its drawbacks. It allows writing filters that are hard to understand and ultimately expensive to compute making it pretty slow if you enable too many metrics.

k-wall commented 5 months ago

Do we really need this feature in the reporter? Do we really need it right now? Prometheus already has the ability to drop metrics at ingress time, so if the volume of metrics coming from the reporter was truly bothersome for a user, they'd already have that knob to twiddle. (Aside: doesn't this go beyond the scope of the original Strimzi Proposal?)

I'd say let's wait on this one (YAGNI). If we get user feedback saying this is important, then we can think then with a fresh proposal.

antonio-pedro99 commented 5 months ago

The goal is not to re-implement PromQL. We just need a basic filtering mechanism that supports labels too. I think we should really keep it simple as I see the extreme configurability of jmx_exporter one of its drawbacks. It allows writing filters that are hard to understand and ultimately expensive to compute making it pretty slow if you enable too many metrics.

I agree with you.

antonio-pedro99 commented 5 months ago

@mimaison what's the update of this? Are we following @k-wall's suggestion? Otherwise it would be interesting to pickup this issue

mimaison commented 5 months ago

@antonio-pedro99 Thanks for volunteering, however the first step is to decide what to do. If you have opinions on the various points being discussed, please share them.

mimaison commented 5 months ago

Following the update of the Prometheus client https://github.com/strimzi/metrics-reporter/commit/4c1fd9e85774f64a4df406062d124269e7093e50, a trailing commas is no longer injected after the last label.

I think it would be worth checking whether we could change the regex separator (maybe to ;) and see how a filter with labels (so escaping the various conflicting characters) would look like.

mimaison commented 4 months ago

I spent some time investigating our various options. The "best" (least worst?) solution I found was to allow configuring the separator for the allowlist and have logic to mimic the Prometheus metric name serialization to compute the expected final name with labels.

Here is a PoC: https://github.com/strimzi/metrics-reporter/compare/main...mimaison:metrics-reporter:label_filter?expand=1

This allows users to include labels in their allowlist. For example using the following allowlist:

prometheus.metrics.reporter.allowlist=kafka_server_kafka_network_requestmetrics_totaltimems\\{request="(Produce|Fetch)".*

I only get these metrics:

# TYPE kafka_server_kafka_network_requestmetrics_totaltimems summary
kafka_server_kafka_network_requestmetrics_totaltimems{request="Fetch",quantile="0.5"} 0.0
kafka_server_kafka_network_requestmetrics_totaltimems{request="Fetch",quantile="0.75"} 0.0
kafka_server_kafka_network_requestmetrics_totaltimems{request="Fetch",quantile="0.95"} 0.0
kafka_server_kafka_network_requestmetrics_totaltimems{request="Fetch",quantile="0.98"} 0.0
kafka_server_kafka_network_requestmetrics_totaltimems{request="Fetch",quantile="0.99"} 0.0
kafka_server_kafka_network_requestmetrics_totaltimems{request="Fetch",quantile="0.999"} 0.0
kafka_server_kafka_network_requestmetrics_totaltimems_count{request="Fetch"} 0
kafka_server_kafka_network_requestmetrics_totaltimems_sum{request="Fetch"} 0.0
kafka_server_kafka_network_requestmetrics_totaltimems{request="Produce",quantile="0.5"} 0.0
kafka_server_kafka_network_requestmetrics_totaltimems{request="Produce",quantile="0.75"} 0.0
kafka_server_kafka_network_requestmetrics_totaltimems{request="Produce",quantile="0.95"} 0.0
kafka_server_kafka_network_requestmetrics_totaltimems{request="Produce",quantile="0.98"} 0.0
kafka_server_kafka_network_requestmetrics_totaltimems{request="Produce",quantile="0.99"} 0.0
kafka_server_kafka_network_requestmetrics_totaltimems{request="Produce",quantile="0.999"} 0.0
kafka_server_kafka_network_requestmetrics_totaltimems_count{request="Produce"} 0
kafka_server_kafka_network_requestmetrics_totaltimems_sum{request="Produce"} 0.0

In the end, it's ok but not a great. The main issues are:

Also while doing this work I identified https://github.com/strimzi/metrics-reporter/issues/30

k-wall commented 4 months ago

prometheus.metrics.reporter.allowlist=kafka_server_kafka_network_requestmetrics_totaltimems\{request="(Produce|Fetch)".*

is label order in the serialised form of the metric something that's guaranteed?

mimaison commented 4 months ago

Labels are sorted by name. This is mentioned in the javadoc: https://prometheus.github.io/client_java/api/io/prometheus/metrics/model/snapshots/Labels.html

I assume the exposition formats preserve that.