strimzi / strimzi-kafka-operator

Apache Kafka® running on Kubernetes
https://strimzi.io/
Apache License 2.0

Optimize example configurations of exported metrics against our dashboards and alerts #10188

Status: Open. scholzj opened this issue 6 months ago.

scholzj commented 6 months ago

Today, Strimzi provides the following examples for monitoring:

We call these examples because:

But right now, there seems to be a disconnect between the dashboards / alerts and the JMX Prometheus exporter configurations. For example, a small Kafka cluster with only a few topics and clients exports ~230 metric types and over 6000 metrics, and only a small part of them seems to be used in our dashboards. The situation is similar for a small Connect cluster, which exports over 300 metric types and over 1500 metrics.
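To make the distinction between "metric types" and "metrics" concrete, here is a minimal Python sketch that counts both in a scrape in the Prometheus text exposition format. It assumes that each distinct sample name counts as one metric type (so a summary's `_count` and `_sum` samples would be counted separately) and that each sample line is one metric (time series). The function name `count_metrics` is hypothetical, and the metric names and values in `sample` are illustrative, not real output from a Strimzi cluster.

```python
import re

# A sample line starts with the metric name, followed by optional {labels} and the value.
SAMPLE_NAME = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)")


def count_metrics(exposition: str) -> tuple[int, int]:
    """Return (distinct metric names, number of sample lines) in a text scrape."""
    names = set()
    series = 0
    for line in exposition.splitlines():
        line = line.strip()
        # Skip blank lines and the "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_NAME.match(line)
        if match:
            names.add(match.group(1))
            series += 1
    return len(names), series


# Illustrative scrape: two metric names, three time series.
sample = """# HELP kafka_server_replicamanager_leadercount Leader count
# TYPE kafka_server_replicamanager_leadercount gauge
kafka_server_replicamanager_leadercount 3.0
kafka_log_log_size{topic="a",partition="0"} 100.0
kafka_log_log_size{topic="a",partition="1"} 120.0
"""

print(count_metrics(sample))  # (2, 3)
```

Every additional topic or partition label combination adds a new series without adding a new metric type, which is why the number of metrics grows much faster than the number of metric types.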

The number of exported metrics seems to cause several problems:

So I wonder if we should analyze the metrics and export a smaller subset of them in our examples, in general only the metrics used in our dashboards and alerts. In the end, users can easily customize the configuration if they need additional metrics. Also, if we ignore these metrics anyway and don't use them in dashboards or alerts, exporting them just wastes resources, not only in our operands but also in the Prometheus servers etc.
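One possible way to do this analysis is sketched below in Python: collect the PromQL expressions used by the dashboards and alert rules, pull out every identifier they contain, and compare those identifiers with the names of the exported metrics. This is an approximation under stated assumptions: the regex-based extraction is not a real PromQL parser, the expressions are passed in as plain strings (extracting them from the Grafana dashboard JSON or the alert rule files is left out), and the functions `used_metric_names` / `unused_metric_names` as well as the sample metric names are hypothetical.

```python
import re
from collections.abc import Iterable

# Anything that looks like a Prometheus metric name (also matches functions and labels).
IDENT = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def used_metric_names(expressions: Iterable[str], exported: Iterable[str]) -> set[str]:
    """Exported metric names that appear in at least one PromQL expression."""
    referenced = set()
    for expr in expressions:
        referenced.update(IDENT.findall(expr))
    # Intersecting with the exported names drops PromQL functions and label names.
    return set(exported) & referenced


def unused_metric_names(expressions: Iterable[str], exported: Iterable[str]) -> set[str]:
    """Exported metric names that no dashboard or alert expression refers to."""
    expressions = list(expressions)
    exported = set(exported)
    return exported - used_metric_names(expressions, exported)


# Illustrative data, not taken from the real Strimzi dashboards or alerts.
exported = {
    "kafka_server_replicamanager_leadercount",
    "kafka_log_log_size",
    "kafka_network_requestmetrics_totaltimems",
}
exprs = [
    "sum(kafka_server_replicamanager_leadercount) by (pod)",
    'sum(kafka_log_log_size{topic=~".+"}) by (topic)',
]

print(unused_metric_names(exprs, exported))
# {'kafka_network_requestmetrics_totaltimems'}
```

The resulting "unused" set is only a starting point: it could guide which rules to drop from the example exporter configurations, but each candidate would still need a manual check before removal.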

ppatierno commented 5 months ago

Triaged on 13/6/2024: let's keep this open and triage it again on the next call when @scholzj is present, or start the discussion asynchronously here.

scholzj commented 4 months ago

Discussed on the community call on 10.7.2024: This makes sense and we should keep this issue open.