Optimize example configurations of exported metrics against our dashboards and alerts

scholzj commented 6 months ago

Today, Strimzi provides the following examples for monitoring:

Example configurations of the metric exports for Prometheus
Example Grafana dashboards
Example alerts

We call these examples because:

They are not the only way to do this, and there is no single correct way to monitor Kafka.
They are customizable and users are encouraged to tailor them to their own needs.

But right now, there seems to be a disconnect between the dashboards / alerts and the JMX Prometheus Exporter configurations. For example, a small Kafka metric set for a small cluster with only a few topics and clients has ~230 metric types and over 6000 metrics. Only a small part of that seems to be used in our dashboards. It is similar for a small Connect cluster, with over 300 metric types and over 1500 metrics.
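Numbers like these can be reproduced from a single scrape of the exporter endpoint. The following is a minimal Python sketch, assuming that "metric types" means distinct metric names and "metrics" means individual sample lines in the Prometheus text exposition format. The function name `count_metrics` and the sample metric names are hypothetical and not part of Strimzi.

```python
import re
from collections import Counter

# A metric name in the Prometheus text format starts the sample line and
# ends at "{" (labels) or whitespace (value).
_NAME = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)")


def count_metrics(exposition_text):
    """Return (distinct metric names, total sample lines) for one scrape."""
    per_name = Counter()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = _NAME.match(line)
        if match:
            per_name[match.group(1)] += 1
    return len(per_name), sum(per_name.values())


# Hypothetical scrape output, only for illustration.
SAMPLE = """\
# HELP example_requests_total Hypothetical counter.
# TYPE example_requests_total counter
example_requests_total{topic="a"} 5
example_requests_total{topic="b"} 7
example_queue_size 3
"""

names, samples = count_metrics(SAMPLE)
print(f"{names} metric names, {samples} samples")
```

Running this against the text returned by a broker's metrics endpoint, instead of `SAMPLE`, would give a rough per-pod count to compare before and after trimming the export rules.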

The number of exported metrics seems to cause several problems:

To what extent are they considered an API? In most cases, we are likely to notice when a Grafana dashboard stops working. However, we will usually not notice changes to metrics that the dashboards do not use. While we might not consider them an API and users can easily fix the exports, it seems to create unnecessary issues.
It creates maintenance effort, such as in #10184, where seeing and addressing the changes is much harder with thousands of metrics.
We do not know the meaning of most of the metrics anyway. So it is sometimes hard to argue whether a metric should really be a counter or a gauge, etc. Similarly, we export them, but we are not really able to use them or advise our users on how to use them.
Complaints about performance issues for the JMX Exporter are quite common. While the new 1.0 version promises some performance improvements, maybe exporting a smaller set of metrics out of the box might be helpful here as well?

So I wonder if we should analyze the metrics and export a smaller subset of them in our examples: in general, only the things used in our dashboards and alerts. In the end, users can easily customize them if they need additional metrics. Also, if we ignore the metrics anyway and don't use them in dashboards or alerts, exporting them seems to just waste resources, not only in our operands but also in Prometheus servers etc.
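One way to start such an analysis is to list the exported metric names that no dashboard query refers to. The Python sketch below assumes that dashboard queries live in string fields named `expr` inside the Grafana dashboard JSON, and it treats every identifier in a query as a possible metric name instead of parsing PromQL properly. The helper names and the metric names in the example are hypothetical; alert rules would need the same treatment and are not covered here.

```python
import re

# Any identifier that could be a metric name; this also matches PromQL
# function and label names, which is harmless because we only subtract
# the result from the set of exported names.
_IDENT = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def collect_exprs(node):
    """Recursively collect all string values stored under an "expr" key."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "expr" and isinstance(value, str):
                found.append(value)
            else:
                found.extend(collect_exprs(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_exprs(item))
    return found


def unused_metrics(exported_names, dashboard):
    """Return exported metric names that no dashboard query mentions."""
    referenced = set()
    for expr in collect_exprs(dashboard):
        referenced.update(_IDENT.findall(expr))
    return sorted(set(exported_names) - referenced)


# Hypothetical dashboard and exported names, only for illustration.
DASHBOARD = {
    "panels": [
        {"targets": [{"expr": "sum(rate(example_bytes_in_total[5m]))"}]},
        {"panels": [{"targets": [{"expr": "example_queue_size > 10"}]}]},
    ]
}
EXPORTED = ["example_bytes_in_total", "example_queue_size", "example_unused_gauge"]

print(unused_metrics(EXPORTED, DASHBOARD))
```

In practice the dashboard dictionary would come from `json.load` on each example dashboard file, and the exported names from a scrape of the running cluster. The output is only a list of candidates for removal, since a metric can be referenced indirectly, for example through a recording rule.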

ppatierno commented 5 months ago

Triaged on 13/6/2024: let's keep this open and triage it again on the next call when @scholzj is here, or start the discussion async here.

scholzj commented 4 months ago

Discussed on the community call on 10.7.2024: This makes sense and we should keep this issue.

strimzi / strimzi-kafka-operator

Optimize example configurations of exported metrics against our dashboards and alerts #10188