Open scholzj opened 6 months ago
Triaged on 13/6/2024: let's keep this open and triage it again on the next call when @scholzj is here, or start the discussion async here.
Discussed on the community call on 10.7.2024: This makes sense and we should keep this issue.
Today, Strimzi provides the following examples for monitoring:
We call these examples because:
But right now, there seems to be a disconnect between the dashboards / alerts and the JMX Prometheus Exporter configurations. For example, even a small Kafka cluster with only a few topics and clients exports ~230 metric types and over 6000 metrics. Only a small part of that seems to be used in our dashboards. It is similar for a small Connect cluster, with over 300 metric types and over 1500 metrics.
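For anyone who wants to reproduce this kind of count, here is a rough sketch in Python. It assumes you have saved the text output of a metrics endpoint to a string; the metric names in `SAMPLE` are made up for illustration and are not real Strimzi or Kafka metrics. "Metric types" here means distinct metric names, and "metrics" means individual sample lines.

```python
import re
from collections import Counter

# A metric name in the Prometheus text format starts with a letter, '_' or ':'.
NAME = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")

def count_metrics(exposition_text: str) -> tuple[int, int]:
    """Return (distinct metric names, total sample lines)."""
    per_name = Counter()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and '# HELP' / '# TYPE' comment lines.
        if not line or line.startswith("#"):
            continue
        match = NAME.match(line)
        if match:
            per_name[match.group(0)] += 1
    return len(per_name), sum(per_name.values())

# Hypothetical sample, NOT real exporter output.
SAMPLE = """\
# HELP kafka_server_example_total Hypothetical sample
# TYPE kafka_server_example_total counter
kafka_server_example_total{topic="a"} 1
kafka_server_example_total{topic="b"} 2
kafka_other_example 3
"""

types, samples = count_metrics(SAMPLE)
print(f"{types} metric types, {samples} metrics")
```

Histogram and summary suffixes (`_bucket`, `_sum`, `_count`) are counted as separate names in this sketch, so the numbers may differ slightly from other ways of counting.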
The number of exported metrics seems to cause several problems:
So I wonder if we should analyze the metrics and export a smaller subset of them in our examples: in general, only the things used in our dashboards and alerts. In the end, users can easily customize them if they need additional metrics. Also, if we ignore these metrics anyway and don't use them in dashboards or alerts, exporting them just wastes resources, not only in our operands but also in the Prometheus servers etc.
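One possible way to start that analysis is sketched below in Python. It assumes you already have a list of exported metric names and the query expressions pulled out of the dashboards and alert rules; how you extract those is left open. The function names and the example metric names are hypothetical. The matching is a plain token comparison, so it is conservative: a name that only appears inside a label value still counts as used, and selectors such as `{__name__=~"..."}` are not understood.

```python
import re

# Identifier-like tokens: metric names, but also function and label names.
TOKEN = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")

def referenced_names(expressions):
    """Collect every identifier-like token found in the query expressions."""
    tokens = set()
    for expr in expressions:
        tokens.update(TOKEN.findall(expr))
    return tokens

def unused_metrics(exported, expressions):
    """Return exported metric names that no expression mentions, sorted."""
    return sorted(set(exported) - referenced_names(expressions))

# Hypothetical inputs, for illustration only.
exported = ["kafka_server_a_total", "kafka_server_b", "kafka_server_c"]
expressions = [
    'sum(rate(kafka_server_a_total{topic="x"}[5m]))',
    "kafka_server_b > 0",
]

print(unused_metrics(exported, expressions))
```

The resulting list would only be a set of candidates for removal from the example exporter rules; each one would still need a manual check before dropping it.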