strimzi / strimzi-kafka-operator

Apache Kafka® running on Kubernetes
https://strimzi.io/
Apache License 2.0

[Bug]: Alertmanager rules fire for "There is no messages in topic" for 10 minutes for cruisecontrol and other built-in topics #9264

Closed jbnjohnathan closed 11 months ago

jbnjohnathan commented 11 months ago

Bug Description

When applying the default Alertmanager rules from https://github.com/strimzi/strimzi-kafka-operator/blob/main/examples/metrics/prometheus-install/prometheus-rules.yaml, there are a lot of false positives for the topics created by Strimzi. For example:

There is no messages in topic __strimzi_store_topic/partition 0 for 10 minutes
There is no messages in topic __strimzi-topic-operator-kstreams-topic-store-changelog/partition 0 for 10 minutes
There is no messages in topic strimzi.cruisecontrol.partitionmetricsamples/partition 11 for 10 minutes
There is no messages in topic strimzi.cruisecontrol.partitionmetricsamples/partition 4 for 10 minutes
There is no messages in topic strimzi.cruisecontrol.partitionmetricsamples/partition 0 for 10 minutes
... etc
There is no messages in topic strimzi.cruisecontrol.modeltrainingsamples/partition 21 for 10 minutes
There is no messages in topic strimzi.cruisecontrol.modeltrainingsamples/partition 15 for 10 minutes
... etc

Steps to reproduce

  1. Deploy Kafka with Strimzi
  2. Apply the Prometheus rules from https://github.com/strimzi/strimzi-kafka-operator/blob/main/examples/metrics/prometheus-install/prometheus-rules.yaml
  3. Check the alerts in Alertmanager

Expected behavior

If some built-in topics are not expected to be written to regularly, they should be excluded from the Prometheus rules, just like the topics matching __consumer_offsets are now.

Strimzi version

0.35.1

Kubernetes version

v1.25.4

Installation method

Helm

Infrastructure

OpenShift

Configuration files and logs

No response

Additional context

No response

scholzj commented 11 months ago

Discussed on the Community call on 19.10.2023: The Prometheus Alertmanager rules are provided as examples. You can modify them in any way, or remove / disable the rule if you want.
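
For anyone who wants to adapt the rule locally, a modified copy might look roughly like the fragment below. This is only a sketch, not the upstream default: it assumes the example rule is named `NoMessageForTooLong` and is built on the `kafka_log_log_size` metric with a `topic` label, so check your copy of prometheus-rules.yaml before reusing it. The topic patterns are taken from the alerts listed in this issue.

```yaml
# Sketch of a locally modified rule entry (one item under `rules:`).
# Assumptions: alert name and the kafka_log_log_size metric match your
# copy of the example rules; adjust if they differ.
- alert: NoMessageForTooLong
  # Prometheus regex matchers are fully anchored, so each alternative
  # has to match the whole topic name. The doubled backslash is needed
  # because PromQL double-quoted strings treat "\" as an escape character.
  expr: >-
    changes(kafka_log_log_size{topic!~"__consumer_offsets|__strimzi.*|strimzi\\.cruisecontrol\\..*"}[10m]) == 0
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: 'No message for 10 minutes'
    description: 'There is no messages in topic {{ $labels.topic }}/partition {{ $labels.partition }} for 10 minutes'
```

Whether excluding these topics is appropriate depends on the deployment; removing the rule entry altogether is the other option mentioned above.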

scholzj commented 11 months ago

Discussed on the Community call on 19.10.2023 for the second time: After further discussion, it seems it does not make sense to exclude only __consumer_offsets and not, for example, transaction state. We should either exclude all of Kafka's internal topics (consumer offsets, transaction state) or keep all of them included. @mimaison will have a look at it.

The other topics from the Topic Operator or Cruise Control are regular topics like any other. Whether a lack of messages on them should raise an alert depends on the exact use case and situation, as with any other topic. So they should not be excluded.