add test to cover duplicate alert names

kubernetes-monitoring / kubernetes-mixin

A set of Grafana dashboards and Prometheus alerts for Kubernetes.

Apache License 2.0


add test to cover duplicate alert names #384

Open paulfantom opened 4 years ago

paulfantom commented 4 years ago

Currently (as of https://github.com/kubernetes-monitoring/kubernetes-mixin/commit/bf3064885199f90080bec6790f2d27c5ad08184d) there are alerts with the same name but different expressions (e.g. https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/master/alerts/resource_alerts.libsonnet#L26 and https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/master/alerts/resource_alerts.libsonnet#L62). This could be prevented either by parsing the output of promtool and failing when duplicates are detected, or by modifying promtool upstream so that it fails on duplicates. WDYT?
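One way to sketch the proposed test without changing promtool is a small check that walks the rule groups and flags any alert name used with more than one expression. The Python below is only an illustration: it assumes the rules are already loaded into the usual `groups` / `rules` structure (for example from JSON emitted by jsonnet), and the function name `alerts_with_conflicting_exprs` and the sample rules are hypothetical, not part of the mixin.

```python
from collections import defaultdict


def alerts_with_conflicting_exprs(rule_file):
    """Return {alert name: sorted distinct exprs} for names used with more than one expr."""
    exprs_by_name = defaultdict(set)
    for group in rule_file.get("groups", []):
        for rule in group.get("rules", []):
            name = rule.get("alert")
            if name is None:  # recording rules carry "record" instead of "alert"
                continue
            # Collapse whitespace so formatting differences alone do not count.
            exprs_by_name[name].add(" ".join(rule.get("expr", "").split()))
    return {
        name: sorted(exprs)
        for name, exprs in exprs_by_name.items()
        if len(exprs) > 1
    }


# Hypothetical sample data, not taken from the mixin.
example = {
    "groups": [
        {
            "name": "example",
            "rules": [
                {"alert": "A", "expr": "x > 1", "labels": {"severity": "warning"}},
                {"alert": "A", "expr": "x > 2", "labels": {"severity": "critical"}},
                {"alert": "B", "expr": "y > 1", "labels": {"severity": "warning"}},
            ],
        }
    ]
}

conflicts = alerts_with_conflicting_exprs(example)
```

A test could fail whenever `conflicts` is non-empty. Note that a check this strict would also reject the common pattern of one alert name with several thresholds.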

brancz commented 4 years ago

Alerts with the same name but different thresholds to decide the severity are quite common; I don't think that's a generally applicable rule.

paulfantom commented 4 years ago

> Alerts with the same name but different thresholds to decide the severity are quite common

For some reason, promtool doesn't treat those as duplicates.

povilasv commented 4 years ago

As far as I know, we don't treat alerts with the same name as duplicates, because they are not.

For example:

  - alert: KubeAPIErrorBudgetBurn
    annotations:
      message: The API server is burning too much error budget
    expr: |
      sum(apiserver_request:burnrate1h) > (14.40 * 0.01000)
      and
      sum(apiserver_request:burnrate5m) > (14.40 * 0.01000)
    for: 2m
    labels:
      severity: critical
  - alert: KubeAPIErrorBudgetBurn
    annotations:
      message: The API server is burning too much error budget
    expr: |
      sum(apiserver_request:burnrate6h) > (6.00 * 0.01000)
      and
      sum(apiserver_request:burnrate30m) > (6.00 * 0.01000)
    for: 15m
    labels:
      severity: critical

This is completely OK and works in Prometheus/Alertmanager.

This is what we used in Thanos to deduplicate alerts -> https://github.com/thanos-io/thanos/pull/2263/files#diff-5e861d5245f0514ca397e835ef86ca05R62-R69

povilasv commented 4 years ago

Actually, promtool complains:

promtool check rules prometheus_alerts.yaml
Checking prometheus_alerts.yaml
2 duplicate rules(s) found.
Metric: KubeAPIErrorBudgetBurn
Label(s):
        severity: critical
Metric: KubeAPIErrorBudgetBurn
Label(s):
        severity: warning

metalmatze commented 4 years ago

Has anyone followed up with promtool upstream?