prometheus / alertmanager

Prometheus Alertmanager
https://prometheus.io
Apache License 2.0

Send resolved notification only when all alerts are solved #2644

Open faabsen opened 3 years ago

faabsen commented 3 years ago

Similar to #1403; however, our use case is different.

What did you do?

An alert fires with severity: warning, is then raised to severity: critical, and is later lowered back to severity: warning.

What did you expect to see?

A resolved notification only after the entire (grouped) alarm is solved.

What did you see instead? Under which circumstances?

The critical alert sends a resolved status when it clears. As a result, the entire alarm is marked as resolved, even though the warning is still firing.

Environment

Running with the kubernetes-mixin set (https://github.com/kubernetes-monitoring/kubernetes-mixin)

System information:

Kubernetes EKS 1.17

Alertmanager version:

Branch: HEAD
BuildDate: 20190708-14:31:49
BuildUser: root@868685ed3ed0
GoVersion: go1.12.6
Revision: 1ace0f76b7101cccc149d7298022df36039858ca
Version: 0.18.0

Prometheus version:

Version: 2.11.0
Revision: 4ef66003d9855ed2b7a41e987b33828ec36db34d
Branch: HEAD
BuildUser: root@0dc27cf95f36
BuildDate: 20190709-09:54:35
GoVersion: go1.12.7

Alertmanager configuration file:

...
global:
  resolve_timeout: 5m
receivers:
  - name: system-x
    webhook_configs:
      - url: [...]
        send_resolved: true

inhibit_rules:
- source_match: 
    severity: "critical"
  target_match: 
    severity: "warning"
  equal: [ alertname, name, server, common_name ]

route:
  group_by: [cluster, alertname]

  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h
  receiver: system-x

  routes:

  # Alarm for amq monitoring
  - receiver: system-x
    group_wait: 15s
    group_interval: 30s
    match_re:
      job: telegraf-amq-exporter|telegraf-ibm-wmq-exporter
    group_by: [alertname, name]
    repeat_interval: 3m

Timeline of alerts

11:00 -> Alarm 1 with severity: warning
11:00 -> Received notification of Alarm 1 with severity: warning
11:05 -> Alarm 1 raised to severity: critical
11:05 -> Received notification of Alarm 1 with severity: critical
11:10 -> Alarm 1 lowered to severity: warning
11:10 -> Received resolve of Alarm 1 with severity: critical
11:13 -> Received notification of Alarm 1 with severity: warning

Expected behaviour

The resolved status is sent only after the entire (grouped) alarm has been cleared.
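
To make the expected behaviour concrete, here is a minimal sketch of the rule as it could be applied on the receiving side of the webhook. It is not how Alertmanager works internally. It assumes the payload carries the documented `groupKey`, top-level `status`, and per-alert `status` and `labels` fields. The `GroupResolveTracker` class, the `payload` helper, and the label values are hypothetical names made up for this illustration.

```python
from typing import Dict, FrozenSet, Set, Tuple

# An alert is identified by its full label set, so the warning and the
# critical variant of "Alarm 1" count as two different alerts.
AlertKey = FrozenSet[Tuple[str, str]]


class GroupResolveTracker:
    """Hypothetical receiver-side filter: pass a resolved notification on
    only when no alert previously seen firing in that group is left."""

    def __init__(self) -> None:
        self._firing: Dict[str, Set[AlertKey]] = {}

    def should_forward(self, payload: dict) -> bool:
        firing = self._firing.setdefault(payload["groupKey"], set())
        for alert in payload["alerts"]:
            key = frozenset(alert["labels"].items())
            if alert["status"] == "firing":
                firing.add(key)
            else:
                firing.discard(key)
        if payload["status"] == "firing":
            return True  # firing notifications always pass through
        return not firing  # resolved passes only once the group is empty


def payload(status: str, severity: str) -> dict:
    """Build a stripped-down webhook payload for one alert (illustrative values)."""
    labels = {"alertname": "Alarm1", "name": "queue-a", "severity": severity}
    return {
        "groupKey": "alarm1-group",  # placeholder, not a real group key
        "status": status,
        "alerts": [{"status": status, "labels": labels}],
    }


# Replay of the timeline above, plus a final resolve of the warning.
tracker = GroupResolveTracker()
decisions = [
    tracker.should_forward(payload("firing", "warning")),     # 11:00
    tracker.should_forward(payload("firing", "critical")),    # 11:05
    tracker.should_forward(payload("resolved", "critical")),  # 11:10
    tracker.should_forward(payload("firing", "warning")),     # 11:13
    tracker.should_forward(payload("resolved", "warning")),   # later
]
print(decisions)  # [True, True, False, True, True]
```

In this sketch the 11:10 resolve of the critical alert is held back because the warning seen at 11:00 has not been resolved yet. The sketch only knows about alerts it has been notified of, so an alert that clears while inhibited might never be removed from its state.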

wHiteeeeeeeee commented 2 years ago

Any updates here?