Alertmanager does not send resolve for a few cases randomly #2675

Open M0rdecay opened 3 years ago

M0rdecay commented 3 years ago

The original case is here - https://github.com/prometheus/alertmanager/issues/2398. After upgrading to 0.22.2 such situations have become much rarer, but resolved messages still sometimes do not arrive. Typically, the problems begin after 24+ hours of uptime.

The load on the Alertmanagers is very low - 25-30 requests/min.
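
When a resolved message does not arrive, the alert state inside Alertmanager itself can be checked directly, to tell a notification problem apart from the alert simply never resolving. A minimal sketch - the hostname and host port follow the --web.external-url and port mapping in the docker command below, and jq is assumed to be installed:

# List the alerts the node currently considers active;
# a resolved alert should disappear from this list
curl -s http://alertmanager:9095/api/v2/alerts | jq '.[].labels.alertname'

# The same check with amtool
amtool alert query --alertmanager.url=http://alertmanager:9095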

Environment

First node:

global:
  resolve_timeout: 5m
  smtp_require_tls: false
  smtp_from: from@local
  smtp_smarthost: smarthost:25

route: ## top-level tree node with base parameters
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'integration_receiver'
  group_by: ['...']

receivers:
- name: 'integration_receiver'
  email_configs:
  - to: to@local
    send_resolved: true
    headers:
      subject: "{{ .CommonLabels.product }}:{{ .CommonLabels.appl }}:{{ .CommonLabels.alertname }}"
    html: ''
    text: |+
      {
        "appl": "{{ .CommonLabels.appl }}",
        "appl_instance": "{{ .CommonLabels.stand_name }}",
        "appl_product_group": "{{ .CommonLabels.product }}",
        "host": "{{ .CommonLabels.host }}",
        "field": "{{ .CommonLabels.alertname }}",
        "time": "{{ (index .Alerts 0).StartsAt }}",
        {{ if eq .Status "resolved" }}"level": "OK",{{ end -}}
        {{ if eq .Status "firing" }}"level": {{ if eq .CommonLabels.severity "application/container" }}"Warning"{{ else }}"Critical"{{ end }},{{ end }}
        "value": "",
        "prmt_object": "{{- range $k, $v := .CommonLabels -}}{{- if and (ne $k "host") (ne $k "alertname") (ne $k "scope") (ne $k "severity") (ne $k "appl") (ne $k "product") (ne $k "stand_name") -}}{{ $k }}={{ $v | js }};{{- end -}}{{- end -}}"
      }
      Fingerprint: {{ (index .Alerts 0).Fingerprint }}

      Description:
      {{ .CommonAnnotations.description }}

      Links:
      - AlertManager - {{ .ExternalURL }}
      - Grafana panel - {{ (index .Alerts 0).GeneratorURL }}

Second node:

global:
  resolve_timeout: 5m
  smtp_require_tls: false
  smtp_from: from@local
  smtp_smarthost: smarthost:25

route: ## top-level tree node with base parameters
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  group_by: [ stand_name, scope, severity, product ]
  receiver: 'human_common_receiver'
  routes:
  - match: ## area alerts
      scope: area
    continue: true
    routes:
    - match: # app/container level
        severity: "application/container"
      receiver: 'human_no_resolved_receiver'
  - match: ## product alerts
      scope: application
    continue: true
    routes:
    - match: # app/container level
        severity: "application/container"
      receiver: 'human_no_resolved_receiver'
  - match: ## business alerts
      scope: business

receivers:
- name: 'human_common_receiver'
  email_configs:
  - to: another_to@local
    send_resolved: true

- name: 'human_no_resolved_receiver'
  email_configs:
  - to: another_to@local
    send_resolved: false
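
For reference, both configuration files can be sanity-checked with amtool, and the second node's routing tree can be dry-run against a label set. This is only a sketch - the file path is the host directory mounted into the containers below, and the label values are made-up examples:

# Validate the configuration file
amtool check-config /data/alertmanager/config/config.yml

# Ask amtool which receiver a given label set would be routed to;
# this label set is expected to land on 'human_no_resolved_receiver'
amtool config routes test --config.file=/data/alertmanager/config/config.yml \
  scope=application severity=application/container stand_name=test product=demo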

Each node runs in Docker (we are using the Bitnami images):

docker run -d \
--name=alertmanager-0.22.2 \
--log-driver=json-file \
--log-opt "max-size=100m" \
--log-opt "max-file=10" \
-p 9095:9093 \
-p 9094:9094/tcp \
-p 9094:9094/udp \
-v /data/alertmanager/config/:/opt/bitnami/alertmanager/conf/ \
-v /data/alertmanager/alertdata/:/opt/bitnami/alertmanager/data/ \
bitnami/alertmanager:0.22.2-debian-10-r55 \
--config.file=/opt/bitnami/alertmanager/conf/config.yml \
--storage.path=/opt/bitnami/alertmanager/data \
--web.external-url=http://alertmanager:9095 \
--cluster.peer=node1:9094 \
--cluster.peer=node2:9094
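
Missing resolved notifications in an HA pair can also be a symptom of the gossip cluster breaking down, so it is worth confirming that both peers still see each other. A sketch using the API status endpoint on the mapped host port:

# 'status' should be "ready" and 'peers' should list both nodes
curl -s http://alertmanager:9095/api/v2/status | jq '.cluster'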
jividijuvent commented 2 years ago

Hi, did you find any solution? I have a worse problem: sending resolved notifications does not work at all. I am using a webhook.