prometheus / alertmanager


Alertmanager doesn't send some alerts to PagerDuty #3902

Open De4dGho5t opened 1 week ago

De4dGho5t commented 1 week ago

I'm getting these errors for some of the alerts in Alertmanager:

ts=2024-06-26T14:45:32.988Z caller=dispatch.go:353 level=error component=dispatcher msg="Notify for alerts failed" num_alerts=3 err="pagerduty-notifications/pagerduty[0]: notify retry canceled due to unrecoverable error after 1 attempts: unexpected status code 400: Event object is invalid: 'payload.severity' is invalid (must be one of the following: 'critical', 'warning', 'error' or 'info')"

Because the log doesn't say which alerts are problematic, I'll show screenshots from PagerDuty and the Alertmanager UI for comparison.

Screenshot from 2024-06-26 16-55-09

Screenshot from 2024-06-26 16-57-32

As you can see, CPUThrottlingHigh and KubePersistentVolumeFillingUp didn't show up in PagerDuty.

These are all the alerts from the Alertmanager API: alerts.json. The log says the severity is invalid, but all alerts have correct severity values.

I'm currently using kube-prometheus-stack version 60.2.0, so the Alertmanager version is quay.io/prometheus/alertmanager:v0.27.0.

Alertmanager logs with debug enabled: alertmanager.log

My alertmanager config:

global:
  resolve_timeout: 1m
route:
  receiver: pagerduty-notifications
  group_by:
  - alertname
  routes:
  - receiver: "null"
    matchers:
    - alertname =~ "InfoInhibitor|Watchdog"
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 10m
inhibit_rules:
- target_matchers:
  - severity =~ warning|info
  source_matchers:
  - severity = critical
  equal:
  - namespace
  - alertname
- target_matchers:
  - severity = info
  source_matchers:
  - severity = warning
  equal:
  - namespace
  - alertname
- target_matchers:
  - severity = info
  source_matchers:
  - alertname = InfoInhibitor
  equal:
  - namespace
  - alertname
receivers:
- name: "null"
- name: pagerduty-notifications
  pagerduty_configs:
  - send_resolved: true
    routing_key: key
    severity: '{{ range .Alerts }}{{ .Labels.severity | toLower }}{{ end }}'
templates:
- /etc/alertmanager/config/*.tmpl
grobinson-grafana commented 1 week ago

Hi! 👋 There are two errors in the template here:

  1. If an alert is missing the severity label, the severity will be empty. This happens because the template expects the label to be present and doesn't set a default when it is missing.

  2. You are writing the severities of all alerts into a single PagerDuty incident, for example criticalcriticalcritical, but PagerDuty only accepts one severity per incident. (See the sketch after this list.)
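A minimal sketch of a fixed severity field under those assumptions: take the severity of the first alert in the group and fall back to a default when the label is missing or empty (error here is an arbitrary choice, pick whatever fits your routing):

# Use the first alert's severity; fall back to "error" when the label is missing or empty.
severity: '{{ with (index .Alerts 0).Labels.severity }}{{ . | toLower }}{{ else }}error{{ end }}'

This only looks at the first alert, which avoids the concatenation problem; a variant that considers every alert in the group is sketched further down.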

De4dGho5t commented 1 week ago

Thank you for the quick answers.

So for point 1, I could add a default value to severity: '{{ range .Alerts }}{{ .Labels.severity | toLower }}{{ end }}' so that some severity is always set; that's a good way to resolve it.

But for point 2, do you have a suggestion for how I can fix it?

grobinson-grafana commented 1 week ago

But for point 2, do you have a suggestion for how I can fix it?

You should be able to use something like this: https://github.com/prometheus/alertmanager/pull/3847#issuecomment-2133108415. You'll need to adapt it a little for PagerDuty, but otherwise it should solve the issue.
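For reference, a rough sketch of how that could be adapted for PagerDuty; this is an assumption about the linked approach rather than the exact template from that comment, and it assumes the alerts only use the severities critical, warning, and info (extend the ladder if you use other values). It replaces the severity line in the pagerduty-notifications receiver, walks over all alerts in the group, and keeps the most severe value found, so the result is always a single severity that PagerDuty accepts:

# Keep the single most severe value found across all alerts in the group.
severity: >-
  {{- $sev := "info" -}}
  {{- range .Alerts -}}
    {{- $s := .Labels.severity | toLower -}}
    {{- if eq $s "critical" -}}{{- $sev = "critical" -}}
    {{- else if and (eq $s "warning") (ne $sev "critical") -}}{{- $sev = "warning" -}}
    {{- end -}}
  {{- end -}}
  {{- $sev -}}

With group_by: [alertname] the alerts in one group usually share the same severity anyway, but this keeps the notification valid even when they don't or when the label is missing.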