prometheus / alertmanager

Alert not sent #2048

Open · roidelapluie opened this issue 4 years ago

roidelapluie commented 4 years ago

What did you do?

We have an alert with 3 webhook receivers.

The alert fired yesterday evening, then again today.

But today's alerts were not sent to the webhook receivers.

What did you expect to see?

Calls to webhooks yesterday and today.

What did you see instead? Under which circumstances?

The webhooks were only called yesterday.

Environment

*

Deserialization of NFLOG during the second incident (where we did not receive notifications):

Entry: {}/{recipient=~"^(?:(.*,)?appteam/ticket(,.*)?)$"}/{repeat_interval=""}:{alertid="XXX-00014", customer_name="XXX", env="prod", hostname="XXX-mtprd01", priority="P1", recipient="appteam/ticket,XXX/circuit", title="JVM Down"}:appteam/ticket/webhook/0
{
  "entry": {
    "groupKey": "XXX",
    "receiver": {
      "groupName": "appteam/ticket",
      "integration": "webhook"
    },
    "timestamp": "2019-09-24T15:53:55.222089917Z",
    "firingAlerts": [
      "8684883655238988612"
    ]
  },
  "expiresAt": "2019-09-29T15:53:55.222089917Z"
}

Deserialization AFTER today's event is resolved:

Entry: {}/{recipient=~"^(?:(.*,)?appteam/ticket(,.*)?)$"}/{repeat_interval=""}:{alertid="XXX-00014", customer_name="XXX", env="prod", hostname="XXX-mtprd01", priority="P1", recipient="appteam/ticket,XXX/circuit", title="JVM Down"}:appteam/ticket/webhook/0
{
  "entry": {
    "groupKey": "XXX",
    "receiver": {
      "groupName": "appteam/ticket",
      "integration": "webhook"
    },
    "timestamp": "2019-09-25T15:20:55.074281100Z",
    "resolvedAlerts": [
      "8684883655238988612"
    ]
  },
  "expiresAt": "2019-09-30T15:20:55.074281100Z"
}

The timestamp of the first one is the BEGINNING of the first event. The timestamp of the second one is the END of the second event.

I would have expected the first one to be the END of the first event?
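
For context on why that timestamp matters: as I understand the notify pipeline, a notification for the same set of firing alerts is only re-sent once repeat_interval has elapsed since the timestamp recorded in the nflog entry. A minimal sketch of that decision in Go (simplified, with illustrative names; this is not Alertmanager's actual code):

package main

import (
    "fmt"
    "time"
)

// nflogEntry mirrors the fields we care about from the dumps above
// (an illustrative struct, not the real nflogpb type).
type nflogEntry struct {
    Timestamp    time.Time
    FiringAlerts []uint64
}

// needsNotify is a simplified version of the dedup decision: re-notify if
// the set of firing alerts changed, or if repeatInterval has elapsed since
// the last recorded notification.
func needsNotify(prev *nflogEntry, firing []uint64, repeatInterval time.Duration, now time.Time) bool {
    if prev == nil {
        return true // nothing in the nflog yet: always notify
    }
    if !equalUint64(prev.FiringAlerts, firing) {
        return true // the set of firing alerts changed
    }
    // Same alerts still firing: only notify again after repeat_interval.
    return now.Sub(prev.Timestamp) >= repeatInterval
}

func equalUint64(a, b []uint64) bool {
    if len(a) != len(b) {
        return false
    }
    for i := range a {
        if a[i] != b[i] {
            return false
        }
    }
    return true
}

func main() {
    // Entry from the dump above: the timestamp is the START of the first incident.
    prev := &nflogEntry{
        Timestamp:    time.Date(2019, 9, 24, 15, 53, 55, 0, time.UTC),
        FiringAlerts: []uint64{8684883655238988612},
    }
    now := time.Date(2019, 9, 25, 10, 0, 0, 0, time.UTC)
    // Same alert hash firing again with a long repeat_interval: the second
    // incident is treated as a duplicate and skipped (prints "false").
    fmt.Println(needsNotify(prev, []uint64{8684883655238988612}, 24*time.Hour, now))
}

If that is roughly what happens, a stale firing entry whose timestamp is the beginning of the first incident would keep suppressing notifications until repeat_interval elapses or the firing set changes.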

roidelapluie commented 4 years ago

Picture of the first incident in Prometheus (pending, then firing):


roidelapluie commented 4 years ago

From a backup taken between the incidents (24th, 20:00), we see that the alert is still there (but it was no longer firing in Prometheus):

Entry: {}/{recipient=~"^(?:(.*,)?appteam/ticket(,.*)?)$"}/{repeat_interval=""}:{alertid="XXX-00014", customer_name="XXX", env="prod", hostname="XXX-mtprd01", priority="P1", recipient="appteam/ticket,XXX/circuit", title="JVM Down"}:appteam/ticket/webhook/0
{
  "entry": {
    "groupKey": "XXX",
    "receiver": {
      "groupName": "appteam/ticket",
      "integration": "webhook"
    },
    "timestamp": "2019-09-24T15:53:55.222089917Z",
    "firingAlerts": [
      "8684883655238988612"
    ]
  },
  "expiresAt": "2019-09-29T15:53:55.222089917Z"
}

roidelapluie commented 4 years ago

For anyone interested:

https://github.com/roidelapluie/nflogerror_exporter
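
For reference, roughly what such an exporter has to do, as far as I can tell: decode the nflog snapshot and inspect each entry. A hedged sketch, assuming the snapshot file is a stream of length-delimited nflogpb.MeshEntry messages (which is what the dumps above look like); the snapshot path is illustrative:

package main

import (
    "fmt"
    "os"

    "github.com/matttproud/golang_protobuf_extensions/pbutil"
    "github.com/prometheus/alertmanager/nflog/nflogpb"
)

func main() {
    // Illustrative path: whatever --storage.path points at, plus "nflog".
    f, err := os.Open("/var/lib/alertmanager/nflog")
    if err != nil {
        panic(err)
    }
    defer f.Close()

    for {
        var e nflogpb.MeshEntry
        if _, err := pbutil.ReadDelimited(f, &e); err != nil {
            break // io.EOF once the snapshot has been fully read
        }
        // Print the group key, receiver and notification timestamp for each
        // entry: the same information shown in the JSON dumps above.
        fmt.Printf("%s %s/%s firing=%v ts=%v\n",
            e.Entry.GroupKey,
            e.Entry.Receiver.GroupName, e.Entry.Receiver.Integration,
            e.Entry.FiringAlerts, e.Entry.Timestamp)
    }
}

Note that, as far as I know, the on-disk snapshot is only written during periodic maintenance and on shutdown, so it can lag the in-memory state.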

roidelapluie commented 4 years ago

My test has shown that I have another alert in this state.

My Prometheus scrape config used for the test (see the sketch after the config for what the relabel rules produce):

- job_name: nflogerror
  honor_timestamps: true
  scrape_interval: 30s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  file_sd_configs:
  - files:
    - /etc/prometheus/prometheus.d/nflogerror_exporter_*.yml
    refresh_interval: 5m
  metric_relabel_configs:
  - source_labels: [__name__]
    separator: ;
    regex: ALERTS_IN_NFLOG_NOT_FIRING_([0-9]+)_([0-9]+)_.*
    target_label: GROUPID
    replacement: $1
    action: replace
  - source_labels: [__name__]
    separator: ;
    regex: ALERTS_IN_NFLOG_NOT_FIRING_([0-9]+)_([0-9]+)_.*
    target_label: ALERTID
    replacement: $2
    action: replace
  - source_labels: [__name__]
    separator: ;
    regex: (ALERTS_IN_NFLOG_NOT_FIRING)_[0-9]+_[0-9]+_(.*)
    target_label: __name__
    replacement: ${1}_$2
    action: replace
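
To illustrate what those relabel rules produce, here is a sketch with a hypothetical sample metric name (the real GROUPID/ALERTID values are whatever the exporter encodes in the name). The series ends up renamed to ALERTS_IN_NFLOG_NOT_FIRING_timestamp_seconds with GROUPID and ALERTID labels, so the PromQL query below can select it:

package main

import (
    "fmt"
    "regexp"
)

func main() {
    // Hypothetical series name as exposed by the exporter.
    name := "ALERTS_IN_NFLOG_NOT_FIRING_123_456_timestamp_seconds"

    // The three relabel rules above, combined into one anchored regex
    // (Prometheus wraps relabel regexes in ^(?:...)$ itself).
    re := regexp.MustCompile(`^(?:(ALERTS_IN_NFLOG_NOT_FIRING)_([0-9]+)_([0-9]+)_(.*))$`)
    m := re.FindStringSubmatch(name)
    if m == nil {
        panic("metric name does not match the relabel regex")
    }

    groupID := m[2]              // rule 1: target_label GROUPID, replacement $1
    alertID := m[3]              // rule 2: target_label ALERTID, replacement $2
    newName := m[1] + "_" + m[4] // rule 3: __name__ rewritten to ${1}_$2

    // Prints: ALERTS_IN_NFLOG_NOT_FIRING_timestamp_seconds{GROUPID="123", ALERTID="456"}
    fmt.Printf("%s{GROUPID=%q, ALERTID=%q}\n", newName, groupID, alertID)
}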

My PromQL query:

time() - ALERTS_IN_NFLOG_NOT_FIRING_timestamp_seconds > 86400

roidelapluie commented 4 years ago

I will now watch the outcome over the coming days. So far, every impacted notification involves multiple recipients.