Open roidelapluie opened 4 years ago
Picture of the first incident in Prometheus (pending, then firing)
From a backup taken between the incidents (24th, 20h), we see that the alert is still there (but it was no longer firing in Prometheus).
Entry: {}/{recipient=~"^(?:(.*,)?appteam/ticket(,.*)?)$"}/{repeat_interval=""}:{alertid="XXX-00014", customer_name="XXX", env="prod", hostname="XXX-mtprd01", priority="P1", recipient="appteam/ticket,XXX/circuit", title="JVM Down"}:appteam/ticket/webhook/0
{
  "entry": {
    "groupKey": "XXX",
    "receiver": {
      "groupName": "appteam/ticket",
      "integration": "webhook"
    },
    "timestamp": "2019-09-24T15:53:55.222089917Z",
    "firingAlerts": [
      "8684883655238988612"
    ]
  },
  "expiresAt": "2019-09-29T15:53:55.222089917Z"
}
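As a quick sanity check on the entry above, here is a minimal Python sketch that computes how long the nflog entry lives. It only uses the two timestamps shown in the dump; the helper name `parse_ts` is mine, and it drops the fractional seconds because they are identical in both fields. The result is 5 days, which would match Alertmanager's default `--data.retention` of 120h, assuming that flag was left at its default here.

```python
import json
from datetime import datetime, timedelta, timezone

# The entry as dumped above (groupKey is redacted in the original).
RAW_ENTRY = """
{
  "entry": {
    "groupKey": "XXX",
    "receiver": {"groupName": "appteam/ticket", "integration": "webhook"},
    "timestamp": "2019-09-24T15:53:55.222089917Z",
    "firingAlerts": ["8684883655238988612"]
  },
  "expiresAt": "2019-09-29T15:53:55.222089917Z"
}
"""


def parse_ts(value):
    """Parse an RFC 3339 UTC timestamp, ignoring the fractional seconds."""
    base = value.rstrip("Z").split(".")[0]
    return datetime.strptime(base, "%Y-%m-%dT%H:%M:%S").replace(tzinfo=timezone.utc)


data = json.loads(RAW_ENTRY)
written = parse_ts(data["entry"]["timestamp"])
expires = parse_ts(data["expiresAt"])
lifetime = expires - written

print("entry written:", written.isoformat())
print("entry expires:", expires.isoformat())
print("lifetime:", lifetime)
```

So an entry written during the first incident is still present in the nflog when the second incident happens a day later.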
For anyone interested:
My test has shown that I have another alert in this state.
My Prometheus config used for the test:
- job_name: nflogerror
  honor_timestamps: true
  scrape_interval: 30s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  file_sd_configs:
  - files:
    - /etc/prometheus/prometheus.d/nflogerror_exporter_*.yml
    refresh_interval: 5m
  metric_relabel_configs:
  - source_labels: [__name__]
    separator: ;
    regex: ALERTS_IN_NFLOG_NOT_FIRING_([0-9]+)_([0-9]+)_.*
    target_label: GROUPID
    replacement: $1
    action: replace
  - source_labels: [__name__]
    separator: ;
    regex: ALERTS_IN_NFLOG_NOT_FIRING_([0-9]+)_([0-9]+)_.*
    target_label: ALERTID
    replacement: $2
    action: replace
  - source_labels: [__name__]
    separator: ;
    regex: (ALERTS_IN_NFLOG_NOT_FIRING)_[0-9]+_[0-9]+_(.*)
    target_label: __name__
    replacement: ${1}_$2
    action: replace
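To make the relabeling easier to follow, here is a rough Python sketch of what the three `metric_relabel_configs` rules do to a metric name. It is not Prometheus code: `relabel` and `apply_rules` are hypothetical helpers, the sample name with IDs `12` and `34` is made up, and the `$1` / `${1}_$2` replacements are rewritten in Python's `\g<N>` syntax. It relies on the fact that Prometheus anchors relabel regexes at both ends, which `re.fullmatch` imitates.

```python
import re

# (regex, target_label, replacement) for each rule, in config order.
RULES = [
    (r"ALERTS_IN_NFLOG_NOT_FIRING_([0-9]+)_([0-9]+)_.*", "GROUPID", r"\g<1>"),
    (r"ALERTS_IN_NFLOG_NOT_FIRING_([0-9]+)_([0-9]+)_.*", "ALERTID", r"\g<2>"),
    (r"(ALERTS_IN_NFLOG_NOT_FIRING)_[0-9]+_[0-9]+_(.*)", "__name__", r"\g<1>_\g<2>"),
]


def relabel(labels, regex, target_label, replacement):
    """Imitate one 'replace' rule whose only source label is __name__."""
    match = re.fullmatch(regex, labels["__name__"])
    if match is None:
        # No match: a 'replace' rule leaves the label set untouched.
        return labels
    updated = dict(labels)
    updated[target_label] = match.expand(replacement)
    return updated


def apply_rules(labels):
    """Apply the rules one after the other, as Prometheus does."""
    for regex, target_label, replacement in RULES:
        labels = relabel(labels, regex, target_label, replacement)
    return labels


# Hypothetical scraped series name: group 12, alert 34.
sample = {"__name__": "ALERTS_IN_NFLOG_NOT_FIRING_12_34_timestamp_seconds"}
result = apply_rules(sample)
print(result)
```

The order matters: the first two rules read the IDs out of the original name, and only the third rule strips them from `__name__`, so every series ends up under one common metric name with `GROUPID` and `ALERTID` as labels.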
My PromQL query:
time() - ALERTS_IN_NFLOG_NOT_FIRING_timestamp_seconds > 86400
I will watch the outcome over the coming days. So far, all of the impacted notifications involve multiple recipients.
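In plain terms, the query keeps only the series whose timestamp is more than 86400 seconds (one day) in the past. A small Python sketch of the same comparison, assuming the metric value is a Unix timestamp in seconds; `is_stale` is a hypothetical helper, the entry time is the one from the nflog dump above, and the two evaluation times are made up for illustration:

```python
from datetime import datetime, timezone

THRESHOLD_SECONDS = 86400  # one day, as in the query


def is_stale(entry_ts, now_ts, threshold=THRESHOLD_SECONDS):
    """Mirror of `time() - metric > 86400` for a single sample."""
    return now_ts - entry_ts > threshold


# Timestamp of the nflog entry, truncated to whole seconds.
entry_ts = datetime(2019, 9, 24, 15, 53, 55, tzinfo=timezone.utc).timestamp()

# Hypothetical evaluation times.
same_evening = datetime(2019, 9, 24, 20, 0, 0, tzinfo=timezone.utc).timestamp()
two_days_later = datetime(2019, 9, 26, 16, 0, 0, tzinfo=timezone.utc).timestamp()

print("same evening:", is_stale(entry_ts, same_evening))
print("two days later:", is_stale(entry_ts, two_days_later))
```

The comparison is strict, so a sample that is exactly one day old does not match yet.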
What did you do?
We have an alert with 3 webhook receivers.
The alert was sent yesterday evening, then again today.
But the alerts of today were not sent to the webhook receivers.
What did you expect to see?
Calls to webhooks yesterday and today.
What did you see instead? Under which circumstances?
The webhooks were only called yesterday.
Environment
* Alertmanager version: 0.18.0
* Prometheus version: 2.11 and 2.12
Deserialization of NFLOG during the second incident (where we did not receive notifications):
Deserialization AFTER today event is resolved:
The timestamp of the first one is the BEGINNING of the first event. The timestamp of the second one is the END of the second event.
I would have expected the first one to be the END of the first event?