prometheus / alertmanager


Add Metric Buckets for Notification Latency #3999

Closed: kennytrytek-wf closed this issue 1 month ago

kennytrytek-wf commented 1 month ago

Description

We wanted to use notification_latency_seconds to indicate whether our Alertmanager was overly delayed in sending notifications. The highest configured bucket boundary is 20 seconds, which is lower than we would like. A latency of up to 60 seconds is acceptable to us, but the histogram does not allow for alerting on that latency threshold.

https://github.com/prometheus/alertmanager/blob/c7097ad76c07c7fc325292718115e3de9d0a125f/notify/notify.go#L309

Want

Add a 60s bucket to the metric.
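
For illustration, a minimal sketch of what the change could look like in the histogram definition, using the Go Prometheus client. The existing boundaries (1, 5, 10, 15, 20) and the label set are assumptions based on the linked notify.go rather than a verbatim copy; only the trailing 60 would be new:

package notify

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical sketch: the notification latency histogram with an added 60s
// bucket. Existing boundaries and the label name are assumed, not verbatim.
var notificationLatencySeconds = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Namespace: "alertmanager",
		Name:      "notification_latency_seconds",
		Help:      "The latency of notifications in seconds.",
		Buckets:   []float64{1, 5, 10, 15, 20, 60},
	},
	[]string{"integration"},
)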

grobinson-grafana commented 1 month ago

I think the reason the largest bucket is 20s is that the default peer timeout is 15s. If it takes more than 15s to send a notification, then HA starts failing over. The other issue is that, depending on the Group interval, notifications might never be delivered: each time the Group interval timer elapses, any in-flight notifications from the previous timer are canceled. Here is an example of that:

ts=2024-08-27T09:25:00.016Z caller=dispatch.go:166 level=debug component=dispatcher msg="Received alert" alert=[3fff2c2][active]
ts=2024-08-27T09:25:15.017Z caller=dispatch.go:526 level=debug component=dispatcher aggrGroup={}:{} msg=flushing alerts=[[3fff2c2][active]]
ts=2024-08-27T09:25:45.017Z caller=dispatch.go:357 level=error component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="webhook/webhook[0]: notify retry canceled after 2 attempts: context deadline exceeded"
ts=2024-08-27T09:25:45.017Z caller=dispatch.go:526 level=debug component=dispatcher aggrGroup={}:{} msg=flushing alerts=[[3fff2c2][active]]
ts=2024-08-27T09:26:15.018Z caller=dispatch.go:357 level=error component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="webhook/webhook[0]: notify retry canceled after 2 attempts: context deadline exceeded"
ts=2024-08-27T09:26:15.018Z caller=dispatch.go:526 level=debug component=dispatcher aggrGroup={}:{} msg=flushing alerts=[[3fff2c2][active]]

A latency of up to 60 seconds is acceptable to us, but the histogram does not allow for alerting based on that latency threshold.

Given the above, I would recommend decreasing the latency of the service receiving notifications. If that is not possible, you could also consider queueing them in some intermediate service that immediately acknowledges the notification from Alertmanager, and then forwards the notification on to the intended service.
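
As a rough sketch of what such an intermediate service could look like, assuming the standard Alertmanager webhook integration posting JSON over HTTP; the listen address, endpoint path, queue size, and downstream URL are all illustrative:

package main

import (
	"bytes"
	"io"
	"log"
	"net/http"
	"time"
)

// Hypothetical intermediate receiver: acknowledges Alertmanager immediately
// and forwards the payload asynchronously, so a slow downstream service does
// not show up as Alertmanager notification latency.
func main() {
	queue := make(chan []byte, 1024) // buffered queue of raw webhook payloads

	// Forwarder: drains the queue and relays payloads to the slow downstream
	// service at its own pace.
	go func() {
		client := &http.Client{Timeout: 60 * time.Second}
		for payload := range queue {
			resp, err := client.Post("http://downstream.example/alerts", "application/json", bytes.NewReader(payload))
			if err != nil {
				log.Printf("forward failed: %v", err)
				continue
			}
			resp.Body.Close()
		}
	}()

	// Receiver: reads the Alertmanager webhook body, enqueues it, and returns
	// 200 right away, well under the peer timeout and Group interval.
	http.HandleFunc("/webhook", func(w http.ResponseWriter, r *http.Request) {
		body, err := io.ReadAll(r.Body)
		if err != nil {
			http.Error(w, "read error", http.StatusBadRequest)
			return
		}
		select {
		case queue <- body:
			w.WriteHeader(http.StatusOK)
		default:
			http.Error(w, "queue full", http.StatusServiceUnavailable)
		}
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}

Note that an in-memory queue like this drops notifications if the process restarts; a durable queue would be needed if that matters.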

kennytrytek-wf commented 1 month ago

For a little more context, we're running a self-hosted LGTM stack that has experienced occasional increases in alert notification latency. This behavior occurs after a Mimir ingester pod dies: it looks like traffic continues to be sent from the Alertmanager to the dead ingester for up to five minutes until the ingester pod is replaced. During those five minutes, the latency observed via cortex_alertmanager_notification_latency_seconds increased to around 40 seconds, which falls between the 20s and +Inf bucket boundaries.

Decreasing the latency of the receiving service is not really an option here, since we do not control the code of the running services and are prevented by license from changing it. Interestingly, I don't see any logs indicating alert failures from the Alertmanager container, only increased latency.

grobinson-grafana commented 1 month ago

Hi! Thanks for the context. I'm not sure I understand though.

from the alertmanager to the dead ingester for up to five minutes until the ingester pod is replaced

You are sending notifications from Alertmanager to the Mimir ingesters? I don't see how that's possible? Alertmanager shouldn't be sending any traffic to Mimir ingesters, other than metric scrapes via an agent like Grafana Alloy.

Do you know why the latency of your receiving service is so high? Alertmanager doesn't handle receiving services with high latency well, for the reasons mentioned. You also need to be a little careful when choosing Group wait and Group interval in your Alertmanager configuration, as otherwise you may not receive notifications at all, or may receive endless duplicates because the notification never succeeds from Alertmanager's perspective.

On to the chosen buckets: we can maybe add some more, but can you share why you chose 60 and 300 seconds? How does this choice of buckets help other users? 300 seconds seems excessive?

kennytrytek-wf commented 1 month ago

Okay, sorry about that. I reviewed my notes again, and I think I conflated two different things. We are not sending notifications to ingesters, but we are using the ruler component, which does query the ingesters. Because an ingester cycled, the ruler received errors querying the ingesters. There was a period of latency reported on Mimir notifications that began precisely when the ingester went down and ended precisely as the new ingester became ready.

Does the reported metric also take into account the time for the ruler to query and get a result? If the metric only reports the duration from Alertmanager to the receiving service, then I can understand that we would need to introduce a more stable alert receiver, probably a simple queue.

Our group_* settings are the defaults, so there shouldn't be anything to do there.

can you share why you chose 60 and 300 seconds?

We chose those values because 60 seconds is an acceptable delay for a critical notification to be sent, assuming that is unusual behavior; typically we would expect the latency to be much lower, less than a second. 300 seconds would help distinguish between a high-priority issue and a critical one. More than a five-minute delay in alerting starts to significantly hamper remediation time and increases the chances of an issue being felt by customers.

Is there a better place to continue this conversation? I don't know if it really fits an Alertmanager issue, but I'm fine continuing here if it's not a problem.

grobinson-grafana commented 1 month ago

If the metric only reports the duration from the alertmanager to the receiving service, then I can understand we would need to introduce a more stable alert receiver, probably a simple queue.

That is correct. The metric just measures the duration of notifications between Alertmanager and the receiving service. It doesn't have anything to do with the ruler, whether Prometheus or Mimir.
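
Conceptually (this is an illustration, not the actual Alertmanager code), the measurement wraps only the outbound notify call, so upstream rule evaluation and ruler queries are never included:

package notify

import (
	"context"
	"time"
)

// Notifier stands in for an Alertmanager integration (webhook, email, ...).
type Notifier interface {
	Notify(ctx context.Context) error
}

// observeLatency shows, in simplified form, what the histogram covers: only
// the time spent delivering the notification to the receiving service.
func observeLatency(ctx context.Context, n Notifier, observe func(seconds float64)) error {
	start := time.Now()
	err := n.Notify(ctx)
	observe(time.Since(start).Seconds())
	return err
}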

kennytrytek-wf commented 1 month ago

👍 Nothing to do here.