
Alertmanager: Grafana unusable with HA-configured Prometheus Alertmanager #220

Open tonypowa opened 7 months ago

tonypowa commented 7 months ago

What happened?

<tl;dr>

Grafana by default has no way of listing several Alertmanager instances in an HA configuration as a single datasource. For reading alerts this is fine, but when sending alerts, they must be sent to all Alertmanager instances, because Alertmanager does not propagate alerts between instances by itself.

</tl;dr>

I have configured Grafana to use a Prometheus Alertmanager and forward all Grafana alerts to those instances, using a Kubernetes headless SVC DNS entry in the Grafana datasources UI.

When Grafana sends alerts to the Alertmanagers, they end up being load balanced, so each Alertmanager instance has a different set of active alerts from the rest.

The "endsAt" field that autoresolves an alert if no new updates is received is also sometimes triggered making the alert appear to be flopping, even though the Grafana configured alert is still firing.


alertmanager-0 alertmanager ts=2023-12-22T14:24:05.818Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert="Disk Space cpmgmt1[2223b1a][active]"

alertmanager-2 alertmanager ts=2023-12-22T14:24:09.800Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert="Disk Space cp-fw1-2[0e36b69][active]"

alertmanager-0 alertmanager ts=2023-12-22T14:24:09.872Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert="Disk Space cp-fw2-1[b41b20f][active]"

Above is an excerpt from the Prometheus Alertmanager logs across all (3) instances, filtered on the alert name "Disk Space.*". It shows that the alert configured in Grafana is not sent to all Alertmanagers: each unique alert is seen by only one Alertmanager.

This causes the next update of the alert state to go to another Alertmanager instance, so the original alert resolves when its endsAt field times out, normally after a few minutes.

What did you expect to happen?

According to the Alertmanager documentation, Alertmanager does not sync alerts between instances; all alerts must be sent to all Alertmanagers.

This does not happen when Grafana alerts are sent to Alertmanager. Filtering the logs on an alert sent from Prometheus clearly shows that the alert is sent to all instances at the same time:


alertmanager-0 alertmanager ts=2023-12-22T14:31:26.216Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=KubePodCrashLooping[f136c7c][active]

alertmanager-1 alertmanager ts=2023-12-22T14:31:26.216Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=KubePodCrashLooping[f136c7c][active]

alertmanager-2 alertmanager ts=2023-12-22T14:31:26.217Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=KubePodCrashLooping[f136c7c][active]

When using a headless SVC in Kubernetes, which translates to multiple A records, I expect Grafana to multiplex and send alerts to all Alertmanagers, as this is clearly how Alertmanager expects to be used.
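Roughly the behaviour I would expect, sketched with the same hypothetical headless-service name and the Alertmanager v2 API (this is an illustration of the desired behaviour, not a claim about how Grafana should implement it internally):

```python
import socket

import requests  # third-party HTTP client: pip install requests

# Hypothetical names -- substitute whatever your deployment actually uses.
HEADLESS_SVC = "alertmanager-headless.monitoring.svc.cluster.local"
AM_PORT = 9093

# A single alert in the Alertmanager v2 API shape.
ALERT = [{
    "labels": {"alertname": "Disk Space cpmgmt1", "severity": "warning"},
    "annotations": {"summary": "disk space low on cpmgmt1"},
}]

# A headless Service has no cluster IP, so DNS returns one A record per pod.
records = socket.getaddrinfo(HEADLESS_SVC, AM_PORT, proto=socket.IPPROTO_TCP)
pod_ips = sorted({rec[4][0] for rec in records})

# Send the same alert to every instance, which is what Prometheus itself does
# when configured with all Alertmanager endpoints.
for ip in pod_ips:
    resp = requests.post(f"http://{ip}:{AM_PORT}/api/v2/alerts", json=ALERT, timeout=5)
    resp.raise_for_status()
    print(f"delivered to {ip}: HTTP {resp.status_code}")
```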

I would also expect Grafana to detect this, or somehow communicate the issue in the UI.

Did this work before?

No

How do we reproduce it?

  1. Set up Prometheus Alertmanager in HA mode

  2. Set up Prometheus Alertmanager as the default destination for all alerts

  3. Create more than one alert that is firing in Grafana

  4. Retrieve the status of alerts from the different Alertmanager instances and observe the differences (see the sketch below)
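For step 4, a small sketch of how to observe the differences, assuming per-pod DNS names for a 3-replica StatefulSet behind a hypothetical headless service (the exact names depend on your StatefulSet and namespace):

```python
import requests  # third-party HTTP client: pip install requests

# Hypothetical per-pod DNS names -- substitute the ones from your deployment.
PODS = [
    f"alertmanager-{i}.alertmanager-headless.monitoring.svc.cluster.local"
    for i in range(3)
]

for pod in PODS:
    # GET /api/v2/alerts lists the alerts this instance currently considers active.
    alerts = requests.get(f"http://{pod}:9093/api/v2/alerts", timeout=5).json()
    names = sorted(a["labels"].get("alertname", "?") for a in alerts)
    print(pod, names)

# In a healthy HA setup every instance prints the same list; in the scenario
# described above, each instance shows a different subset of the Grafana alerts.
```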

Is the bug inside a dashboard panel?

No response

Environment (with versions)?

Grafana: v10.2.2

OS: Linux/Kubernetes/GKE

Browser: Safari

Grafana platform?

Kubernetes

Datasource(s)?

Prometheus Alertmanager

edit: improved language and explanation

tonypowa commented 7 months ago

This issue is a test copy of an issue in another repo. Original issue: https://github.com/grafana/grafana/issues/#79834

moxious commented 7 months ago

Summary: The GitHub issue reports that Grafana, when used with an HA-configured Prometheus Alertmanager, fails to send alerts to all instances, leading to inconsistent alert states. Because the alerts are load balanced, each Alertmanager instance shows a different set of active alerts, causing alerts to flap and auto-resolve due to timed-out 'endsAt' fields. The expected behavior is for Grafana to send alerts to all Alertmanager instances, in line with the Alertmanager documentation for high-availability setups.

moxious commented 7 months ago

Hello @tonypowa, thank you for the detailed report. It appears your issue is related to how Grafana interacts with Prometheus Alertmanager when set up in HA mode. Because of the nature of the problem concerning alerting and managing Alertmanager instances, this would be best addressed by the Alerting project. I suggest redirecting this issue to their repository so they can provide more insight into the configuration expectations and possible solutions or workarounds for handling alerts in HA scenarios.

moxious commented 7 months ago

Elaboration:

Thank you for submitting your issue and providing detailed information about the behavior you're experiencing when configuring Grafana with a high availability (HA) Prometheus Alertmanager setup. Your description of load balancing leading to inconsistent alert states across Alertmanager instances is a scenario that others may encounter as well, so it's a valuable report. In order to assist the developers and maintainers in addressing your issue, I have a few clarifying questions and requests for additional details:

Providing these additional details can help with diagnosing the issue more precisely. Your cooperation is greatly appreciated!