prometheus / alertmanager

Prometheus Alertmanager
https://prometheus.io
Apache License 2.0

OpsGenie cross-team deduplication - alias prefix #3639

Open kbudde opened 7 months ago

kbudde commented 7 months ago

OpsGenie has built-in deduplication of alerts. This is a problem for us because we have multiple teams using the same OpsGenie account. Deduplication is based on the OpsGenie alias field. This field is set to the alertmanager alert key.Hash. This means that if two teams have the same alert key, they will be deduplicated even if they are different alerts.
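To illustrate the collision, here is a minimal sketch. It assumes the alias is the hex SHA-256 digest of the key, which is my reading of `key.Hash`; the function name `opsgenie_alias` and the example key are hypothetical and only serve the illustration.

```python
import hashlib


def opsgenie_alias(group_key: str) -> str:
    """Return a hex SHA-256 digest of the key (assumed to mirror key.Hash)."""
    return hashlib.sha256(group_key.encode("utf-8")).hexdigest()


# Hypothetical key; real keys are produced by Alertmanager itself.
group_key = '{}:{alertname="KubeNodeNotReady"}'

# Two teams whose notifications carry the same key get the same alias.
alias_team_a = opsgenie_alias(group_key)
alias_team_b = opsgenie_alias(group_key)

print(alias_team_a == alias_team_b)
```

Nothing in the digest identifies the team, so OpsGenie has no way to tell the two notifications apart.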

As a result, we have alerts which reach only one team and are silently deduplicated for the other. The same can happen when alerts resolve: if the alert is resolved in the team that did not receive it, it will not be resolved in the team that did receive it, and it stays open until it is resolved manually.
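A toy model of the create path, to make the effect concrete. This is not the OpsGenie API; `ToyOpsgenie`, `OpenAlert`, and the team names are hypothetical, and the only behaviour modelled is "an open alert with the same alias absorbs later creates".

```python
from dataclasses import dataclass


@dataclass
class OpenAlert:
    owner_team: str  # team whose integration opened the alert
    count: int = 1   # how often the alias was seen while the alert was open


class ToyOpsgenie:
    """Toy model of account-wide deduplication by alias."""

    def __init__(self) -> None:
        self.open_alerts: dict[str, OpenAlert] = {}

    def create(self, team: str, alias: str) -> bool:
        """Return True if a new alert was opened, False if it was deduplicated."""
        existing = self.open_alerts.get(alias)
        if existing is not None:
            existing.count += 1  # deduplicated: only the counter changes
            return False
        self.open_alerts[alias] = OpenAlert(owner_team=team)
        return True


account = ToyOpsgenie()
alias = "same-group-key-hash"  # placeholder for the shared alias

opened_for_a = account.create("team-a", alias)
opened_for_b = account.create("team-b", alias)
```

In this model `opened_for_b` is `False`: the second team's notification only bumps the counter on the first team's alert.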

One workaround is to configure a prefix for the Prometheus integration in OpsGenie. This will add the prefix to the alias field. This will prevent deduplication between teams. Unfortunately, this does not support updating summary and description fields (config: update_alerts).

Another workaround is to change the "group by" rules in alertmanager and add another label, e.g. team. But this configuration is error prone.

Proposed solution: Add a new field to the OpsGenie integration configuration in alertmanager. Its value is prepended to the alias that alertmanager sends, so every request (create, update, close) carries the same prefixed alias. This will allow us to configure a prefix per team and prevent deduplication between teams.
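A minimal sketch of the idea, assuming the alias is a hex SHA-256 digest of the key. The parameter name `alias_prefix` and the prefix values are hypothetical and not necessarily what the final configuration field would be called.

```python
import hashlib


def opsgenie_alias(group_key: str, alias_prefix: str = "") -> str:
    """Build the alias as an optional per-receiver prefix plus the key digest."""
    digest = hashlib.sha256(group_key.encode("utf-8")).hexdigest()
    return alias_prefix + digest


# Hypothetical key shared by both teams.
group_key = '{}:{alertname="KubeNodeNotReady"}'

alias_team_a = opsgenie_alias(group_key, alias_prefix="team-a-")
alias_team_b = opsgenie_alias(group_key, alias_prefix="team-b-")

print(alias_team_a != alias_team_b)
```

Because the prefix is applied on the sending side, create, update, and close all use the same alias for a given team.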

It will also allow use cases where the alerts from different monitoring clusters should not be deduplicated. https://github.com/prometheus/alertmanager/issues/1598

grobinson-grafana commented 3 months ago

Hi! :wave:

> This field is set to the alertmanager alert key.Hash.

It should be based on the Group Key from what I can see in the code.

> This means that if two teams have the same alert key, they will be deduplicated even if they are different alerts.

To make sure I understand the issue, you have (at least) two teams that either share the same route, or you have two routes with the same receiver, matchers and continue: true?

kbudde commented 2 months ago

Hi, the issue can happen in several ways, but the simplest is probably this: two teams share the same OpsGenie organisation, both have access to the same cluster, and both monitor cluster health (e.g. using the default set of alerts).

Only one of the two teams will see the alerts in OpsGenie, due to OpsGenie's deduplication.

grobinson-grafana commented 2 months ago

> Two teams share the same OpsGenie organisation, both have access to the same cluster, and both monitor cluster health (e.g. using the default set of alerts).

So, just to make sure I understand 100%: there is an alert (or set of alerts) that is routed to both teams, either via a single route that has a receiver with multiple integrations, or via two different routes. For example, something like this:

[Diagram: issue-3639-example]

However, in this case, the alert delivered to both teams in Opsgenie has the same key because they are the same alert. This means for example, that Team 1 will receive the alert and Team 2 will not?

> One workaround is to configure a prefix for the Prometheus integration in OpsGenie. This will add the prefix to the alias field. This will prevent deduplication between teams. Unfortunately, this does not support updating summary and description fields (config: update_alerts).

Can you tell me more about how this works and why the summary and description is not updated?

> Another workaround is to change the "group by" rules in alertmanager and add another label, e.g. team. But this configuration is error prone.

Would this even work if both teams share the same alerts, as in the diagram above?

kbudde commented 2 months ago

> However, in this case, the alert delivered to both teams in Opsgenie has the same key because they are the same alert. This means for example, that Team 1 will receive the alert and Team 2 will not?

Yes, exactly. This is described here (although the page leaves out that deduplication ignores team boundaries): https://support.atlassian.com/opsgenie/docs/what-is-alert-de-duplication/

In the configuration of the Prometheus integration in OpsGenie, you can add a unique prefix to the alias per API key (this needs a license tier above basic):

[image]

This is supported for Create, Close, Acknowledge, and for adding notes to alerts. So this workaround adds a prefix on the server side for each integration, and each team now receives create and close events for alerts 👍🏽

But there is no alias configuration for "Update alert" in OpsGenie. As a result, an update will not reach the existing OpsGenie alert. This happens when another alert is added to an existing alert group in alertmanager and update_alerts is enabled.
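A toy model of why the update gets lost, under the assumption described above: the server-side prefix is applied on create and close but not on update. `ToyIntegration` and its methods are hypothetical and do not correspond to real OpsGenie API calls.

```python
class ToyIntegration:
    """Toy model of one integration with a server-side alias prefix.

    Assumption: the prefix is applied to create and close, not to update.
    """

    def __init__(self, prefix: str, store: dict) -> None:
        self.prefix = prefix
        self.store = store  # account-wide mapping: stored alias -> message

    def create(self, alias: str, message: str) -> None:
        self.store.setdefault(self.prefix + alias, message)

    def update(self, alias: str, message: str) -> bool:
        # No prefix is added here, so the lookup uses the bare alias.
        if alias not in self.store:
            return False
        self.store[alias] = message
        return True

    def close(self, alias: str) -> bool:
        return self.store.pop(self.prefix + alias, None) is not None


store: dict = {}
team_a = ToyIntegration("team-a-", store)
team_b = ToyIntegration("team-b-", store)

team_a.create("abc123", "1 alert firing")
team_b.create("abc123", "1 alert firing")
created = len(store)  # both teams now have their own alert

updated = team_a.update("abc123", "2 alerts firing")  # bare alias finds nothing
closed = team_a.close("abc123")
```

In this model both creates and the close succeed, while `updated` is `False` because no alert is stored under the bare alias.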

I did not find any workaround that makes both work when you have multiple teams in OpsGenie. Therefore: https://github.com/prometheus/alertmanager/pull/3640

I noticed this while trying to get OpsGenie looking like this: [image]

This also needs: https://github.com/prometheus/alertmanager/pull/3430

kbudde commented 2 months ago

@gotjosh can you have a look please? The current opsgenie integration is surprisingly bad. Therefore this PR and #3430