robusta-dev / robusta

Better Prometheus alerts for Kubernetes - smart grouping, AI enrichment, and automatic remediation
https://home.robusta.dev/
MIT License
2.62k stars, 260 forks

Opsgenie Integration does not handle alert creation and resolution properly #525

Closed prestonr83 closed 2 years ago

prestonr83 commented 2 years ago

Describe the bug: The alias in Opsgenie is used for deduplication and as the reference for closing an alert. However, the sink is configured to send only finding.title as the alias. https://github.com/robusta-dev/robusta/blob/5a6c73ddd3c0933df4258eef9c976a939051bb50/src/robusta/core/sinks/opsgenie/opsgenie_sink.py#L44

This causes two problems:

  1. Alerts from different clusters with the same title will be deduplicated and only a single alert is created in Opsgenie.
  2. Alerts are not closed on resolution; instead, an additional alert is created.

To Reproduce: For example, the following alert was fired from 2 different clusters. Because the alias for both alerts is simply 'Prometheus is failing rule evaluations.', Opsgenie deduplicates them, as denoted by the x2 under P1 and in the log as well. [images]

Again, because finding.title is used as the alias, when a resolved notice is sent the value of finding.title is [RESOLVED], which you can see in the following example. [image]

So instead of closing the original alert it creates a new alert.

Expected behavior: Alerts created from different clusters create a separate Opsgenie alert for each alert sent. Alerts that are resolved are closed in Opsgenie.

A possible fix for both of these issues would be to set the alias to clustername-alertname, make sure the same alias is sent for resolved notices, and route those to an additional function that uses the close API call: https://docs.opsgenie.com/docs/python-sdk-alert#close-alerts
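To make the proposal concrete, here is a rough sketch of the routing. It is not the actual sink code: Finding, FakeOpsgenie, build_alias and send are hypothetical names made up for illustration, and the fake client only records calls instead of talking to the Opsgenie SDK.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Finding:
    """Hypothetical stand-in for a Robusta finding; not the real class."""
    title: str
    alert_name: str
    resolved: bool = False


def build_alias(cluster_name: str, finding: Finding) -> str:
    # Use the stable alert name rather than the title, because the title
    # changes when the alert is resolved.
    return f"{cluster_name}-{finding.alert_name}"


@dataclass
class FakeOpsgenie:
    """Records calls in place of a real Opsgenie client (assumption)."""
    calls: List[Tuple[str, str]] = field(default_factory=list)

    def create_alert(self, alias: str, message: str) -> None:
        self.calls.append(("create", alias))

    def close_alert(self, alias: str) -> None:
        self.calls.append(("close", alias))


def send(client: FakeOpsgenie, cluster_name: str, finding: Finding) -> None:
    # The same alias is computed for firing and resolved findings, so a
    # resolved finding closes the alert that the firing one created.
    alias = build_alias(cluster_name, finding)
    if finding.resolved:
        client.close_alert(alias)
    else:
        client.create_alert(alias, finding.title)
```

With this shape, two clusters firing the same alert get two distinct aliases, and a resolved notice from one cluster closes only that cluster's alert. A real implementation would replace FakeOpsgenie with the create and close calls from the Opsgenie Python SDK linked above.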

aantn commented 2 years ago

Hi, thank you for reporting. We should definitely fix this.

In the PagerDuty sink we're using Finding.fingerprint, which is a hash on the alert name and its labels. I also opened a PR (#528) to fix this for Findings other than AlertManager alerts.
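As a rough illustration of the idea (not the actual Robusta implementation, whose hashing scheme may differ), a fingerprint over the alert name and its labels could look like the sketch below. The function name, the choice of SHA-256, and the presence of a cluster label are all assumptions.

```python
import hashlib
from typing import Mapping


def fingerprint(alert_name: str, labels: Mapping[str, str]) -> str:
    """Illustrative hash of an alert name plus its labels (hypothetical helper)."""
    # Sort the labels so the same label set always yields the same hash,
    # regardless of the order in which the labels arrive.
    parts = [alert_name] + [f"{key}={value}" for key, value in sorted(labels.items())]
    return hashlib.sha256("\n".join(parts).encode("utf-8")).hexdigest()
```

Because the hash does not depend on the title, a firing alert and its resolved notice share the same value, and alerts whose labels differ (for example by cluster, if such a label is present) get different values.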

Do you have any interest in attempting to fix this yourself and opening a PR?

You can see the code we use in the PagerDuty sink here:

https://github.com/robusta-dev/robusta/blob/master/src/robusta/core/sinks/pagerduty/pagerduty_sink.py#L105

We also have instructions on setting up your own development environment here:

https://docs.robusta.dev/master/developer-guide/platform/index.html

If you're not interested in working on this yourself, we'll still fix it! But the contribution would be warmly appreciated.

prestonr83 commented 2 years ago

I'll give it a shot if I have some time.

aantn commented 2 years ago

@prestonr83 Cool, our team is available on Slack if you have any questions about building Robusta and/or fixing this.

Happy to help in any way we can.