grafana / grafana

The open and composable observability and data visualization platform. Visualize metrics, logs, and traces from multiple sources like Prometheus, Loki, Elasticsearch, InfluxDB, Postgres and many more.
https://grafana.com
GNU Affero General Public License v3.0

Alerting: Notification Errors API doesn't work on HA #64732

Open gotjosh opened 1 year ago

gotjosh commented 1 year ago

The notification errors API is meant to communicate whether a notification fired successfully or not.

However, this is broken in HA (high availability) setups: traffic between Grafana instances is meant to be load balanced, and this information is stored only in each instance's memory.

As a result, users will see inconsistent results depending on which instance they hit.
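To make the failure mode concrete, here is a minimal sketch in Python (not Grafana's actual Go code). The `Instance` class, the round-robin load balancer, and the `"slack"` contact point are all hypothetical stand-ins, and it assumes only one instance attempted the notification.

```python
from itertools import cycle


class Instance:
    """Hypothetical stand-in for one Grafana instance in an HA setup."""

    def __init__(self, name):
        self.name = name
        # Last notification error per contact point, kept only in this process.
        self.last_error = {}

    def notify(self, contact_point, error=None):
        # Record the outcome of a notification attempt (None means success).
        self.last_error[contact_point] = error

    def get_last_error(self, contact_point):
        return self.last_error.get(contact_point)


instances = [Instance("grafana-a"), Instance("grafana-b")]
load_balancer = cycle(instances)  # naive round-robin, for illustration only

# Assume only grafana-a attempted the notification, and it failed.
instances[0].notify("slack", error="webhook returned 500")

# Two consecutive API reads land on different instances and disagree.
answers = [next(load_balancer).get_last_error("slack") for _ in range(2)]
print(answers)
```

The first read reports the error and the second reports nothing, even though nothing changed between the two requests.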

There are several options we can take here:

  1. We could propagate the notification state as part of the notification log so other instances can also pick it up.
  2. We could save the state to the database so that other instances can also read it.

Option 2 is much simpler, but option 1 lets us use the upstream semantics and increases the likelihood of landing this work upstream.
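As a rough illustration of the database option, the sketch below has every instance write and read notification state through one shared store. It is Python rather than Grafana's Go code; the `notification_state` table, the `SharedStateInstance` class, and the use of an in-memory SQLite database as a stand-in for Grafana's database are all assumptions made for the example.

```python
import sqlite3

# An in-memory SQLite database stands in for the database shared by all instances.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE notification_state ("
    " contact_point TEXT PRIMARY KEY,"
    " last_error TEXT)"
)


class SharedStateInstance:
    """Hypothetical Grafana instance that keeps no notification state in memory."""

    def __init__(self, name, db):
        self.name = name
        self.db = db

    def notify(self, contact_point, error=None):
        # Upsert the latest outcome (None means success) into the shared table.
        self.db.execute(
            "INSERT OR REPLACE INTO notification_state"
            " (contact_point, last_error) VALUES (?, ?)",
            (contact_point, error),
        )
        self.db.commit()

    def get_last_error(self, contact_point):
        row = self.db.execute(
            "SELECT last_error FROM notification_state WHERE contact_point = ?",
            (contact_point,),
        ).fetchone()
        return row[0] if row else None


a = SharedStateInstance("grafana-a", db)
b = SharedStateInstance("grafana-b", db)

a.notify("slack", error="webhook returned 500")

# Whichever instance serves the API request, the answer is the same.
seen_by_a = a.get_last_error("slack")
seen_by_b = b.get_last_error("slack")
print(seen_by_a, seen_by_b)
```

Option 1 would replace the shared table with state carried in the notification log that the instances already exchange, which this sketch does not attempt to model.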

grafanabot commented 1 year ago

@gotjosh please add one or more appropriate labels. Here are some tips:

grobinson-grafana commented 1 year ago

This morning I met with @gotjosh and @santihernandezc to discuss how this feature should work. We agreed that before discussing how to deliver this feature on a technical level, we will first rethink how the feature works, as the current behaviour has a couple of issues:

  1. If a contact point/integration is shared between multiple routes or used for multiple aggregation groups, the last error can be overwritten just milliseconds after it occurred. In a lot of cases, the user does not have time to open the UI and see the error.

  2. Just being able to see the last error doesn't give the full picture. How many attempts did it take to deliver the notification? Was it delivered at all?
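The first issue can be reproduced with a small sketch. This is Python, not Grafana's code; `LastErrorStore` and the `"pagerduty"` integration are hypothetical, and the only assumption is that a single "last error" slot is kept per integration.

```python
class LastErrorStore:
    """Hypothetical store that keeps one 'last error' slot per integration."""

    def __init__(self):
        self._last = {}

    def record(self, integration, error):
        # Each attempt overwrites whatever was recorded before (None means success).
        self._last[integration] = error

    def last_error(self, integration):
        return self._last.get(integration)


store = LastErrorStore()

# Aggregation group A fails to notify through the shared integration...
store.record("pagerduty", "timeout talking to PagerDuty")
# ...and moments later group B succeeds through the same integration.
store.record("pagerduty", None)

# By the time a user opens the UI, the failure for group A is gone.
visible = store.last_error("pagerduty")
print(visible)
```

The single slot also cannot say how many attempts group A made or whether its notification was eventually delivered, which is the second issue.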

Instead, we were thinking about a timeline of events. You can think of this as "snapshots" of an nflog, where the nflog also contains metadata about failed and retried notifications. To start with, this could be a table where entries are indexed by time.
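No design was agreed, so the following is only one possible shape for such a timeline, sketched in Python. The `NotificationEvent` fields, the `Timeline` class, and its `summary` method are all hypothetical and do not reflect the nflog format or any Grafana API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NotificationEvent:
    """One attempt to deliver a notification (all field names are assumptions)."""
    timestamp: float
    integration: str
    group_key: str
    attempt: int
    error: Optional[str]  # None means the attempt was delivered


class Timeline:
    """Append-only table of events; nothing is ever overwritten."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)

    def for_group(self, integration, group_key):
        # Return the events for one integration and aggregation group, oldest first.
        matching = [
            e for e in self._events
            if e.integration == integration and e.group_key == group_key
        ]
        return sorted(matching, key=lambda e: e.timestamp)

    def summary(self, integration, group_key):
        events = self.for_group(integration, group_key)
        return {
            "attempts": len(events),
            "delivered": any(e.error is None for e in events),
            "errors": [e.error for e in events if e.error is not None],
        }


timeline = Timeline()
timeline.append(NotificationEvent(1.0, "pagerduty", "group-a", 1, "timeout"))
timeline.append(NotificationEvent(1.5, "pagerduty", "group-b", 1, None))
timeline.append(NotificationEvent(2.0, "pagerduty", "group-a", 2, None))

summary_a = timeline.summary("pagerduty", "group-a")
print(summary_a)
```

Because entries are only appended, a success in one aggregation group no longer hides a failure in another, and the summary can say how many attempts were made and whether delivery eventually succeeded.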

We did not have time to reach consensus on a design, but instead agreed to think about it and keep making progress together.

grobinson-grafana commented 1 year ago

Related discussion from 2015: https://github.com/prometheus/alertmanager/issues/172

github-actions[bot] commented 1 week ago

This issue has been automatically marked as stale because it has not had activity in the last year. It will be closed in 30 days if no further activity occurs. Please feel free to leave a comment if you believe the issue is still relevant. Thank you for your contributions!