grafana / grafana

The open and composable observability and data visualization platform. Visualize metrics, logs, and traces from multiple sources like Prometheus, Loki, Elasticsearch, InfluxDB, Postgres and many more.
https://grafana.com
GNU Affero General Public License v3.0
60.95k stars 11.63k forks

[POC] Alerting: Update state manager to resolve series that is missing in result #87529

Closed yuri-tceretian closed 2 weeks ago

yuri-tceretian commented 3 weeks ago

What is this feature? This PR changes the alerting state manager to resolve a series that is in the Alerting state when that series is missing from the evaluation results, provided the rule meets certain criteria: it has a single query element, and that element is a Prometheus query.
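The criteria above can be sketched roughly as follows. This is a simplified Python illustration, not the Go code in this PR: `Rule`, `Query`, `datasource_type`, `is_eligible`, and `resolve_missing_series` are hypothetical names introduced for the example, and leaving non-Alerting states untouched is an assumption based on the description.

```python
from dataclasses import dataclass, field


@dataclass
class Query:
    # Hypothetical field: the type of data source the query targets.
    datasource_type: str


@dataclass
class Rule:
    # Hypothetical model of a rule: just its list of query elements.
    queries: list = field(default_factory=list)


def is_eligible(rule):
    """True when the rule has exactly one query element and it is a Prometheus query."""
    return len(rule.queries) == 1 and rule.queries[0].datasource_type == "prometheus"


def resolve_missing_series(rule, previous_states, result_series):
    """Return a copy of previous_states with missing Alerting series resolved.

    previous_states: dict mapping a series key to its state name.
    result_series:   set of series keys present in the current evaluation result.

    Series that are present in the result are left alone here; in a real
    state manager they would be updated by the normal evaluation path.
    """
    next_states = dict(previous_states)
    if not is_eligible(rule):
        return next_states
    for key, state in previous_states.items():
        # Assumption: only series in the Alerting state are resolved when missing.
        if state == "Alerting" and key not in result_series:
            next_states[key] = "Normal"
    return next_states
```

For a rule with one Prometheus query, a series that was Alerting and is absent from the current result moves to Normal. For any other rule shape, the states are returned unchanged.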

Why do we need this feature? The Prometheus query language supports filter operators. For example,

grafana_slo_sli_5m{} > 10

will return only metrics (aka dimensions) that have points with values greater than 10. This is a common situation in Prometheus alert rules. All results that the query returns are above the threshold and therefore should have either a pending or alerting state. When the result does not contain some metric that was seen during the previous evaluation, that is treated as a normal (resolved) state. For example,

Currently, in Grafana-managed alerts, this works a bit differently: metrics that are missing from an evaluation cycle are ignored until they expire (are marked as stale); see https://github.com/grafana/grafana/blob/27884dd36271c3218ae4950ceec3c85c9ef32ec0/pkg/services/ngalert/state/manager.go#L551-L553. The behavior in the example above will be the following:

Therefore, Grafana-managed alerts delay the resolution of the missing metric by 3 evaluation intervals. This can cause confusion, and it also makes it harder to migrate a Prometheus alert rule to a managed alert rule (it is possible, but it requires rewriting the query to use server-side expressions instead of filtering on the Prometheus side).
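To make the difference concrete, here is a small simulation of the two behaviors described above. It is an illustrative sketch only. The function names, the state labels, and the `stale_after` parameter are assumptions for the example (it defaults to 3 to mirror the 3 evaluation intervals mentioned above). The sketch ignores pending periods and does not reproduce Grafana's actual staleness logic.

```python
def prometheus_style(observations):
    """State per evaluation when a missing series counts as resolved immediately.

    observations: list of booleans, True when the filtered query returned
    the series at that evaluation.
    """
    return ["Alerting" if present else "Normal" for present in observations]


def managed_style(observations, stale_after=3):
    """State per evaluation when a missing series is ignored until it is stale.

    stale_after is a hypothetical knob: the number of consecutive evaluations
    a series must be missing before it is treated as resolved.
    """
    states = []
    state = "Normal"
    missed = 0
    for present in observations:
        if present:
            state = "Alerting"
            missed = 0
        elif state == "Alerting":
            # The series is missing: keep the old state until it goes stale.
            missed += 1
            if missed >= stale_after:
                state = "Normal"
        states.append(state)
    return states


observations = [True, True, False, False, False, False]
print(prometheus_style(observations))
print(managed_style(observations))
```

With these inputs, the first function reports Normal from the third evaluation onward, while the second stays Alerting through the fourth evaluation and only reports Normal at the fifth.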

Who is this feature for? SLO plugin and potentially users who want to migrate to Grafana managed alerts

Special notes for your reviewer:

Please check that:

yuri-tceretian commented 2 weeks ago

/deploy-to-hg

ephemeral-instances-bot[bot] commented 2 weeks ago