What is this feature?
This PR changes the alerting state manager to resolve an Alerting state when it is missing from the evaluation results of a rule that meets certain criteria: the rule has a single query element, and that query is a Prometheus query.
Why do we need this feature?
The Prometheus query language supports filter operators. For example,
grafana_slo_sli_5m{} > 10
will return only the metrics (aka dimensions) that have points with values greater than 10. This is a common pattern in Prometheus alert rules: every result the query returns is above the threshold and therefore should be in either a Pending or Alerting state, and when the result no longer contains a metric that was seen during the previous evaluation, that metric is treated as Normal (resolved).
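The effect of such a filter can be sketched in Go. This is an illustrative sketch only; the function and names are made up for this example and are not Grafana's code:

```go
package main

import "fmt"

// filterAbove mimics a PromQL comparison filter such as
// `grafana_slo_sli_5m{} > 10`: only series whose current value exceeds
// the threshold survive in the result; the rest vanish entirely.
func filterAbove(series map[string]float64, threshold float64) map[string]float64 {
	out := make(map[string]float64)
	for labels, value := range series {
		if value > threshold {
			out[labels] = value
		}
	}
	return out
}

func main() {
	series := map[string]float64{
		`{server="A"}`: 15, // above the threshold: stays in the result
		`{server="B"}`: 3,  // below the threshold: dropped entirely
	}
	fmt.Println(filterAbove(series, 10)) // map[{server="A"}:15]
}
```

Note that a series below the threshold is not returned with a "false" value; it simply disappears from the result set, which is why a missing series carries meaning for alerting.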
For example,
At time T1 the query returns 2 results (dimensions): {server=A} 1 and {server=B} 1. That result is converted to 2 Alerting states, and the notification service gets notified.
At time T2 the query returns only 1 result: {server=A} 1. This means the first metric is still alerting but the second one is resolved. The result is converted to {server=A} Alerting and {server=B} Normal, and the notification service gets updated about both states.
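The Prometheus-style resolution above can be sketched as a diff between the previous and current result sets. Again, this is an illustrative sketch under the assumption of a single Prometheus query, not Grafana's actual implementation:

```go
package main

import "fmt"

// resolveMissing returns the label sets that were firing in the previous
// evaluation but are absent from the current results. Under the behavior
// this PR introduces, these would be resolved immediately instead of
// waiting for the state to go stale.
func resolveMissing(previous, current []string) []string {
	seen := make(map[string]bool, len(current))
	for _, labels := range current {
		seen[labels] = true
	}
	var resolved []string
	for _, labels := range previous {
		if !seen[labels] {
			resolved = append(resolved, labels)
		}
	}
	return resolved
}

func main() {
	// T1: both series fire; T2: only server=A is returned.
	prev := []string{`{server="A"}`, `{server="B"}`}
	curr := []string{`{server="A"}`}
	fmt.Println(resolveMissing(prev, curr)) // [{server="B"}]
}
```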
Currently, in Grafana Managed Alerts this works a bit differently: metrics that are missing at an evaluation cycle are ignored until they expire and are marked as stale (see https://github.com/grafana/grafana/blob/27884dd36271c3218ae4950ceec3c85c9ef32ec0/pkg/services/ngalert/state/manager.go#L551-L553). The behavior in the example above will be the following:
T1 - {server=A} Alerting and {server=B} Alerting
T2 - {server=A} Alerting and {server=B} Alerting
T3 - {server=A} Alerting and {server=B} Alerting (assuming the result is still only {server=A} 1)
T4 - {server=A} Alerting and {server=B} Normal (Stale)
Therefore, Grafana managed alerts delay the resolution of a missing metric by 3 evaluation intervals. This can cause confusion, and it also makes it harder to migrate a Prometheus alert rule to a Grafana managed alert rule (it is possible, but requires rewriting the query to use server-side expressions instead of filtering on the Prometheus side).
Who is this feature for?
The SLO plugin, and potentially users who want to migrate their Prometheus alert rules to Grafana managed alerts.
Special notes for your reviewer:
Please check that:
- [ ] It works as expected from a user's perspective.
- [ ] If this is a pre-GA feature, it is behind a feature toggle.