What is this feature?
This PR changes the alerting state manager to resolve an Alerting state when it is missing from the evaluation results of a rule that meets certain criteria: the rule has a single query element, and that query is a Prometheus query.
Why do we need this feature?
The Prometheus query language supports filter operators. For example,
grafana_slo_sli_5m{} > 10
will return only the metrics (aka dimensions) that have points with values greater than 10. This is a common pattern in Prometheus alert rules: every result the query returns is above the threshold and therefore should be in either a Pending or Alerting state, and when the result no longer contains a metric that was seen during the previous evaluation, that metric is treated as Normal (resolved).
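The effect of such a filter can be sketched in Go. This is an illustrative sketch only; the function and names are made up for this example and are not Grafana's code:

```go
package main

import "fmt"

// filterAbove mimics a PromQL comparison filter such as
// `grafana_slo_sli_5m{} > 10`: only series whose current value exceeds
// the threshold survive in the result; the rest vanish entirely.
func filterAbove(series map[string]float64, threshold float64) map[string]float64 {
	out := make(map[string]float64)
	for labels, value := range series {
		if value > threshold {
			out[labels] = value
		}
	}
	return out
}

func main() {
	series := map[string]float64{
		`{server="A"}`: 15, // above the threshold: stays in the result
		`{server="B"}`: 3,  // below the threshold: dropped entirely
	}
	fmt.Println(filterAbove(series, 10)) // map[{server="A"}:15]
}
```

Note that a series below the threshold is not returned with a "false" value; it simply disappears from the result set, which is why a missing series carries meaning for alerting.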
For example,
At time T1 the query returns 2 results (dimensions): {server=A} 1 and {server=B} 1. That result is converted to 2 Alerting states, and the notification service gets notified.
At time T2 the query returns only 1 result: {server=A} 1. This means the first metric is still alerting but the second one is resolved. The result is converted to {server=A} Alerting and {server=B} Normal, and the notification service gets updated about both states.
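The Prometheus-style resolution above can be sketched as a diff between the previous and current result sets. Again, this is an illustrative sketch under the assumption of a single Prometheus query, not Grafana's actual implementation:

```go
package main

import "fmt"

// resolveMissing returns the label sets that were firing in the previous
// evaluation but are absent from the current results. Under the behavior
// this PR introduces, these would be resolved immediately instead of
// waiting for the state to go stale.
func resolveMissing(previous, current []string) []string {
	seen := make(map[string]bool, len(current))
	for _, labels := range current {
		seen[labels] = true
	}
	var resolved []string
	for _, labels := range previous {
		if !seen[labels] {
			resolved = append(resolved, labels)
		}
	}
	return resolved
}

func main() {
	// T1: both series fire; T2: only server=A is returned.
	prev := []string{`{server="A"}`, `{server="B"}`}
	curr := []string{`{server="A"}`}
	fmt.Println(resolveMissing(prev, curr)) // [{server="B"}]
}
```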
Currently, in Grafana Managed Alerts this works a bit differently: metrics that are missing at an evaluation cycle are ignored until they expire and are marked as stale (see https://github.com/grafana/grafana/blob/27884dd36271c3218ae4950ceec3c85c9ef32ec0/pkg/services/ngalert/state/manager.go#L551-L553). The behavior in the example above will be the following:
T1 - {server=A} Alerting and {server=B} Alerting
T2 - {server=A} Alerting and {server=B} Alerting
T3 - {server=A} Alerting and {server=B} Alerting (assuming the result is still only {server=A} 1)
T4 - {server=A} Alerting and {server=B} Normal (Stale)
Therefore, Grafana managed alerts delay the resolution of a missing metric by 3 evaluation intervals. This can cause confusion, and it also makes it harder to migrate a Prometheus alert rule to a Grafana managed alert rule (it is possible, but requires rewriting the query to use server-side expressions instead of filtering on the Prometheus side).
Who is this feature for?
The SLO plugin, and potentially users who want to migrate their Prometheus alert rules to Grafana managed alerts.
Special notes for your reviewer:
Please check that:
- [ ] It works as expected from a user's perspective.
- [ ] If this is a pre-GA feature, it is behind a feature toggle.