Epic: Alert Generator Compliance Test Suite

codesome commented 2 years ago

Based on the specification, here is the list of all the high-level cases that need to be covered by the test suite. In all cases, the contents of the alerts, APIs, and time series are checked for correctness.

[x] Presence of all the template variables and functions described in the specification (across all the rules, not all in a single rule).
  - Data
    - [x] $labels.something .Labels.something
    - [x] $value .Value
  - Queries
    - [x] query
    - [x] first
    - [x] label
    - [x] value
    - [x] sortByLabel
  - Numbers
    - [x] humanize
    - [x] humanize1024
    - [x] humanizeDuration
    - [x] humanizePercentage
    - [x] humanizeTimestamp
  - Strings
    - [x] title
    - [x] toUpper
    - [x] toLower
    - [x] stripPort
    - [x] match
    - [x] reReplaceAll
    - [x] parseDuration
  - Others
    - [x] args
  - Undocumented and/or not needed:
    - strvalue (undocumented, not needed)
    - pathPrefix (not needed, only in consoles, and also undocumented)
    - .ExternalLabels $externalLabels (not needed)
    - .ExternalURL $externalURL (not needed)
    - graphLink (not needed)
    - tableLink (not needed)
    - tmpl (not needed, only in consoles)
    - safeHtml (not needed, only in consoles)
[x] Alert that goes from pending->firing->inactive.
[x] Alert that goes from pending->inactive.
[x] Rule that never becomes active (i.e. never has alerts in pending or firing).
[x] Pending alerts with changing annotation values (checked via the API).
[x] Firing and inactive alerts being sent when they first enter those states.
[x] Firing alert being re-sent at expected intervals while the alert is active, with changing annotation contents.
[x] Inactive alert being re-sent at expected intervals up to a certain time and not after that.
[x] Alert that goes directly to the firing state (skipping the pending state) because the for duration is zero.
[x] Alert that becomes active again after having fired and gone inactive, for both zero and non-zero for durations. Two cases should be tested here: one where the inactive alert was still being sent (so sending it should stop), and one where the inactive alert was no longer being sent.
[x] Rule that produces new alerts that go from pending->firing->inactive while already having active alerts.
[x] When the for duration is non-zero and less than the evaluation interval, firing alert must be sent after the second evaluation of the rule and not before.
[x] A rule group having rules which are dependent on the ALERTS series from the rules above it in the same group.
[x] Template expansion in annotations uses only the labels from the query result as source data, even if those labels are overridden by the rule. It does not use the rule's additional labels.
[x] Alert goes inactive when there is no more data, both from firing and from pending.
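
The state transitions in the cases above can be sketched as a small per-evaluation state machine. This is a minimal illustration, not the test suite's or Prometheus's actual code: the `Alert` class, the `evaluate` function, and the use of plain seconds for time are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

INACTIVE, PENDING, FIRING = "inactive", "pending", "firing"


@dataclass
class Alert:
    """Hypothetical per-alert state, for illustration only."""
    state: str = INACTIVE
    active_at: Optional[float] = None  # time the condition was first seen true


def evaluate(alert: Alert, has_sample: bool, now: float, for_duration: float) -> str:
    """Advance one alert by a single rule evaluation and return its new state."""
    if not has_sample:
        # Condition false or series gone: the alert goes inactive from any state.
        alert.state, alert.active_at = INACTIVE, None
        return alert.state
    if alert.active_at is None:
        # First evaluation in which the condition holds.
        alert.active_at = now
        alert.state = PENDING
    if alert.state == PENDING and now - alert.active_at >= for_duration:
        # Hold time satisfied; with for_duration == 0 this happens immediately.
        alert.state = FIRING
    return alert.state
```

Under these assumptions, a zero for duration fires on the first evaluation, while a for duration shorter than the evaluation interval is pending on the first evaluation and firing on the second.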

All time comparisons will be done within an acceptable delta and need not be exact.
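
A delta-based time comparison could look like the sketch below. The helper name `within_delta` and the example tolerance are hypothetical; the issue does not say what delta the suite uses.

```python
def within_delta(actual: float, expected: float, delta: float) -> bool:
    """Return True if two timestamps (in seconds) differ by at most delta."""
    return abs(actual - expected) <= delta


# Example: an alert expected at t=120s that arrives at t=123s passes with a
# 5s tolerance but fails with a 1s tolerance (both tolerances are made up).
arrived_close_enough = within_delta(123.0, 120.0, 5.0)
arrived_too_late = not within_delta(123.0, 120.0, 1.0)
```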

codesome commented 2 years ago

cc @RichiH

RichiH commented 2 years ago

I tried to find corner cases and thought I had them a few times, but in the end couldn't find any; this seems like good coverage.

Conceptually, I think it would make sense to group the tests:

- Alerts moving through states: how far do they move until they are gone, and what are the conditions for moving to the next state (e.g. different for conditions, values changing, time series simply going away)
- Tests within specific states (e.g. change value)
- Tests that span more than a single state (e.g. annotations that constantly change while the state changes)

It might make sense to visualize a state machine to make it easier to reason about and verify that all possible states are covered.
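
One way to check that coverage mechanically is to list the expected transitions and compare them with the transitions seen in recorded state traces. This is only a sketch: the edge set in `EXPECTED` is read off the cases in the list at the top of this issue, and the function names are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

Transition = Tuple[str, str]

# Assumed edge set, taken from the cases listed in this issue.
EXPECTED: Set[Transition] = {
    ("inactive", "pending"),
    ("inactive", "firing"),   # zero for duration skips pending
    ("pending", "firing"),
    ("pending", "inactive"),
    ("firing", "inactive"),
}


def observed_transitions(trace: List[str]) -> Set[Transition]:
    """Collect the state changes in one alert's sequence of states."""
    return {(a, b) for a, b in zip(trace, trace[1:]) if a != b}


def uncovered(traces: Iterable[List[str]]) -> Set[Transition]:
    """Return the expected transitions that no trace exercised."""
    seen: Set[Transition] = set()
    for trace in traces:
        seen |= observed_transitions(trace)
    return EXPECTED - seen
```

An empty result from `uncovered` would mean every edge of the assumed state machine is exercised by at least one alert.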

codesome commented 2 years ago

Yup, I plan to have the fewest rules possible that can test all the above cases within a time bound.

codesome commented 2 years ago

Added another case:

> A rule group having rules which are dependent on the ALERTS series from the rules above it in the same group.

gotjosh commented 2 years ago

One extra case to test is that annotations can depend on user-defined labels. For example, define a rule with a custom label env={{ $labels.namespace }} and an annotation that uses this label, such as summary=the env should be {{ $labels.env }}

Assuming the alert had a namespace of eu-west-0, it should come out as alertname=myalert,namespace=eu-west-0,env=eu-west-0,summary=the env should be eu-west-0

codesome commented 2 years ago

> One extra case to test is that annotations can depend on user-defined labels.

I think that is not the case. I just tried it out with Prometheus and checked how we do it in the code: we only take the labels from the query result to expand the template. Even if we override the labels in the rule, the template expansion will still take the original label from the query.

Here is the example (notice instance and test in labels and its use in annotations):

[Image: Screenshot from 2021-11-25 15-27-22]

But thanks for bringing it up. We will need a test case to verify this behaviour, and we should also update the spec.
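
The behaviour described above can be illustrated with a toy expander. This is a rough sketch, not Prometheus's template engine: it handles only `{{ $labels.name }}` references, the names `expand` and `build_alert` are hypothetical, and expanding a missing label to an empty string is an assumption of the sketch.

```python
import re
from typing import Dict, Tuple

# Matches only the simple form {{ $labels.name }}; real templates support far more.
_LABEL_REF = re.compile(r"\{\{\s*\$labels\.(\w+)\s*\}\}")


def expand(template: str, query_labels: Dict[str, str]) -> str:
    """Expand label references using the query result's labels only."""
    # Assumption: a label absent from the query result expands to "".
    return _LABEL_REF.sub(lambda m: query_labels.get(m.group(1), ""), template)


def build_alert(
    query_labels: Dict[str, str],
    rule_labels: Dict[str, str],
    rule_annotations: Dict[str, str],
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Return (labels, annotations) for one element of the result vector."""
    labels = dict(query_labels)
    for name, tmpl in rule_labels.items():
        # Rule labels may override query labels in the final label set...
        labels[name] = expand(tmpl, query_labels)
    # ...but annotations are still expanded from the original query labels.
    annotations = {n: expand(t, query_labels) for n, t in rule_annotations.items()}
    return labels, annotations
```

With a query result carrying namespace=eu-west-0, a rule label env={{ $labels.namespace }} ends up in the final label set, but an annotation referencing {{ $labels.env }} does not see it, because env is not among the query result's labels.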

codesome commented 2 years ago

The spec has this:

> The labels and annotation templates from the alerting rule MUST be run for each of these alerts individually with label-value data for the template coming from the corresponding element from the result vector.

So it looks like we are all good.

codesome commented 2 years ago

:tada: All the cases and template variables should be covered at this point. I have excluded the template variables that are only used in template files.

There are still a few things remaining to make the test suite usable. I will create new issues for them.

prometheus / compliance

Epic: Alert Generator Compliance Test Suite #54