Inhibition rules evaluation performance on many target alerts

Nuckal777 commented 2 months ago

Context

Requesting /alerts from my organizations alertmanager returns a reply in 4-6 seconds. The alertmanager has consistently

between 10k-13k alerts and
86 inhibition rules.

Given this scale, I was a bit surprised that fetching the alerts takes multiple seconds. Taking a trace via pprof revealed that at least 4 seconds are spent calculating inhibited alerts. Furthermore, a few hundred milliseconds are spent serializing the json response, which is expected due to the size of ~20 MiB in our case. The structure of my organizations inhibition rules can lead to ~10k alerts matching a target matcher and ~1k alerts matching the source matcher. Currently the benchmarks only consider cases with a single alert satisfying a target matcher.

Calling /alerts results basically in the following algorithm being executed at the moment:

For each T in all alerts (calls into Mutes here):
- For each R in all inhibition rules:
  - Continue if T is not valid target
  - For each S in source alerts cached by R:
    - Check for label equality
    - Prevent two-sided match

Memoize source cache target match evaluations

The Prevent two-sided match step evaluates r.TargetMatchers.Matches(a.Labels), where a is from the cached source alerts. It's invoked over and over again for each alert that is a valid target (potentially evaluating regular expressions), but it is not dependent on a specific target alert. Assuming I did not miss a side effect, the result of r.TargetMatchers.Matches(a.Labels) is only dependent on a specific inhibition rule and that rule's source alert cache. By memoizing that expression, a performance gain can be achieved when many alerts are valid targets.

Constructing indices

Checking for label equality between the set of valid source alerts and valid target alerts runs in quadratic time. By maintaining an index per label, which maps that label's values to an "index into a slice of all alerts", alerts that have a label with a specific value can found by a map lookup. For multiple labels that need to be checked for equality, multiple slices of indices can be intersected in linear time. This would be a larger change though.

Nuckal777 commented 2 months ago

I implemented the memoization in #3933, which also includes a benchmark.

grobinson-grafana commented 2 months ago

Hi! 👋 I thought GET /alerts reads the data in the marker, rather than re-calculate inhibitions, so unless I've misunderstood the code, I'm surprised that GET /alerts is slow because of inhibitor.Mutes? Could you share a pprof perhaps?

Nuckal777 commented 2 months ago

Hey, alertFilter calls setAlertStatus, which is set via Update here. The anonymous function for setAlertStatus calculates inhibitions and silences.

grobinson-grafana commented 2 months ago

Ah. So weird – I don't understand why Alertmanager needs to re-calculate inhibitions and silences for GET requests.

Anyway, thank you for your contribution. I'll make sure to take a closer look later today or tomorrow.

Nuckal777 commented 2 months ago

Friendly reminder, that I would still appreciate some feedback. Likely this got lost in between other things.

prometheus / alertmanager