pyrra-dev / pyrra

Making SLOs with Prometheus manageable, accessible, and easy to use for everyone!
https://demo.pyrra.dev
Apache License 2.0

Proposal for Saturation SLO #964

Open ArthurSens opened 11 months ago

ArthurSens commented 11 months ago

For a few days now I've been wondering what the implementation of a Saturation SLO based on Prometheus metrics would look like. I've come up with a design idea, so I'm opening this issue to discuss it further with the community.

The main idea here is to reuse the BoolGauge SLO as much as possible.

API:

type SaturationIndicator struct {
    // Utilization is the metric that represents the current utilization of the monitored resource.
    Utilization Query `json:"utilization"`

    // Capacity is the metric that represents the capacity of the monitored resource.
    Capacity Query `json:"capacity"`

    // Threshold is the maximum utilization allowed of the monitored resource.
    // It should represent a percentage between Utilization and Capacity.
    // It should be a number between 0 and 1.
    Threshold float64 `json:"threshold"`

    // +optional
    // Grouping allows an SLO to be defined for many SLIs at once, like HTTP handlers for example.
    Grouping []string `json:"grouping"`
}
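
To make the constraints in the field comments concrete, here is a minimal sketch of how the indicator could be validated. The Validate method and the shape of the Query type (a single Metric string) are assumptions made for illustration; they are not part of the proposal or confirmed Pyrra API.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// Query is a stand-in for Pyrra's query type.
// Assumption: it carries a single PromQL metric selector.
type Query struct {
	Metric string `json:"metric"`
}

// SaturationIndicator repeats the struct from the proposal.
type SaturationIndicator struct {
	Utilization Query    `json:"utilization"`
	Capacity    Query    `json:"capacity"`
	Threshold   float64  `json:"threshold"`
	Grouping    []string `json:"grouping"`
}

// Validate is a hypothetical helper that enforces the rules stated in the
// field comments: both queries are set and Threshold lies between 0 and 1.
func (s SaturationIndicator) Validate() error {
	if s.Utilization.Metric == "" {
		return errors.New("utilization metric must be set")
	}
	if s.Capacity.Metric == "" {
		return errors.New("capacity metric must be set")
	}
	if math.IsNaN(s.Threshold) || s.Threshold < 0 || s.Threshold > 1 {
		return fmt.Errorf("threshold must be between 0 and 1, got %v", s.Threshold)
	}
	return nil
}

func main() {
	// The metric names below are examples only.
	indicator := SaturationIndicator{
		Utilization: Query{Metric: "node_filesystem_used_bytes"},
		Capacity:    Query{Metric: "node_filesystem_size_bytes"},
		Threshold:   0.8,
	}
	fmt.Println(indicator.Validate())
}
```

Rejecting out-of-range thresholds early would keep a typo such as 80 instead of 0.8 from silently producing an SLO that can never be violated.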

For the additional Prometheus rules, all we need to do is generate vector(1) when (Utilization / Capacity) > Threshold and vector(0) when (Utilization / Capacity) <= Threshold. From there, we can reuse the same Prometheus rules used for BoolGauge:

- record: example-saturation-bool
  expr: |
    (vector(1) AND (Utilization / Capacity) > Threshold)
    OR
    vector(0)

# The rules below are the same as for BoolGauge
- record: example-saturation-bool:count1w
  expr: sum (count_over_time(example-saturation-bool[1w]))

- record: example-saturation-bool:sum1w
  expr: sum (sum_over_time(example-saturation-bool[1w]))

- record: example-saturation-bool:burnrate1m
  expr: (sum (count_over_time(example-saturation-bool[1m])) - sum (sum_over_time(example-saturation-bool[1m]))) / sum (count_over_time(example-saturation-bool[1m]))
...

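
The arithmetic behind these rules can be sketched outside Prometheus. The snippet below is an illustration only: the Sample type and the function names are hypothetical, and it evaluates the expressions exactly as written above, where the sum counts the samples whose value is 1.

```go
package main

import "fmt"

// Sample is a hypothetical pair of readings taken at the same instant.
type Sample struct {
	Utilization float64
	Capacity    float64
}

// saturationBool mirrors the proposed recording rule: it returns 1 when
// Utilization / Capacity exceeds the threshold and 0 otherwise.
func saturationBool(s Sample, threshold float64) float64 {
	if s.Utilization/s.Capacity > threshold {
		return 1
	}
	return 0
}

// windowStats mirrors count_over_time and sum_over_time over one window
// of the 0/1 series.
func windowStats(samples []Sample, threshold float64) (count, sum float64) {
	for _, s := range samples {
		count++
		sum += saturationBool(s, threshold)
	}
	return count, sum
}

// burnRate evaluates (count - sum) / count, the expression used in the
// burnrate rule as written.
func burnRate(count, sum float64) float64 {
	return (count - sum) / count
}

func main() {
	samples := []Sample{{50, 100}, {70, 100}, {90, 100}, {95, 100}}
	count, sum := windowStats(samples, 0.8)
	fmt.Printf("count=%v sum=%v ratio=%v\n", count, sum, burnRate(count, sum))
	// prints "count=4 sum=2 ratio=0.5"
}
```

With a threshold of 0.8, two of the four samples exceed it, so the window has a count of 4 and a sum of 2.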
ArthurSens commented 11 months ago

@metalmatze, friendly ping! Would love to open a PR myself once we agree on a design :)

metalmatze commented 10 months ago

Sorry for the late reply. I was busy organizing PromCon, speaking at SRECon and afterward moving house.

The overall proposal looks good to me. I want to make sure to try this. If we can figure out the PromQL, the rest should fall into place.