slok / sloth

🦥 Easy and simple Prometheus SLO (service level objectives) generator
https://sloth.dev
Apache License 2.0

Best strategy to manage > 400 SLOs #453

Open nassim-kachroud opened 1 year ago

nassim-kachroud commented 1 year ago

Hi there! We will have to create a lot of SLOs (loading time, error rate, etc. for > 200 endpoints). Looking at what Sloth generates as rules, for example:

  - record: slo:sli_error:ratio_rate5m
    expr: |
      (sum(rate(http_request_duration_seconds_count{job="myservice",code=~"(5..|429)"}[5m])))
      /
      (sum(rate(http_request_duration_seconds_count{job="myservice"}[5m])))

  - record: slo:sli_error:ratio_rate30d
    expr: |
      sum_over_time(slo:sli_error:ratio_rate5m{sloth_id="myservice-requests-availability", sloth_service="myservice", sloth_slo="requests-availability"}[30d])
      / ignoring (sloth_window)
      count_over_time(slo:sli_error:ratio_rate5m{sloth_id="myservice-requests-availability", sloth_service="myservice", sloth_slo="requests-availability"}[30d])

I have a question regarding how to manage hundreds of records if we need to specify the service name and handler properties in each record. Is it correct, and is it worth it, to create a single record without specifying the job/service or any other property, like the following ones:

  - record: slo:sli_error:ratio_rate5m
    expr: |
      (sum(rate(http_request_duration_seconds_count{code=~"(5..|429)"}[5m])))
      /
      (sum(rate(http_request_duration_seconds_count{}[5m])))

  - record: slo:sli_error:ratio_rate30d
    expr: |
      sum_over_time(slo:sli_error:ratio_rate5m{}[30d])
      / ignoring (sloth_window)
      count_over_time(slo:sli_error:ratio_rate5m{}[30d])

Thanks a lot for your help!

klubi commented 1 year ago

I asked myself the same question a few days ago. I'm in a similar spot (20 microservices, at least 10 SLOs per service). Currently I'm doing a POC with Helm, where I wrap around Sloth so that for each service deployment I only have to specify a single variable (or two) and all of its SLOs get deployed. Having seen how this works for the few services in the POC, they are quite easy to manage when grouped like that.

nassim-kachroud commented 1 year ago

Thanks @klubi for your answer! Which records did you create, then? A single one to handle all the microservices, like above? Thanks for sharing.

klubi commented 1 year ago

In the end I'll most likely end up with both, but currently the main focus is on SLOs per microservice. The reasoning behind it is that different microservices can have different objectives for the same metric.

snippet from my chart:

sli:
    events:
    {{- if $maybeLogsErrorLevel.totalQuery }}
      totalQuery: {{ $maybeLogsErrorLevel.totalQuery }}
    {{- else if eq $tech "spring"}}
      totalQuery: sum(rate(logback_events_total{application="{{ .name }}", namespace="{{ $namespace }}"}[{{ "{{.window}}" }}]))
    {{- else if eq $tech "go"}}
      totalQuery: sum(rate(log_statements_total{application="{{ .name }}", namespace="{{ $namespace }}"}[{{ "{{.window}}" }}]))
    {{- end}}
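
For illustration, the per-service input to a wrapper chart like this can stay tiny. A hypothetical values.yaml sketch (the key names here are illustrative, not the actual chart schema):

# Hypothetical values.yaml for the wrapper chart (illustrative keys only)
namespace: my-namespace
services:
  - name: payments-api       # rendered into the application="..." selector
    tech: spring             # picks the logback_events_total query template
  - name: orders-worker
    tech: go                 # picks the log_statements_total query template

The chart then expands each entry into the full set of SLO definitions at deployment time.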

nassim-kachroud commented 1 year ago

I know what you mean; that works for the SLO configs, but I was talking more about the Prometheus records, i.e. not having to create one for each application/handler combination. Example:

  - record: slo:sli_error:ratio_rate5m
    expr: |
      (sum(rate(http_request_duration_seconds_count{code=~"(5..|429)"}[5m])))
      /
      (sum(rate(http_request_duration_seconds_count{}[5m])))

klubi commented 1 year ago

Oh, sorry, I guess I misunderstood your question.

Where do you deploy your rules? Into Kubernetes with an operator?

Either way, I think it's better to keep the records tied to services, mostly for maintainability reasons. If you deprecate a service, you just delete the whole rule group (or the single PrometheusRule manifest) and all of the service-related records are gone. Sure, the definitions are duplicated and could be simplified, but is it worth it?

nassim-kachroud commented 1 year ago

We are moving to Argo CD and will store these things in files. The thing is that we can end up with > 300 SLOs. When I look at the generated records we need to store for just one SLO, I wonder what it will look like with hundreds of SLOs...
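
For context, the input spec for that single SLO is only ~25 lines, roughly like this (trimmed and approximate):

version: "prometheus/v1"
service: "myservice"
labels:
  owner: "myteam"
  repo: "myorg/myservice"
  tier: "2"
slos:
  - name: "requests-availability"
    objective: 99.9
    sli:
      events:
        error_query: sum(rate(http_request_duration_seconds_count{job="myservice",code=~"(5..|429)"}[{{.window}}]))
        total_query: sum(rate(http_request_duration_seconds_count{job="myservice"}[{{.window}}]))
    alerting:
      name: MyServiceHighErrorRate
      labels:
        category: "availability"
      annotations:
        summary: High error rate on 'myservice' requests responses
      page_alert:
        labels:
          routing_key: myteam
          severity: pageteam
      ticket_alert:
        labels:
          severity: slack
          slack_channel: "#alerts-myteam"

And this is what Sloth expands it into: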

# Code generated by Sloth (v0.11.0): https://github.com/slok/sloth.
# DO NOT EDIT.

groups:
- name: sloth-slo-sli-recordings-myservice-requests-availability
  rules:
  - record: slo:sli_error:ratio_rate5m
    expr: |
      (sum(rate(http_request_duration_seconds_count{job="myservice",code=~"(5..|429)"}[5m])))
      /
      (sum(rate(http_request_duration_seconds_count{job="myservice"}[5m])))
    labels:
      owner: myteam
      repo: myorg/myservice
      sloth_id: myservice-requests-availability
      sloth_service: myservice
      sloth_slo: requests-availability
      sloth_window: 5m
      tier: "2"
  - record: slo:sli_error:ratio_rate30m
    expr: |
      (sum(rate(http_request_duration_seconds_count{job="myservice",code=~"(5..|429)"}[30m])))
      /
      (sum(rate(http_request_duration_seconds_count{job="myservice"}[30m])))
    labels:
      owner: myteam
      repo: myorg/myservice
      sloth_id: myservice-requests-availability
      sloth_service: myservice
      sloth_slo: requests-availability
      sloth_window: 30m
      tier: "2"
  - record: slo:sli_error:ratio_rate1h
    expr: |
      (sum(rate(http_request_duration_seconds_count{job="myservice",code=~"(5..|429)"}[1h])))
      /
      (sum(rate(http_request_duration_seconds_count{job="myservice"}[1h])))
    labels:
      owner: myteam
      repo: myorg/myservice
      sloth_id: myservice-requests-availability
      sloth_service: myservice
      sloth_slo: requests-availability
      sloth_window: 1h
      tier: "2"
  - record: slo:sli_error:ratio_rate2h
    expr: |
      (sum(rate(http_request_duration_seconds_count{job="myservice",code=~"(5..|429)"}[2h])))
      /
      (sum(rate(http_request_duration_seconds_count{job="myservice"}[2h])))
    labels:
      owner: myteam
      repo: myorg/myservice
      sloth_id: myservice-requests-availability
      sloth_service: myservice
      sloth_slo: requests-availability
      sloth_window: 2h
      tier: "2"
  - record: slo:sli_error:ratio_rate6h
    expr: |
      (sum(rate(http_request_duration_seconds_count{job="myservice",code=~"(5..|429)"}[6h])))
      /
      (sum(rate(http_request_duration_seconds_count{job="myservice"}[6h])))
    labels:
      owner: myteam
      repo: myorg/myservice
      sloth_id: myservice-requests-availability
      sloth_service: myservice
      sloth_slo: requests-availability
      sloth_window: 6h
      tier: "2"
  - record: slo:sli_error:ratio_rate1d
    expr: |
      (sum(rate(http_request_duration_seconds_count{job="myservice",code=~"(5..|429)"}[1d])))
      /
      (sum(rate(http_request_duration_seconds_count{job="myservice"}[1d])))
    labels:
      owner: myteam
      repo: myorg/myservice
      sloth_id: myservice-requests-availability
      sloth_service: myservice
      sloth_slo: requests-availability
      sloth_window: 1d
      tier: "2"
  - record: slo:sli_error:ratio_rate3d
    expr: |
      (sum(rate(http_request_duration_seconds_count{job="myservice",code=~"(5..|429)"}[3d])))
      /
      (sum(rate(http_request_duration_seconds_count{job="myservice"}[3d])))
    labels:
      owner: myteam
      repo: myorg/myservice
      sloth_id: myservice-requests-availability
      sloth_service: myservice
      sloth_slo: requests-availability
      sloth_window: 3d
      tier: "2"
  - record: slo:sli_error:ratio_rate30d
    expr: |
      sum_over_time(slo:sli_error:ratio_rate5m{sloth_id="myservice-requests-availability", sloth_service="myservice", sloth_slo="requests-availability"}[30d])
      / ignoring (sloth_window)
      count_over_time(slo:sli_error:ratio_rate5m{sloth_id="myservice-requests-availability", sloth_service="myservice", sloth_slo="requests-availability"}[30d])
    labels:
      owner: myteam
      repo: myorg/myservice
      sloth_id: myservice-requests-availability
      sloth_service: myservice
      sloth_slo: requests-availability
      sloth_window: 30d
      tier: "2"
- name: sloth-slo-meta-recordings-myservice-requests-availability
  rules:
  - record: slo:objective:ratio
    expr: vector(0.9990000000000001)
    labels:
      owner: myteam
      repo: myorg/myservice
      sloth_id: myservice-requests-availability
      sloth_service: myservice
      sloth_slo: requests-availability
      tier: "2"
  - record: slo:error_budget:ratio
    expr: vector(1-0.9990000000000001)
    labels:
      owner: myteam
      repo: myorg/myservice
      sloth_id: myservice-requests-availability
      sloth_service: myservice
      sloth_slo: requests-availability
      tier: "2"
  - record: slo:time_period:days
    expr: vector(30)
    labels:
      owner: myteam
      repo: myorg/myservice
      sloth_id: myservice-requests-availability
      sloth_service: myservice
      sloth_slo: requests-availability
      tier: "2"
  - record: slo:current_burn_rate:ratio
    expr: |
      slo:sli_error:ratio_rate5m{sloth_id="myservice-requests-availability", sloth_service="myservice", sloth_slo="requests-availability"}
      / on(sloth_id, sloth_slo, sloth_service) group_left
      slo:error_budget:ratio{sloth_id="myservice-requests-availability", sloth_service="myservice", sloth_slo="requests-availability"}
    labels:
      owner: myteam
      repo: myorg/myservice
      sloth_id: myservice-requests-availability
      sloth_service: myservice
      sloth_slo: requests-availability
      tier: "2"
  - record: slo:period_burn_rate:ratio
    expr: |
      slo:sli_error:ratio_rate30d{sloth_id="myservice-requests-availability", sloth_service="myservice", sloth_slo="requests-availability"}
      / on(sloth_id, sloth_slo, sloth_service) group_left
      slo:error_budget:ratio{sloth_id="myservice-requests-availability", sloth_service="myservice", sloth_slo="requests-availability"}
    labels:
      owner: myteam
      repo: myorg/myservice
      sloth_id: myservice-requests-availability
      sloth_service: myservice
      sloth_slo: requests-availability
      tier: "2"
  - record: slo:period_error_budget_remaining:ratio
    expr: 1 - slo:period_burn_rate:ratio{sloth_id="myservice-requests-availability",
      sloth_service="myservice", sloth_slo="requests-availability"}
    labels:
      owner: myteam
      repo: myorg/myservice
      sloth_id: myservice-requests-availability
      sloth_service: myservice
      sloth_slo: requests-availability
      tier: "2"
  - record: sloth_slo_info
    expr: vector(1)
    labels:
      owner: myteam
      repo: myorg/myservice
      sloth_id: myservice-requests-availability
      sloth_mode: cli-gen-prom
      sloth_objective: "99.9"
      sloth_service: myservice
      sloth_slo: requests-availability
      sloth_spec: prometheus/v1
      sloth_version: v0.11.0
      tier: "2"
- name: sloth-slo-alerts-myservice-requests-availability
  rules:
  - alert: MyServiceHighErrorRate
    expr: |
      (
          max(slo:sli_error:ratio_rate5m{sloth_id="myservice-requests-availability", sloth_service="myservice", sloth_slo="requests-availability"} > (14.4 * 0.0009999999999999432)) without (sloth_window)
          and
          max(slo:sli_error:ratio_rate1h{sloth_id="myservice-requests-availability", sloth_service="myservice", sloth_slo="requests-availability"} > (14.4 * 0.0009999999999999432)) without (sloth_window)
      )
      or
      (
          max(slo:sli_error:ratio_rate30m{sloth_id="myservice-requests-availability", sloth_service="myservice", sloth_slo="requests-availability"} > (6 * 0.0009999999999999432)) without (sloth_window)
          and
          max(slo:sli_error:ratio_rate6h{sloth_id="myservice-requests-availability", sloth_service="myservice", sloth_slo="requests-availability"} > (6 * 0.0009999999999999432)) without (sloth_window)
      )
    labels:
      category: availability
      routing_key: myteam
      severity: pageteam
      sloth_severity: page
    annotations:
      summary: High error rate on 'myservice' requests responses
      title: (page) {{$labels.sloth_service}} {{$labels.sloth_slo}} SLO error budget
        burn rate is too fast.
  - alert: MyServiceHighErrorRate
    expr: |
      (
          max(slo:sli_error:ratio_rate2h{sloth_id="myservice-requests-availability", sloth_service="myservice", sloth_slo="requests-availability"} > (3 * 0.0009999999999999432)) without (sloth_window)
          and
          max(slo:sli_error:ratio_rate1d{sloth_id="myservice-requests-availability", sloth_service="myservice", sloth_slo="requests-availability"} > (3 * 0.0009999999999999432)) without (sloth_window)
      )
      or
      (
          max(slo:sli_error:ratio_rate6h{sloth_id="myservice-requests-availability", sloth_service="myservice", sloth_slo="requests-availability"} > (1 * 0.0009999999999999432)) without (sloth_window)
          and
          max(slo:sli_error:ratio_rate3d{sloth_id="myservice-requests-availability", sloth_service="myservice", sloth_slo="requests-availability"} > (1 * 0.0009999999999999432)) without (sloth_window)
      )
    labels:
      category: availability
      severity: slack
      slack_channel: '#alerts-myteam'
      sloth_severity: ticket
    annotations:
      summary: High error rate on 'myservice' requests responses
      title: (ticket) {{$labels.sloth_service}} {{$labels.sloth_slo}} SLO error budget
        burn rate is too fast.

klubi commented 1 year ago

So... if you can, I'd suggest using operators, two of them: the Sloth operator and the Prometheus operator. Then you don't have to store the long Prometheus rule files, just their definitions, and both operators make them easier to manage and deploy. As a POC I created 3 metrics and used them with 24 services. The file with the PrometheusServiceLevel manifests is 1.3k lines long; if I were to generate PrometheusRules from it, that would be 11.5k lines... That's why I don't store either of them; I just configure the Helm chart at deployment time, which takes ~30 lines.
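
To give an idea, with the operators what you store are PrometheusServiceLevel resources, roughly like this trimmed sketch (check the Sloth Kubernetes getting-started docs for the exact schema; the name and namespace below are placeholders), and the Sloth operator turns each one into a PrometheusRule that the Prometheus operator picks up:

apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: myservice-slos        # placeholder name
  namespace: monitoring       # placeholder namespace
spec:
  service: "myservice"
  labels:
    owner: "myteam"
  slos:
    - name: "requests-availability"
      objective: 99.9
      sli:
        events:
          errorQuery: sum(rate(http_request_duration_seconds_count{job="myservice",code=~"(5..|429)"}[{{.window}}]))
          totalQuery: sum(rate(http_request_duration_seconds_count{job="myservice"}[{{.window}}]))
      alerting:
        name: MyServiceHighErrorRate
        pageAlert:
          labels:
            severity: pageteam
        ticketAlert:
          labels:
            severity: slack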

NickAdolf commented 1 year ago

I'm honestly just concerned about the number of alert rule groups/namespaces this will generate. I'm on an automated path, where file maintenance isn't an issue, but the "damage" in the UI is a bit nasty. Is there a way to have Sloth generate all the rules into a single group for a given file?

Ideally I'd have a "sloth-slo" namespace with a group for each Sloth file I generate. Is this possible without manipulating the generated Prometheus file?