google / slo-generator

SLO Generator computes SLIs, SLOs, Error Budgets and Burn Rates from supported backends, then exports an SLO report to supported targets.
Apache License 2.0

SLI values sometimes incorrect with Google Cloud Monitoring backend (MQF) #345

Open svenmueller opened 1 year ago

svenmueller commented 1 year ago

SLO Generator Version

v2.3.4

Python Version

3.9

What happened?

When using the Google Cloud Monitoring backend, we sometimes (every other hour) see incorrect SLI and error budget burn rate metrics for a short time. The values are wrong because, for example, they imply "bad" events when there were none. After a few minutes, the calculated metrics return to the expected, correct values. This happens for different sliding windows, such as 1h, 12h, 7d or 28d. For example, there can be a sudden peak in the error budget burn rate for one sliding window (e.g. "28 days") while the other sliding windows are unaffected and show correct values.

Example SLO configuration

apiVersion: sre.google.com/v2
kind: ServiceLevelObjective
metadata:
  name: projects-inventory-query-availability
  labels:
    service_name: projects
    feature_name: inventory-query
    slo_name: availability
    team: xyz
spec:
  description: 95% of inventory query API HTTP responses are successful
  backend: cloud_monitoring
  method: good_bad_ratio
  service_level_indicator:
    filter_good: >
      project=${PROJECT_ID}
      metric.type="logging.googleapis.com/user/inventory_query_requests_total"
      metric.labels.http_status = 200
    filter_valid: >
      project=${PROJECT_ID}
      metric.type="logging.googleapis.com/user/inventory_query_requests_total"
      ( metric.labels.http_status = 200 OR
        metric.labels.http_status = 500 OR
        metric.labels.http_status = 501 OR
        metric.labels.http_status = 502 OR
        metric.labels.http_status = 503 OR
        metric.labels.http_status = 504 OR
        metric.labels.http_status = 505 OR
        metric.labels.http_status = 506 OR
        metric.labels.http_status = 507 OR
        metric.labels.http_status = 508 OR
        metric.labels.http_status = 509 OR
        metric.labels.http_status = 510 OR
        metric.labels.http_status = 511 )
  goal: 0.95
  frequency: "* * * * *"
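
For readers unfamiliar with the `good_bad_ratio` method: the configuration above supplies `filter_good` and `filter_valid` but no filter for bad events, so the bad count presumably has to be derived from the other two. The sketch below assumes it is computed as valid minus good. The function name is hypothetical and not part of slo-generator, and the counts are the ones that appear in the log output of this report.

```python
def count_bad_events(good_count: int, valid_count: int) -> int:
    """Derive bad events as valid minus good.

    Assumption: this mirrors how a good/valid filter pair is turned into a
    bad count. It is an illustration, not slo-generator's actual code.
    """
    if good_count > valid_count:
        raise ValueError("good events cannot exceed valid events")
    return valid_count - good_count


# Both queries return the same count: no bad events.
print(count_bad_events(good_count=979, valid_count=979))  # 0

# The "good" query returns fewer events than the "valid" query:
# the difference is reported as bad events.
print(count_bad_events(good_count=510, valid_count=979))  # 469
```

Under this assumption, a temporarily incomplete result for the `filter_good` query alone would be enough to produce "bad" events that never happened.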

What did you expect?

Correct SLI/error budget rate values when there are only "good" events.

Screenshots

Screenshot 2023-08-15 at 14 22 19

Relevant log output

2023-08-14 15:28:29.414 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:28:29.148 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:28:28.093 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:28:27.841 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:28:25.684 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:28:23.380 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:28:12.086 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:28:08.331 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 52.094 % | SLO: 95.0 % | Gap: -42.91% | BR: 9.6 / 1.0 | Alert: 1 | Good: 510 | Bad: 469
2023-08-14 15:28:01.790 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 52.094 % | SLO: 95.0 % | Gap: -42.91% | BR: 9.6 / 1.0 | Alert: 1 | Good: 510 | Bad: 469
2023-08-14 15:27:55.168 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 52.094 % | SLO: 95.0 % | Gap: -42.91% | BR: 9.6 / 1.0 | Alert: 1 | Good: 510 | Bad: 469
2023-08-14 15:27:52.479 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:27:47.765 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 52.094 % | SLO: 95.0 % | Gap: -42.91% | BR: 9.6 / 1.0 | Alert: 1 | Good: 510 | Bad: 469
2023-08-14 15:27:38.083 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 52.094 % | SLO: 95.0 % | Gap: -42.91% | BR: 9.6 / 1.0 | Alert: 1 | Good: 510 | Bad: 469
2023-08-14 15:27:36.766 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:27:19.565 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:25:55.593 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:24:48.714 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:24:44.925 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:24:44.663 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:24:44.216 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:24:38.536 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:24:37.906 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:24:37.842 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:24:37.840 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:24:37.164 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:24:33.929 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:24:31.986 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0

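Quite noteworthy: in the incorrect readings, Good (510) + Bad (469) adds up to 979, which is exactly the Good count of the correct readings.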
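
The figures in the log lines above can be reproduced from the Good and Bad counts alone. The following is a minimal sketch using the usual definitions (SLI as good over valid events, burn rate as observed error rate over the error budget). The function name and the rounding are assumptions chosen to match the log output, not slo-generator's own code.

```python
def sli_report(good: int, bad: int, slo_target: float) -> dict:
    """Compute SLI, gap and burn rate from event counts (hypothetical helper)."""
    valid = good + bad
    if valid == 0:
        raise ValueError("no valid events in the window")
    sli = good / valid
    error_budget = 1 - slo_target           # allowed error rate, 0.05 for a 95% SLO
    burn_rate = (1 - sli) / error_budget    # observed error rate relative to the budget
    return {
        "sli_pct": round(sli * 100, 3),
        "gap_pct": round((sli - slo_target) * 100, 2),
        "burn_rate": round(burn_rate, 1),
    }


# Correct reading from the log: Good: 979 | Bad: 0
print(sli_report(979, 0, 0.95))

# Incorrect reading from the log: Good: 510 | Bad: 469
print(sli_report(510, 469, 0.95))
```

With these definitions, the second call yields an SLI of about 52.094 %, a gap of about -42.91 % and a burn rate of about 9.6, matching the incorrect log lines.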

lvaylet commented 1 year ago

Hi @svenmueller, thanks for reporting this. Apologies for the late reply. I was on vacation and off the grid.

Just like you, I was immediately surprised by the 510 + 469 == 979 coincidence upon seeing the screenshot for the first time. Any chance you could enable debug mode so we get more details about what's going on under the hood? For example by temporarily setting the DEBUG environment variable to 1?
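
A minimal sketch of one way to set that variable when launching the tool from a Python wrapper. The `debug_env` helper is hypothetical, and the command line and file names (`slo.yaml`, `config.yaml`) are assumptions that need to be adapted to the actual deployment.

```python
import os
import subprocess
from typing import Mapping, Optional


def debug_env(base: Optional[Mapping[str, str]] = None) -> dict:
    """Return a copy of the environment with DEBUG set to "1" (hypothetical helper)."""
    env = dict(os.environ if base is None else base)
    env["DEBUG"] = "1"
    return env


def run_with_debug(command: list) -> None:
    """Run the given command with DEBUG=1 in its environment."""
    subprocess.run(command, env=debug_env(), check=True)


# Assumed invocation and file names; adjust to your setup before calling
# run_with_debug(command).
command = ["slo-generator", "compute", "-f", "slo.yaml", "-c", "config.yaml"]
```

In a containerized deployment, setting `DEBUG=1` in the container's environment would have the same effect without any wrapper.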