SLI values sometimes incorrect with Google Cloud Monitoring backend (MQF)

SLO Generator Version

v2.3.4

Python Version

3.9

What happened?

When using the Google Cloud Monitoring backend, we sometimes (roughly every other hour) see wrong SLI and error budget burn rate metrics being calculated for a short time. The values cannot be correct because, for example, there are no "bad" events at all. After a few minutes, the calculated metrics return to the expected, correct numbers. We see this happen for different sliding windows, such as 1h, 12h, 7d or 28d. For example, the error budget burn rate shows a sudden peak for one of the sliding windows (e.g. "28 days") while the other sliding windows are unaffected and show correct values.

Example SLO configuration

apiVersion: sre.google.com/v2
kind: ServiceLevelObjective
metadata:
  name: projects-inventory-query-availability
  labels:
    service_name: projects
    feature_name: inventory-query
    slo_name: availability
    team: xyz
spec:
  description: 95% of inventory query API HTTP responses are successful
  backend: cloud_monitoring
  method: good_bad_ratio
  service_level_indicator:
    filter_good: >
      project=${PROJECT_ID}
      metric.type="logging.googleapis.com/user/inventory_query_requests_total"
      metric.labels.http_status = 200
    filter_valid: >
      project=${PROJECT_ID}
      metric.type="logging.googleapis.com/user/inventory_query_requests_total"
      ( metric.labels.http_status = 200 OR
        metric.labels.http_status = 500 OR
        metric.labels.http_status = 501 OR
        metric.labels.http_status = 502 OR
        metric.labels.http_status = 503 OR
        metric.labels.http_status = 504 OR
        metric.labels.http_status = 505 OR
        metric.labels.http_status = 506 OR
        metric.labels.http_status = 507 OR
        metric.labels.http_status = 508 OR
        metric.labels.http_status = 509 OR
        metric.labels.http_status = 510 OR
        metric.labels.http_status = 511 )
  goal: 0.95
  frequency: "* * * * *"
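
For context, here is a minimal sketch of how a good/valid filter pair like the one above can turn into a "bad" count. It rests on two assumptions that should be checked against the backend code: that with `filter_valid` the bad count is derived as valid minus good rather than queried directly, and that the two filters are fetched by separate queries. `fetch_count` and the response dictionaries are hypothetical stand-ins, not the slo-generator or Cloud Monitoring API.

```python
from typing import Callable, Tuple


def good_bad_from_valid(
    fetch_count: Callable[[str], int],
    filter_good: str,
    filter_valid: str,
) -> Tuple[int, int]:
    """Derive (good, bad) event counts from a good filter and a valid filter."""
    good = fetch_count(filter_good)    # first query
    valid = fetch_count(filter_valid)  # second, independent query
    bad = valid - good                 # assumption: bad is never queried directly
    return good, bad


# Hypothetical query results: both queries agree ...
consistent = {"good": 979, "valid": 979}
# ... versus the good query momentarily returning only part of the data.
partial = {"good": 510, "valid": 979}

print(good_bad_from_valid(consistent.__getitem__, "good", "valid"))
print(good_bad_from_valid(partial.__getitem__, "good", "valid"))
```

Under these assumptions, an incomplete result for the good query alone would keep the total unchanged while producing "bad" events that never occurred. This is only one possibility worth ruling out, not a confirmed cause.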

What did you expect?

Correct SLI and error budget burn rate values when there are only "good" events.

Relevant log output

2023-08-14 15:28:29.414 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:28:29.148 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:28:28.093 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:28:27.841 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:28:25.684 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:28:23.380 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:28:12.086 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:28:08.331 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 52.094 % | SLO: 95.0 % | Gap: -42.91% | BR: 9.6 / 1.0 | Alert: 1 | Good: 510 | Bad: 469
2023-08-14 15:28:01.790 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 52.094 % | SLO: 95.0 % | Gap: -42.91% | BR: 9.6 / 1.0 | Alert: 1 | Good: 510 | Bad: 469
2023-08-14 15:27:55.168 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 52.094 % | SLO: 95.0 % | Gap: -42.91% | BR: 9.6 / 1.0 | Alert: 1 | Good: 510 | Bad: 469
2023-08-14 15:27:52.479 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:27:47.765 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 52.094 % | SLO: 95.0 % | Gap: -42.91% | BR: 9.6 / 1.0 | Alert: 1 | Good: 510 | Bad: 469
2023-08-14 15:27:38.083 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 52.094 % | SLO: 95.0 % | Gap: -42.91% | BR: 9.6 / 1.0 | Alert: 1 | Good: 510 | Bad: 469
2023-08-14 15:27:36.766 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:27:19.565 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:25:55.593 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:24:48.714 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:24:44.925 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:24:44.663 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:24:44.216 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:24:38.536 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:24:37.906 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:24:37.842 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:24:37.840 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:24:37.164 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:24:33.929 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:24:31.986 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
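
To spot such episodes in a longer log export, a small script along the following lines may help. It is a sketch that assumes the report line format shown above; `find_conflicting_splits` is a hypothetical helper, not part of slo-generator. It groups report lines by SLO name, window and total event count, and returns the groups where the same total was reported with different good/bad splits.

```python
import re
from collections import defaultdict

# Matches the report lines shown above; timestamp lines do not match.
LINE_RE = re.compile(
    r"INFO - (?P<name>\S+) \| (?P<window>[^|]+?) \| .*"
    r"Good: (?P<good>\d+) \| Bad: (?P<bad>\d+)"
)


def find_conflicting_splits(lines):
    """Return {(name, window, total): [(good, bad), ...]} for totals
    that were reported with more than one good/bad split."""
    splits = defaultdict(set)
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # skip timestamps and unrelated lines
        good, bad = int(match["good"]), int(match["bad"])
        key = (match["name"], match["window"], good + bad)
        splits[key].add((good, bad))
    return {key: sorted(seen) for key, seen in splits.items() if len(seen) > 1}


sample = [
    "2023-08-14 15:28:12.086 CEST",
    "INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | "
    "SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0",
    "2023-08-14 15:28:08.331 CEST",
    "INFO - projects-inventory-query-availability | 28 days | SLI: 52.094 % | "
    "SLO: 95.0 % | Gap: -42.91% | BR: 9.6 / 1.0 | Alert: 1 | Good: 510 | Bad: 469",
]

for key, seen in find_conflicting_splits(sample).items():
    print(key, seen)
```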

Noteworthy:

correct: Good: 979 | Bad: 0
wrong: Good: 510 | Bad: 469

Interestingly, both cases sum to the same total of 979 events. But why are 469 "bad" events reported that do not exist in reality, and why do the numbers return to the correct values after a few minutes?
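
The logged SLI, gap and burn rate values follow directly from these counts, so the reporting arithmetic itself looks consistent and only the good/bad split is off. A short check, assuming the usual definitions SLI = good / (good + bad) and burn rate = (1 - SLI) / (1 - SLO); the function names are illustrative only:

```python
def sli(good: int, bad: int) -> float:
    """Fraction of good events among all valid events."""
    return good / (good + bad)


def burn_rate(sli_value: float, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    return (1 - sli_value) / (1 - slo_target)


SLO_TARGET = 0.95  # "goal" in the configuration above

for good, bad in [(979, 0), (510, 469)]:
    value = sli(good, bad)
    print(
        f"Good: {good} | Bad: {bad} | "
        f"SLI: {round(value * 100, 3)} % | "
        f"Gap: {round((value - SLO_TARGET) * 100, 2)} % | "
        f"BR: {round(burn_rate(value, SLO_TARGET), 1)}"
    )
```

For the wrong split this reproduces the logged SLI of 52.094 %, the gap of -42.91 % and the burn rate of 9.6.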
