google / slo-generator

SLO Generator computes SLIs, SLOs, Error Budgets and Burn Rates from supported backends, then exports an SLO report to supported targets.
Apache License 2.0

🐛 [BUG] - SLO Generator Cloud Run service in test project crashes continuously #361

Open lvaylet opened 1 year ago

lvaylet commented 1 year ago

SLO Generator Version

v2.5.1

Python Version

3.9

What happened?

While designing end-to-end tests in #360, I discovered that the Cloud Run service deployed when a new version is released did not respond to any query. With no Availability SLO, uptime check or alerting in place, I was not notified before these manual tests.

According to the logs, the issue has been going on for at least 30 days (the default maximum retention period for logs). I was not able to trace the exact source of the error, but it is definitely caused by one of the Cloud Scheduler Jobs used for simulating traffic. After I paused all the Cloud Scheduler Jobs, the Cloud Run service restarted successfully and stayed up.

I managed to troubleshoot and fix the configuration file as well as some of the SLO definitions. I uploaded these files to the Cloud Storage bucket used by the Cloud Scheduler Jobs, and re-enabled each job one by one.
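For anyone repeating the pause-then-re-enable procedure described above, here is a minimal sketch of the loop, not the exact steps I ran. The function name and the `resume`, `pause` and `is_healthy` callables are hypothetical placeholders. In practice they could wrap `gcloud scheduler jobs resume` / `gcloud scheduler jobs pause` and an HTTP request to the Cloud Run service, and `is_healthy` would need to wait long enough for the job to fire at least once.

```python
from typing import Callable, Iterable, List


def find_failing_jobs(
    jobs: Iterable[str],
    resume: Callable[[str], None],
    pause: Callable[[str], None],
    is_healthy: Callable[[], bool],
) -> List[str]:
    """Re-enable paused jobs one at a time and collect those that break the service.

    Assumes all jobs start paused and the service is healthy at the start.
    Also assumes the service recovers once an offending job is paused again.
    """
    failing: List[str] = []
    for job in jobs:
        resume(job)
        if not is_healthy():
            # This job breaks the service: pause it again and keep going.
            pause(job)
            failing.append(job)
    return failing


if __name__ == "__main__":
    # Simulated environment with made-up job names, for illustration only.
    enabled = set()
    broken = {"job-b"}
    result = find_failing_jobs(
        ["job-a", "job-b", "job-c"],
        resume=enabled.add,
        pause=enabled.discard,
        is_healthy=lambda: not (enabled & broken),
    )
    print(result)
```

Checking jobs one at a time keeps the blast radius small: at most one suspect job is active on top of the known-good ones at any point.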

I was not able to troubleshoot and fix all the jobs though. The SLO definitions that still need attention are in the GCS bucket and date from Oct 27, 2022 (vs. the new ones, uploaded on Oct 20, 2023).

Finally, I configured an uptime check and an Availability SLO, both with alerting to lvaylet@google.com, to prevent the issue from happening again (or at least from going unnoticed for a long period of time).
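To illustrate what the Availability SLO measures, here is a rough sketch that turns a series of uptime probe results into an SLI and the remaining error budget. The function names, the sample probe data and the 99% target are assumptions for illustration, not the values configured in the test project.

```python
from typing import Sequence


def availability(probe_results: Sequence[bool]) -> float:
    """Return the fraction of successful probes (the availability SLI)."""
    if not probe_results:
        raise ValueError("at least one probe result is required")
    return sum(probe_results) / len(probe_results)


def error_budget_remaining(sli: float, target: float) -> float:
    """Return the fraction of the error budget still unspent.

    1.0 means nothing has been consumed, 0.0 means the budget is exhausted,
    and a negative value means the SLO is violated.
    """
    if not 0.0 <= target < 1.0:
        raise ValueError("target must be in [0, 1)")
    return 1.0 - (1.0 - sli) / (1.0 - target)


if __name__ == "__main__":
    # Hypothetical probe history: True = the service answered the check.
    probes = [True] * 95 + [False] * 5
    sli = availability(probes)
    print(sli, error_budget_remaining(sli, target=0.99))
```

A service that never responds, as was the case here, has an SLI of 0 and a strongly negative remaining budget, which is exactly the situation the new alerting is meant to surface.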

What did you expect?

I expected the Cloud Run service to be available for my end-to-end tests.

Screenshots

Cloud Scheduler Jobs are here:
https://console.cloud.google.com/cloudscheduler?referrer=search&authuser=2&project=slo-generator-ci-a2b4

Config and SLO definitions are here:
https://console.cloud.google.com/storage/browser/slo-generator-ci-a2b4?authuser=2&project=slo-generator-ci-a2b4

Relevant log output

No response
