While designing end-to-end tests in #360, I discovered that the Cloud Run service deployed when a new version is released did not respond to any query. With no Availability SLO, uptime check or alerting in place, I was not notified before these manual tests.
Looking at the logs, it looks like the issue has been going on for at least 30 days (the max default retention period for logs). I was not able to trace the exact source of the error but it is definitely one of the Cloud Scheduler Jobs used for simulating traffic. The Cloud Run service restarted successfully after I paused all the Cloud Scheduler Jobs, and stayed that way.
I managed to troubleshoot and fix the configuration file as well as some of the SLO definitions. I uploaded these files to the Cloud Storage bucket used by the Cloud Scheduler Jobs, and re-enabled each job one by one.
I was not able to troubleshoot and fix all the jobs though. The SLO definitions that still need attention are in the GCS bucket and date from Oct 27, 2022 (vs. the new ones, uploaded on Oct 20, 2023).
Finally, I configured an uptime check and an Availability SLO, both with alerting to lvaylet@google.com, to prevent the issue from happening again (or at least go unnoticed for a long period of time).
What did you expect?
I expected the Cloud Run service to be available for my end-to-end tests.
Screenshots
Cloud Scheduler Jobs are here:
https://console.cloud.google.com/cloudscheduler?referrer=search&authuser=2&project=slo-generator-ci-a2b4
Config and SLO definitions are here:
https://console.cloud.google.com/storage/browser/slo-generator-ci-a2b4?authuser=2&project=slo-generator-ci-a2b4
Relevant log output
No response
Code of Conduct
[X] I agree to follow this project's Code of Conduct
SLO Generator Version
v2.5.1
Python Version
3.9
What happened?
While designing end-to-end tests in #360, I discovered that the Cloud Run service deployed when a new version is released did not respond to any query. With no Availability SLO, uptime check or alerting in place, I was not notified before these manual tests.
Looking at the logs, it looks like the issue has been going on for at least 30 days (the max default retention period for logs). I was not able to trace the exact source of the error but it is definitely one of the Cloud Scheduler Jobs used for simulating traffic. The Cloud Run service restarted successfully after I paused all the Cloud Scheduler Jobs, and stayed that way.
I managed to troubleshoot and fix the configuration file as well as some of the SLO definitions. I uploaded these files to the Cloud Storage bucket used by the Cloud Scheduler Jobs, and re-enabled each job one by one.
I was not able to troubleshoot and fix all the jobs though. The SLO definitions that still need attention are in the GCS bucket and date from Oct 27, 2022 (vs. the new ones, uploaded on Oct 20, 2023).
Finally, I configured an uptime check and an Availability SLO, both with alerting to lvaylet@google.com, to prevent the issue from happening again (or at least go unnoticed for a long period of time).
What did you expect?
I expected the Cloud Run service to be available for my end-to-end tests.
Screenshots
Relevant log output
No response
Code of Conduct