lvaylet opened this issue 7 months ago
Example log entry:
```json
{
  "textPayload": "The request has been terminated because it has reached the maximum request timeout. To change this limit, see https://cloud.google.com/run/docs/configuring/request-timeout",
  "insertId": "66042246000dd918ac9c23d5",
  "httpRequest": {
    "requestMethod": "POST",
    "requestUrl": "<REDACTED>",
    "requestSize": "494",
    "status": 504,
    "responseSize": "713",
    "userAgent": "Google-Cloud-Scheduler",
    "remoteIp": "107.178.196.96",
    "serverIp": "216.239.32.53",
    "latency": "29.999816421s",
    "protocol": "HTTP/1.1"
  },
  "resource": {
    "type": "cloud_run_revision",
    "labels": {
      "configuration_name": "slo-generator",
      "service_name": "slo-generator",
      "revision_name": "slo-generator-00154-2gv",
      "project_id": "<REDACTED>",
      "location": "<REDACTED>"
    }
  },
  "timestamp": "2024-03-27T13:42:00.906092Z",
  "severity": "ERROR",
  "labels": {
    "instanceId": "<REDACTED>"
  },
  "logName": "projects/<REDACTED>/logs/run.googleapis.com%2Frequests",
  "trace": "projects/<REDACTED>/traces/de38c4e5fb97cd35e44c685f67a1768b",
  "receiveTimestamp": "2024-03-27T13:42:30.966597414Z",
  "spanId": "16424922216080379185"
}
```
The reported "latency" value (29.999816421s) is very close to the 30s request timeout configured in Cloud Run.
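For anyone triaging the same symptom, one way to confirm that the 504s all sit right at the request timeout is to pull the matching request logs and compare their latency to the limit. A minimal sketch using the google-cloud-logging client (the service name in the filter is taken from the log entry above; adjust as needed):

```python
# Sketch: list recent Cloud Run 504s for the slo-generator service and print
# their latency. Assumes google-cloud-logging is installed and Application
# Default Credentials point at the right project.
from google.cloud import logging as cloud_logging

client = cloud_logging.Client()

# Mirrors the example entry above: Cloud Run request logs with a 504 status.
log_filter = (
    'resource.type="cloud_run_revision" '
    'AND resource.labels.service_name="slo-generator" '
    'AND httpRequest.status=504'
)

for entry in client.list_entries(
    filter_=log_filter, order_by=cloud_logging.DESCENDING, max_results=20
):
    http = entry.http_request or {}
    print(entry.timestamp, http.get("latency"), http.get("requestUrl"))
```

If every returned latency is within a few milliseconds of the configured timeout, the requests are being cut off by Cloud Run rather than failing inside the application.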
Additional info: at one point we tried increasing the timeout to the maximum of 1 hour. This did not seem to help, i.e. if the SLO evaluation doesn't finish executing in 5 minutes, it won't finish in 1 hour either. To make things worse, this actually ran up our GCP bill, as we were charged for the compute time while the instance was waiting to time out. We also tried increasing the Cloud Run max concurrency and max instances. Max instances helped a little, as it would sometimes allow SLOs to be processed by additional instances, but eventually all instances would be stuck waiting to time out.
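As a side note for anyone comparing configurations, the three settings mentioned above (request timeout, concurrency, max instances) can be read back from the service definition. A rough sketch with the Cloud Run Admin API v2 Python client, where PROJECT_ID and REGION in the resource name are placeholders:

```python
# Sketch: read back the Cloud Run settings discussed above (request timeout,
# concurrency, max instances) via the Admin API v2. Requires the
# google-cloud-run package; PROJECT_ID and REGION are placeholders.
from google.cloud import run_v2

client = run_v2.ServicesClient()
name = "projects/PROJECT_ID/locations/REGION/services/slo-generator"
service = client.get_service(name=name)

template = service.template
print("request timeout :", template.timeout)  # e.g. 30s
print("concurrency     :", template.max_instance_request_concurrency)
print("max instances   :", template.scaling.max_instance_count)
```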
Thank you @skinlayers. Same conclusions here: the requests to Cloud Run are timing out, but increasing the timeout value does not help, so the root cause must lie elsewhere. I have not found a way to reproduce it consistently though. I will enable debug mode and wait for the next occurrence.
@skinlayers Does this happen for a specific backend/exporter, or all backends/exporters in general?
I also found that deploying a new revision of the Cloud Run service removes most (all?) of the 504 errors. But they come back after some time. I ruled out memory leaks as memory utilization never goes above 35% (with 512MiB allocated to each instance of the container). Quite the opposite actually. 504 errors consistently show up when memory utilization is at its lowest point, around 22%.
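To check that correlation over a longer window, both series can be pulled from Cloud Monitoring and compared side by side. A rough sketch with the monitoring_v3 client, assuming the built-in Cloud Run metrics (request_count filtered to 5xx responses, and container memory utilization); PROJECT_ID is a placeholder:

```python
# Sketch: pull Cloud Run 5xx request counts and container memory utilization
# for the last 6 hours, to correlate 504 bursts with memory usage.
# Assumes google-cloud-monitoring is installed; PROJECT_ID is a placeholder.
import time
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project_name = "projects/PROJECT_ID"

now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"start_time": {"seconds": now - 6 * 3600}, "end_time": {"seconds": now}}
)

def fetch(metric_filter):
    return client.list_time_series(
        request={
            "name": project_name,
            "filter": metric_filter,
            "interval": interval,
            "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
        }
    )

server_errors = fetch(
    'metric.type="run.googleapis.com/request_count" '
    'AND resource.labels.service_name="slo-generator" '
    'AND metric.labels.response_code_class="5xx"'
)
memory_utilization = fetch(
    'metric.type="run.googleapis.com/container/memory/utilizations" '
    'AND resource.labels.service_name="slo-generator"'
)

# request_count is an INT64 delta metric, so each point carries a plain count.
for ts in server_errors:
    for point in ts.points:
        print(point.interval.end_time, point.value.int64_value)
```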
In my CI environment, I just have:
- `slo_cm_gae_latency724ms`, triggered by Cloud Scheduler every minute
- `DEBUG` set to `1`

I will keep monitoring the 504 error rate, and introduce new uptime checks or synthetic probes on a regular basis.
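For the synthetic probe part, a minimal version could simply POST to the service the way Cloud Scheduler does and record status and latency. This is only a sketch; the URL and request body are placeholders, not the actual configuration:

```python
# Sketch of a minimal synthetic probe: POST to the slo-generator Cloud Run
# service the same way Cloud Scheduler does and record status and latency.
# SERVICE_URL and the request body are placeholders for the real deployment.
import time
import requests

SERVICE_URL = "https://slo-generator-<REDACTED>.a.run.app"  # placeholder

start = time.monotonic()
try:
    # Cloud Run enforces its own request timeout (30s here), so give the client
    # slightly more headroom in order to observe the 504 instead of a
    # client-side timeout error.
    response = requests.post(SERVICE_URL, json={}, timeout=35)
    status = response.status_code
except requests.exceptions.RequestException as exc:
    status = f"error: {exc}"

elapsed = time.monotonic() - start
print(f"status={status} elapsed={elapsed:.3f}s")
```

Run on a schedule, this gives an independent view of the 504 rate that does not depend on the Cloud Scheduler job itself.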
SLO Generator Version
v2.5.2
Python Version
3.9
What happened?
After I designed synthetic probes and availability/latency SLOs in the CI environment, a lot of 504 errors started showing up in the logs in a seemingly random pattern. As a result, the error budget burns quite fast during these outages. The incident usually resolves itself in under 10 minutes, then shows up again later on.
What did you expect?
No 504 errors and no timeouts when calling the SLO Generator.
Screenshots
No response
Relevant log output
No response