google / slo-generator

SLO Generator computes SLIs, SLOs, Error Budgets and Burn Rates from supported backends, then exports an SLO report to supported targets.
Apache License 2.0

🐛 [BUG] - Synthetic Probes show a high number of 504 errors #441

Open lvaylet opened 7 months ago

lvaylet commented 7 months ago

SLO Generator Version

v2.5.2

Python Version

3.9

What happened?

After designing synthetic probes and availability/latency SLOs in the CI environment, a large number of 504 errors started showing up in the logs, in a seemingly random pattern. As a result, the error budget burns quite fast during these outages. Each incident usually resolves itself in under 10 minutes, then shows up again later on.

What did you expect?

No 504 errors and no timeouts when calling the SLO Generator.

Screenshots

No response

Relevant log output

No response

lvaylet commented 7 months ago

Example log entry:

{
  "textPayload": "The request has been terminated because it has reached the maximum request timeout. To change this limit, see https://cloud.google.com/run/docs/configuring/request-timeout",
  "insertId": "66042246000dd918ac9c23d5",
  "httpRequest": {
    "requestMethod": "POST",
    "requestUrl": "<REDACTED>",
    "requestSize": "494",
    "status": 504,
    "responseSize": "713",
    "userAgent": "Google-Cloud-Scheduler",
    "remoteIp": "107.178.196.96",
    "serverIp": "216.239.32.53",
    "latency": "29.999816421s",
    "protocol": "HTTP/1.1"
  },
  "resource": {
    "type": "cloud_run_revision",
    "labels": {
      "configuration_name": "slo-generator",
      "service_name": "slo-generator",
      "revision_name": "slo-generator-00154-2gv",
      "project_id": "<REDACTED>",
      "location": "<REDACTED>"
    }
  },
  "timestamp": "2024-03-27T13:42:00.906092Z",
  "severity": "ERROR",
  "labels": {
    "instanceId": "<REDACTED>"
  },
  "logName": "projects/<REDACTED>/logs/run.googleapis.com%2Frequests",
  "trace": "projects/<REDACTED>/traces/de38c4e5fb97cd35e44c685f67a1768b",
  "receiveTimestamp": "2024-03-27T13:42:30.966597414Z",
  "spanId": "16424922216080379185"
}
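
For anyone triaging this, here is a minimal, hypothetical sketch (assuming the google-cloud-logging client library, a placeholder PROJECT_ID and a 24h lookback window) that pulls the recent 504 entries from the Cloud Run request log, so the timing pattern and the ~30s latencies are easier to see:

# Hypothetical helper: list recent 504s returned by the slo-generator Cloud Run service.
from datetime import datetime, timedelta, timezone

from google.cloud import logging

PROJECT_ID = "my-project"       # placeholder
SERVICE_NAME = "slo-generator"

client = logging.Client(project=PROJECT_ID)
since = (datetime.now(timezone.utc) - timedelta(hours=24)).isoformat()
log_filter = (
    'resource.type="cloud_run_revision" '
    f'resource.labels.service_name="{SERVICE_NAME}" '
    'httpRequest.status=504 '
    f'timestamp>="{since}"'
)

# One line per 504: timestamp and the latency reported by Cloud Run.
for entry in client.list_entries(filter_=log_filter, order_by=logging.DESCENDING):
    request = entry.http_request or {}
    print(entry.timestamp, request.get("latency"))
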
lvaylet commented 7 months ago

The reported "latency" value (29.999816421s) is really close to the 30s timeout configured in Cloud Run:

[screenshot: the Cloud Run service's request timeout is configured at 30 seconds]
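
If it helps to reproduce this outside of Cloud Scheduler, here is a hypothetical sketch (SERVICE_URL and SLO_CONFIG_PATH are placeholders, and it assumes a typical slo-generator API deployment that accepts the SLO configuration as the POST body) that calls the service directly with a generous client-side timeout and prints the status code and observed latency:

# Hypothetical reproduction script: POST an SLO config to the Cloud Run service directly.
import google.auth.transport.requests
import google.oauth2.id_token
import requests

SERVICE_URL = "https://slo-generator-xxxxx-ew.a.run.app"  # placeholder
SLO_CONFIG_PATH = "slo.yaml"                              # placeholder

# Cloud Run services usually require an authenticated caller; fetch an ID token
# whose audience is the service URL.
auth_request = google.auth.transport.requests.Request()
token = google.oauth2.id_token.fetch_id_token(auth_request, SERVICE_URL)

with open(SLO_CONFIG_PATH, "rb") as f:
    response = requests.post(
        SERVICE_URL,
        data=f.read(),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "text/x-yaml",  # assumption: API-mode deployment taking YAML as body
        },
        timeout=300,  # generous client-side timeout, well above the 30s Cloud Run limit
    )

print(response.status_code, response.elapsed.total_seconds())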

skinlayers commented 7 months ago

Additional info: at one point we tried increasing the timeout to the maximum of 1 hour. This did not seem to help, i.e. if the SLO evaluation does not finish in 5 minutes, it will not finish in 1 hour either. To make things worse, this actually ran up our GCP bill, as we were charged for the compute time while the instances sat waiting to time out. We also tried increasing the Cloud Run max concurrency and max instances. Raising max instances helped a little, as it sometimes allowed SLOs to be processed by additional instances, but eventually all instances ended up stuck waiting to time out.
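
For reference, a hypothetical sketch (assuming the google-cloud-run Admin API client library and placeholder project/region values) to read back the timeout, concurrency and scaling settings currently deployed on the service:

# Hypothetical helper: inspect the Cloud Run revision template of the slo-generator service.
from google.cloud import run_v2

PROJECT_ID = "my-project"  # placeholder
REGION = "europe-west1"    # placeholder

client = run_v2.ServicesClient()
service = client.get_service(
    name=f"projects/{PROJECT_ID}/locations/{REGION}/services/slo-generator"
)

template = service.template
print("request timeout:", template.timeout)                            # e.g. 30s
print("max concurrency:", template.max_instance_request_concurrency)
print("max instances:  ", template.scaling.max_instance_count)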

lvaylet commented 7 months ago

Thank you @skinlayers. Same conclusions here. The requests to Cloud Run are timing out, yet increasing the timeout value does not help, so the root cause lies somewhere in between. I have not found a way to reproduce the issue consistently though. I will enable debug mode and wait for the next occurrence.

lvaylet commented 7 months ago

@skinlayers Does this happen for a specific backend/exporter, or all backends/exporters in general?

lvaylet commented 7 months ago

I also found that deploying a new revision of the Cloud Run service removes most (all?) of the 504 errors. But they come back after some time. I ruled out memory leaks as memory utilization never goes above 35% (with 512MiB allocated to each instance of the container). Quite the opposite actually. 504 errors consistently show up when memory utilization is at its lowest point, around 22%.
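
To double-check that correlation, here is a hypothetical sketch (assuming the google-cloud-monitoring client library and a placeholder PROJECT_ID) that pulls the 5xx request count and the container memory utilization for the service over the last six hours:

# Hypothetical helper: compare 5xx request counts with memory utilization for slo-generator.
import time

from google.cloud import monitoring_v3

PROJECT_ID = "my-project"  # placeholder
client = monitoring_v3.MetricServiceClient()
project_name = f"projects/{PROJECT_ID}"

now = time.time()
interval = monitoring_v3.TimeInterval(
    {
        "end_time": {"seconds": int(now)},
        "start_time": {"seconds": int(now - 6 * 3600)},
    }
)

filters = {
    "5xx count": (
        'metric.type="run.googleapis.com/request_count" '
        'resource.labels.service_name="slo-generator" '
        'metric.labels.response_code_class="5xx"'
    ),
    "memory utilization": (
        'metric.type="run.googleapis.com/container/memory/utilizations" '
        'resource.labels.service_name="slo-generator"'
    ),
}

for label, metric_filter in filters.items():
    results = client.list_time_series(
        request={
            "name": project_name,
            "filter": metric_filter,
            "interval": interval,
            "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
        }
    )
    print(f"--- {label} ---")
    for series in results:
        for point in series.points:
            print(point.interval.end_time, point.value)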

lvaylet commented 7 months ago

In my CI environment, I just:

I will keep monitoring the 504 error rate, and introduce new uptime checks or synthetic probes on a regular basis.