kiwigrid / k8s-sidecar

This is a docker container intended to run inside a kubernetes cluster to collect config maps with a specified label and store the included files in a local folder.

BUG: Fix missing sleep in _watch_resource_loop #373

Open yetisage opened 3 days ago

yetisage commented 3 days ago

Immediately after upgrading a Loki Helm release, I noticed a sharp increase in the Kubernetes API server's memory usage. I found that the loki-sc-rules sidecars (which use the kiwigrid/k8s-sidecar image) were suddenly logging far more than usual, with every log line looking something like:

{"time": "2024-11-24T15:56:24.320161+00:00", "taskName": null, "msg": "ApiException when calling kubernetes: (403)\nReason: Forbidden\nHTTP response headers: HTTPHeaderDict({'Audit-Id': '33df569c-1218-4e1b-ad8e-5092c02b0d98', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'X-Kubernetes-Pf-Flowschema-Uid': 'e3350d13-36fe-460d-9422-d90ba1a8d608', 'X-Kubernetes-Pf-Prioritylevel-Uid': '7c4d615c-8ab4-4786-b3c8-1f8725853156', 'Date': 'Sun, 24 Nov 2024 15:56:24 GMT', 'Content-Length': '295'})\nHTTP response body: b'{\"kind\":\"Status\",\"apiVersion\":\"v1\",\"metadata\":{},\"status\":\"Failure\",\"message\":\"secrets is forbidden: User \\\\\"system:serviceaccount:monitoring:loki\\\\\" cannot watch resource \\\\\"secrets\\\\\" in API group \\\\\"\\\\\" in the namespace \\\\\"monitoring\\\\\"\",\"reason\":\"Forbidden\",\"details\":{\"kind\":\"secrets\"},\"code\":403}\\n'\n\n", "level": "ERROR"}

Looking into it, _watch_resource_loop was changed in #326 so that the sleeps were moved into the individual except clauses. However, the ApiException except clause did not get its own sleep, which causes the loop to reopen watch requests as fast as it can (sketched below).
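
To make the problem concrete, here is a minimal sketch of the retry-loop pattern and of where the added sleep belongs. The function names mirror _watch_resource_loop and _watch_resource_iterator, but the bodies, the logger setup, and the 5-second interval are illustrative and not copied from resources.py:

```python
import logging
import time

from kubernetes.client.rest import ApiException
from urllib3.exceptions import MaxRetryError, ProtocolError

logger = logging.getLogger(__name__)

WATCH_RETRY_SLEEP_SECONDS = 5  # illustrative interval, not the sidecar's actual setting


def _watch_resource_iterator(mode, *args):
    """Stand-in for the real iterator that opens a watch and streams events."""
    raise ApiException(status=403, reason="Forbidden")


def _watch_resource_loop(mode, *args):
    while True:
        try:
            # Open a long-lived watch against the Kubernetes API server.
            _watch_resource_iterator(mode, *args)
        except ApiException as e:
            logger.error("ApiException when calling kubernetes: %s", e)
            # The fix: without this sleep, a persistent 403 (or 429) makes the
            # loop reopen the watch as fast as it can, hammering the API server.
            time.sleep(WATCH_RETRY_SLEEP_SECONDS)
        except ProtocolError as e:
            logger.error("ProtocolError when calling kubernetes: %s", e)
            time.sleep(WATCH_RETRY_SLEEP_SECONDS)
        except MaxRetryError as e:
            logger.error("MaxRetryError when calling kubernetes: %s", e)
            time.sleep(WATCH_RETRY_SLEEP_SECONDS)
        except Exception as e:
            logger.error("Received unknown exception: %s", e)
            time.sleep(WATCH_RETRY_SLEEP_SECONDS)
```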

I created my own patched image with the change and ran a small test: I spun up a single-node Kubernetes cluster, installed Loki using the Helm chart, and broke the ClusterRoleBinding to the ServiceAccount so that the sidecar would receive a 403 status code.

I labeled the pods with sidecar_version to more easily distinguish between the log rates:

Query: sum by(level, sidecar_version) (count_over_time({container="loki-sc-rules"} | json [$__auto]))

[screenshot: log rate per level and sidecar_version]

After switching to the patched image, the rate of ERROR logs dropped from 200-300 per second to roughly 2 per 5 seconds.

yetisage commented 3 days ago

Ironically, this will probably also happen if the Kubernetes API server returns a 429 (Too Many Requests) error 😃
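
For what it's worth, a 429 from the API server is raised by the Python Kubernetes client as the same ApiException type as the 403 above, so it lands in the same except clause and is throttled by the same added sleep. A tiny illustration (the constructed exceptions here are synthetic, purely to show the shared type):

```python
from kubernetes.client.rest import ApiException

# Both a 403 (Forbidden) and a 429 (Too Many Requests) surface as ApiException,
# so the sleep added in this PR throttles retries for either status code.
for err in (ApiException(status=403, reason="Forbidden"),
            ApiException(status=429, reason="Too Many Requests")):
    print(type(err).__name__, err.status, err.reason)
```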