kiwigrid / k8s-sidecar

This is a docker container intended to run inside a kubernetes cluster to collect config maps with a specified label and store the included files in a local folder.

BUG: Fix missing sleep in _watch_resource_loop #373

Open yetisage opened 3 days ago

yetisage commented 3 days ago

Immediately after upgrading a Loki Helm release, I noticed a sharp increase in the Kubernetes API server's memory usage. I found that the loki-sc-rules sidecars (which use the kiwigrid/k8s-sidecar image) were suddenly logging far more than usual, with every log line looking something like:

{"time": "2024-11-24T15:56:24.320161+00:00", "taskName": null, "msg": "ApiException when calling kubernetes: (403)\nReason: Forbidden\nHTTP response headers: HTTPHeaderDict({'Audit-Id': '33df569c-1218-4e1b-ad8e-5092c02b0d98', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'X-Kubernetes-Pf-Flowschema-Uid': 'e3350d13-36fe-460d-9422-d90ba1a8d608', 'X-Kubernetes-Pf-Prioritylevel-Uid': '7c4d615c-8ab4-4786-b3c8-1f8725853156', 'Date': 'Sun, 24 Nov 2024 15:56:24 GMT', 'Content-Length': '295'})\nHTTP response body: b'{\"kind\":\"Status\",\"apiVersion\":\"v1\",\"metadata\":{},\"status\":\"Failure\",\"message\":\"secrets is forbidden: User \\\\\"system:serviceaccount:monitoring:loki\\\\\" cannot watch resource \\\\\"secrets\\\\\" in API group \\\\\"\\\\\" in the namespace \\\\\"monitoring\\\\\"\",\"reason\":\"Forbidden\",\"details\":{\"kind\":\"secrets\"},\"code\":403}\\n'\n\n", "level": "ERROR"}

Looking into it, _watch_resource_loop was changed in #326 so that the sleeps were moved into the individual except clauses. However, the ApiException except clause did not get its own sleep, which causes the loop to reopen watch requests as fast as it can (sketched below).
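
To make the problem concrete, here is a minimal sketch of the retry-loop pattern and of where the added sleep belongs. The function names mirror _watch_resource_loop and _watch_resource_iterator, but the bodies, the logger setup, and the 5-second interval are illustrative and not copied from resources.py:

```python
import logging
import time

from kubernetes.client.rest import ApiException
from urllib3.exceptions import MaxRetryError, ProtocolError

logger = logging.getLogger(__name__)

WATCH_RETRY_SLEEP_SECONDS = 5  # illustrative interval, not the sidecar's actual setting


def _watch_resource_iterator(mode, *args):
    """Stand-in for the real iterator that opens a watch and streams events."""
    raise ApiException(status=403, reason="Forbidden")


def _watch_resource_loop(mode, *args):
    while True:
        try:
            # Open a long-lived watch against the Kubernetes API server.
            _watch_resource_iterator(mode, *args)
        except ApiException as e:
            logger.error("ApiException when calling kubernetes: %s", e)
            # The fix: without this sleep, a persistent 403 (or 429) makes the
            # loop reopen the watch as fast as it can, hammering the API server.
            time.sleep(WATCH_RETRY_SLEEP_SECONDS)
        except ProtocolError as e:
            logger.error("ProtocolError when calling kubernetes: %s", e)
            time.sleep(WATCH_RETRY_SLEEP_SECONDS)
        except MaxRetryError as e:
            logger.error("MaxRetryError when calling kubernetes: %s", e)
            time.sleep(WATCH_RETRY_SLEEP_SECONDS)
        except Exception as e:
            logger.error("Received unknown exception: %s", e)
            time.sleep(WATCH_RETRY_SLEEP_SECONDS)
```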

I created my own patched image with the change and ran a small test: I spun up a single-node Kubernetes cluster, installed Loki using the Helm chart, and broke the ClusterRoleBinding to the ServiceAccount so that the sidecar would receive a 403 status code.

I labeled the pods with sidecar_version to more easily distinguish between the log rates:

Query: sum by(level, sidecar_version) (count_over_time({container="loki-sc-rules"} | json [$__auto]))

[screenshot: log rate per level and sidecar_version]

After switching to the patched image, the rate of ERROR logs dropped from 200-300 per second to roughly 2 per 5 seconds.

yetisage commented 3 days ago

Ironically, this will probably also happen if the Kubernetes API server returns a 429 (Too Many Requests) error 😃
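
For what it's worth, a 429 from the API server is raised by the Python Kubernetes client as the same ApiException type as the 403 above, so it lands in the same except clause and is throttled by the same added sleep. A tiny illustration (the constructed exceptions here are synthetic, purely to show the shared type):

```python
from kubernetes.client.rest import ApiException

# Both a 403 (Forbidden) and a 429 (Too Many Requests) surface as ApiException,
# so the sleep added in this PR throttles retries for either status code.
for err in (ApiException(status=403, reason="Forbidden"),
            ApiException(status=429, reason="Too Many Requests")):
    print(type(err).__name__, err.status, err.reason)
```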