This is a Docker container intended to run inside a Kubernetes cluster to collect ConfigMaps with a specified label and store the included files in a local folder.
BUG: Fix missing sleep in _watch_resource_loop #373
When upgrading a Loki Helm release, I noticed a sharp increase in the Kubernetes API server's memory usage immediately afterwards.
I found that the loki-sc-rules sidecars (which use the kiwigrid/k8s-sidecar image) were suddenly logging a lot more than usual, with all log lines being something like:
{"time": "2024-11-24T15:56:24.320161+00:00", "taskName": null, "msg": "ApiException when calling kubernetes: (403)\nReason: Forbidden\nHTTP response headers: HTTPHeaderDict({'Audit-Id': '33df569c-1218-4e1b-ad8e-5092c02b0d98', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'X-Kubernetes-Pf-Flowschema-Uid': 'e3350d13-36fe-460d-9422-d90ba1a8d608', 'X-Kubernetes-Pf-Prioritylevel-Uid': '7c4d615c-8ab4-4786-b3c8-1f8725853156', 'Date': 'Sun, 24 Nov 2024 15:56:24 GMT', 'Content-Length': '295'})\nHTTP response body: b'{\"kind\":\"Status\",\"apiVersion\":\"v1\",\"metadata\":{},\"status\":\"Failure\",\"message\":\"secrets is forbidden: User \\\\\"system:serviceaccount:monitoring:loki\\\\\" cannot watch resource \\\\\"secrets\\\\\" in API group \\\\\"\\\\\" in the namespace \\\\\"monitoring\\\\\"\",\"reason\":\"Forbidden\",\"details\":{\"kind\":\"secrets\"},\"code\":403}\\n'\n\n", "level": "ERROR"}
Looking into it, _watch_resource_loop seems to have been changed in #326, where the sleeps were moved into the individual except clauses. However, the ApiException except clause did not get its own sleep, so the loop creates new watch requests as fast as it can.
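For reference, here is a minimal sketch of the retry loop and the fix. This is not the sidecar's actual code: the watch_iterator callable and the 5-second back-off are placeholders standing in for _watch_resource_iterator and whatever interval #326 uses.

```python
import logging
import time

from kubernetes.client.rest import ApiException

RETRY_SLEEP_SECONDS = 5  # assumed back-off; the real value may differ


def _watch_resource_loop(watch_iterator, *args):
    """Sketch of the retry loop; watch_iterator stands in for the sidecar's
    _watch_resource_iterator, which blocks until the watch ends or fails."""
    while True:
        try:
            watch_iterator(*args)
        except ApiException as e:
            logging.error("ApiException when calling kubernetes: %s", e)
            # This sleep was missing after #326: without it, a persistent 403
            # makes the loop open new watch requests as fast as it can.
            time.sleep(RETRY_SLEEP_SECONDS)
        except Exception as e:
            logging.error("Exception caught: %s", e)
            time.sleep(RETRY_SLEEP_SECONDS)
```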
I created my own patched image with the change and ran a small test on a single-node Kubernetes cluster.
The test consisted of spinning up a small Kubernetes cluster, installing Loki using the Helm chart, and breaking the ClusterRoleBinding for the ServiceAccount so that the sidecar receives a 403 status code.
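For illustration, one way to "break" the binding is simply to delete it with the Kubernetes Python client. The binding name below is an assumption and depends on the Helm release; look it up first with kubectl.

```python
from kubernetes import client, config

# Assumes a kubeconfig pointing at the test cluster and that the Loki chart
# created a ClusterRoleBinding for its ServiceAccount; the name here is a
# guess and should be checked with `kubectl get clusterrolebindings`.
config.load_kube_config()
rbac = client.RbacAuthorizationV1Api()
rbac.delete_cluster_role_binding(name="loki-clusterrolebinding")
# With the binding gone, the sidecar's watch calls start returning 403 Forbidden.
```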
I labeled the pods with sidecar_version to more easily distinguish between the log rates:
Query:
sum by(level, sidecar_version) (count_over_time({container="loki-sc-rules"} | json [$__auto]))
After changing to the patched image, the rate of ERROR logs dropped from 200-300 per second to about 2 every 5 seconds.