prometheus-operator / runbooks

https://runbooks.prometheus-operator.dev

KubeAPIErrorBudgetBurn - Troubleshooting #36

Closed: allidoiswin10 closed this issue 1 year ago

allidoiswin10 commented 2 years ago

Reading KubeAPIErrorBudgetBurn.md, it's not very clear what the end user should do once they've run the slow read-request queries.

For example, our cluster is firing KubeAPIErrorBudgetBurn, and running the queries from the markdown only returns results for the resource-scoped query. What's next?

Is there a query we can run to pinpoint what's causing the issue? The Grafana API dashboard isn't overly clear. I've checked Grafana for compute usage on all the kube-system pods and none are hitting their limits. Similarly, all the control plane hosts look fine.

Environment: 3 control planes, 5 nodes
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.2", GitCommit:"8b5a19147530eaac9476b0ab82980b4088bbc1b2", GitTreeState:"clean", BuildDate:"2021-09-15T21:38:50Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.2", GitCommit:"8b5a19147530eaac9476b0ab82980b4088bbc1b2", GitTreeState:"clean", BuildDate:"2021-09-15T21:32:41Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}

Prometheus chart version - 18.0.0

Logs from API server:

19:05:52.870145 1 writers.go:117] apiserver was unable to write a JSON response: http: Handler timeout
19:05:52.870196 1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}: http: Handler timeout
19:05:52.871269 1 writers.go:130] apiserver was unable to write a fallback JSON response: http: Handler timeout
19:05:52.883973 1 timeout.go:135] post-timeout activity - time-elapsed: 47.79694ms, GET "/api/v1/namespaces" result: <nil>
19:06:45.153567 1 wrap.go:54] timeout or abort while handling: method=GET URI="/api/v1/pods?fieldSelector=spec.nodeName%3D<hostname>&limit=500&resourceVersion=0" audit-ID="0b71fe10-7ade-4a2d-a42e-d279ba403f84"
19:06:45.155852 1 timeout.go:135] post-timeout activity - time-elapsed: 1.857692ms, GET "/api/v1/pods" result: <nil>
19:07:03.529750 1 writers.go:117] apiserver was unable to write a JSON response: http: Handler timeout
19:07:03.529815 1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}: http: Handler timeout
19:07:03.530834 1 writers.go:130] apiserver was unable to write a fallback JSON response: http: Handler timeout
19:07:03.532075 1 timeout.go:135] post-timeout activity - time-elapsed: 10.29388ms, GET "/api/v1/namespaces" result: <nil>
19:08:22.935750 1 writers.go:111] apiserver was unable to close cleanly the response writer: write tcp 10.60.16.137:6443->10.60.32.139:5533: write: broken pipe
19:08:22.947004 1 wrap.go:54] timeout or abort while handling: method=GET URI="/api/v1/pods?fieldSelector=spec.nodeName%3D<hostname>&limit=500&resourceVersion=0" audit-ID="511102dd-077c-4889-844e-440bf18f152d"
19:08:22.949947 1 timeout.go:135] post-timeout activity - time-elapsed: 406.534µs, GET "/api/v1/pods" result: <nil>
19:13:05.754212 1 writers.go:117] apiserver was unable to write a JSON response: http: Handler timeout
19:13:05.754316 1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}: http: Handler timeout
19:13:05.756386 1 writers.go:130] apiserver was unable to write a fallback JSON response: http: Handler timeout
19:13:05.762336 1 timeout.go:135] post-timeout activity - time-elapsed: 32.037831ms, GET "/api/v1/pods" result: <nil>
19:13:17.486545 1 timeout.go:135] post-timeout activity - time-elapsed: 57.430085ms, GET "/version" result: <nil>
19:14:45.087724 1 writers.go:117] apiserver was unable to write a JSON response: http: Handler timeout
19:14:45.089258 1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}: http: Handler timeout
19:14:45.091079 1 writers.go:130] apiserver was unable to write a fallback JSON response: http: Handler timeout
19:14:45.101729 1 timeout.go:135] post-timeout activity - time-elapsed: 21.970146ms, GET "/api/v1/pods" result: <nil>
19:16:25.726932 1 writers.go:117] apiserver was unable to write a JSON response: client disconnected
19:16:25.727139 1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"client disconnected"}: client disconnected
19:16:25.727896 1 wrap.go:54] timeout or abort while handling: method=GET URI="/api/v1/pods?fieldSelector=spec.nodeName%3D<hostname>&limit=500&resourceVersion=0" audit-ID="fbad31c9-780f-4964-a7ce-ee27e477ed07"
19:16:25.734406 1 writers.go:130] apiserver was unable to write a fallback JSON response: http: Handler timeout
19:16:25.745880 1 timeout.go:135] post-timeout activity - time-elapsed: 17.763979ms, GET "/api/v1/pods" result: <nil>
19:19:45.199374 1 writers.go:117] apiserver was unable to write a JSON response: http: Handler timeout
19:19:45.199483 1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}: http: Handler timeout
19:19:45.201383 1 writers.go:130] apiserver was unable to write a fallback JSON response: http: Handler timeout
19:19:45.217497 1 timeout.go:135] post-timeout activity - time-elapsed: 23.035919ms, GET "/api/v1/pods" result: <nil>
19:23:23.481610 1 writers.go:117] apiserver was unable to write a JSON response: write tcp 10.60.16.137:6443->10.60.16.139:25815: write: broken pipe
19:23:23.481665 1 status.go:71] apiserver received an error that is not an metav1.Status: &tls.permanentError{err:(*net.OpError)(0xc024b23e00)}: write tcp 10.60.16.137:6443->10.60.16.139:25815: write: broken pipe
19:23:23.481917 1 wrap.go:54] timeout or abort while handling: method=GET URI="/api/v1/namespaces?limit=500&resourceVersion=0" audit-ID="3b14d48f-fc62-4aa9-9241-c8262bb003e2"
19:23:23.482749 1 writers.go:130] apiserver was unable to write a fallback JSON response: http: Handler timeout
19:23:23.485136 1 timeout.go:135] post-timeout activity - time-elapsed: 3.166382ms, GET "/api/v1/namespaces" result: <nil>
19:28:45.184946 1 wrap.go:54] timeout or abort while handling: method=GET URI="/api/v1/pods?fieldSelector=spec.nodeName%3D<hostname>&limit=500&resourceVersion=0" audit-ID="f8cffaa0-b3dc-4bf8-91b8-5413dc992e4e"
19:28:45.194913 1 timeout.go:135] post-timeout activity - time-elapsed: 9.201799ms, GET "/api/v1/pods" result: <nil>
19:30:42.929752 1 wrap.go:54] timeout or abort while handling: method=GET URI="/api/v1/pods?fieldSelector=spec.nodeName%3D<hostname>&limit=500&resourceVersion=0" audit-ID="4cf6f2cb-17af-459a-8cc0-b82dec364877"
19:30:42.934026 1 writers.go:117] apiserver was unable to write a JSON response: http: Handler timeout
19:30:42.934110 1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}: http: Handler timeout
19:30:42.935606 1 writers.go:130] apiserver was unable to write a fallback JSON response: http: Handler timeout
19:30:42.943599 1 timeout.go:135] post-timeout activity - time-elapsed: 11.701719ms, GET "/api/v1/pods" result: <nil>
19:32:04.874772 1 writers.go:117] apiserver was unable to write a JSON response: http: Handler timeout
19:32:04.874833 1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}: http: Handler timeout
19:32:04.876696 1 writers.go:130] apiserver was unable to write a fallback JSON response: http: Handler timeout
19:32:04.879213 1 timeout.go:135] post-timeout activity - time-elapsed: 4.399361ms, GET "/api/v1/namespaces" result: <nil>

etcd pod logs are below. I used etcdctl to check the health of the etcd cluster (we're running a stacked etcd cluster) and all members reported healthy.

{"level":"info","ts":"2022-05-23T20:20:01.331Z","caller":"traceutil/trace.go:171","msg":"trace[1987616546] linearizableReadLoop","detail":"{readStateIndex:179020134; appliedIndex:179020134; }","duration":"284.221241ms","start":"2022-05-23T20:20:01.047Z","end":"2022-05-23T20:20:01.331Z","steps":["trace[1987616546] 'read index received' (duration: 284.21457ms)","trace[1987616546] 'applied index is now lower than readState.Index' (duration: 4.855µs)"],"step_count":2}
{"level":"warn","ts":"2022-05-23T20:20:01.331Z","caller":"etcdserver/util.go:166","msg":"apply request took too long","took":"284.402146ms","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"/registry/leases/kube-system/kube-scheduler\" ","response":"range_response_count:1 size:491"}
{"level":"info","ts":"2022-05-23T20:20:01.331Z","caller":"traceutil/trace.go:171","msg":"trace[269301427] range","detail":"{range_begin:/registry/leases/kube-system/kube-scheduler; range_end:; response_count:1; response_revision:151649493; }","duration":"284.457985ms","start":"2022-05-23T20:20:01.047Z","end":"2022-05-23T20:20:01.331Z","steps":["trace[269301427] 'agreement among raft nodes before linearized reading' (duration: 284.345825ms)"],"step_count":1}
{"level":"info","ts":"2022-05-23T20:20:36.775Z","caller":"mvcc/index.go:214","msg":"compact tree index","revision":151648148}
{"level":"info","ts":"2022-05-23T20:20:36.829Z","caller":"mvcc/kvstore_compaction.go:57","msg":"finished scheduled compaction","compact-revision":151648148,"took":"50.399602ms"}
{"level":"warn","ts":"2022-05-23T20:21:29.693Z","caller":"etcdserver/util.go:166","msg":"apply request took too long","took":"106.173125ms","expected-duration":"100ms","prefix":"","request":"header:<ID:6874605100731549344 username:\"kube-apiserver-etcd-client\" auth_revision:1 > txn:<compare:<target:MOD key:\"/registry/leases/kube-system/cert-manager-cainjector-leader-election\" mod_revision:151649931 > success:<request_put:<key:\"/registry/leases/kube-system/cert-manager-cainjector-leader-election\" value_size:465 >> failure:<>>","response":"size:20"}
{"level":"info","ts":"2022-05-23T20:25:36.805Z","caller":"mvcc/index.go:214","msg":"compact tree index","revision":151649671}
{"level":"info","ts":"2022-05-23T20:25:36.854Z","caller":"mvcc/kvstore_compaction.go:57","msg":"finished scheduled compaction","compact-revision":151649671,"took":"45.892ms"}
{"level":"info","ts":"2022-05-23T20:26:29.709Z","caller":"traceutil/trace.go:171","msg":"trace[820645530] linearizableReadLoop","detail":"{readStateIndex:179022350; appliedIndex:179022350; }","duration":"115.998253ms","start":"2022-05-23T20:26:29.593Z","end":"2022-05-23T20:26:29.709Z","steps":["trace[820645530] 'read index received' (duration: 115.991846ms)","trace[820645530] 'applied index is now lower than readState.Index' (duration: 4.964µs)"],"step_count":2}
{"level":"warn","ts":"2022-05-23T20:26:29.709Z","caller":"etcdserver/util.go:166","msg":"apply request took too long","took":"116.637732ms","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"/registry/configmaps/kube-system/cert-manager-cainjector-leader-election\" ","response":"range_response_count:1 size:656"}
{"level":"info","ts":"2022-05-23T20:26:29.709Z","caller":"traceutil/trace.go:171","msg":"trace[339868800] range","detail":"{range_begin:/registry/configmaps/kube-system/cert-manager-cainjector-leader-election; range_end:; response_count:1; response_revision:151651470; }","duration":"116.866592ms","start":"2022-05-23T20:26:29.593Z","end":"2022-05-23T20:26:29.709Z","steps":["trace[339868800] 'agreement among raft nodes before linearized reading' (duration: 116.576002ms)"],"step_count":1}
{"level":"info","ts":"2022-05-23T20:30:36.836Z","caller":"mvcc/index.go:214","msg":"compact tree index","revision":151651205}
{"level":"info","ts":"2022-05-23T20:30:36.891Z","caller":"mvcc/kvstore_compaction.go:57","msg":"finished scheduled compaction","compact-revision":151651205,"took":"52.376845ms"}
{"level":"warn","ts":"2022-05-23T20:32:31.190Z","caller":"etcdserver/util.go:166","msg":"apply request took too long","took":"109.16284ms","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"/registry/secrets/\" range_end:\"/registry/secrets0\" limit:500 ","response":"range_response_count:236 size:10629157"}
{"level":"info","ts":"2022-05-23T20:32:31.190Z","caller":"traceutil/trace.go:171","msg":"trace[1123711657] range","detail":"{range_begin:/registry/secrets/; range_end:/registry/secrets0; response_count:236; response_revision:151653359; }","duration":"109.38857ms","start":"2022-05-23T20:32:31.081Z","end":"2022-05-23T20:32:31.190Z","steps":["trace[1123711657] 'range keys from bolt db' (duration: 104.671827ms)"],"step_count":1}
{"level":"info","ts":"2022-05-23T20:33:35.843Z","caller":"etcdserver/server.go:1368","msg":"triggering snapshot","local-member-id":"e139a5aaf50ac130","local-member-applied-index":179024839,"local-member-snapshot-index":179014838,"local-member-snapshot-count":10000}
{"level":"info","ts":"2022-05-23T20:33:35.859Z","caller":"etcdserver/server.go:2363","msg":"saved snapshot","snapshot-index":179024839}
{"level":"info","ts":"2022-05-23T20:33:35.860Z","caller":"etcdserver/server.go:2393","msg":"compacted Raft logs","compact-index":179019839}
{"level":"info","ts":"2022-05-23T20:34:00.414Z","caller":"fileutil/purge.go:77","msg":"purged","path":"/var/lib/etcd/member/snap/0000000000000b6f-000000000aaaf072.snap"}
{"level":"info","ts":"2022-05-23T20:35:36.856Z","caller":"mvcc/index.go:214","msg":"compact tree index","revision":151652782}
{"level":"info","ts":"2022-05-23T20:35:36.914Z","caller":"mvcc/kvstore_compaction.go:57","msg":"finished scheduled compaction","compact-revision":151652782,"took":"48.097214ms"}
{"level":"info","ts":"2022-05-23T20:40:36.891Z","caller":"mvcc/index.go:214","msg":"compact tree index","revision":151654311}
{"level":"info","ts":"2022-05-23T20:40:36.939Z","caller":"mvcc/kvstore_compaction.go:57","msg":"finished scheduled compaction","compact-revision":151654311,"took":"45.260074ms"}
{"level":"info","ts":"2022-05-23T20:45:36.910Z","caller":"mvcc/index.go:214","msg":"compact tree index","revision":151655847}
{"level":"info","ts":"2022-05-23T20:45:36.957Z","caller":"mvcc/kvstore_compaction.go:57","msg":"finished scheduled compaction","compact-revision":151655847,"took":"43.940414ms"}
{"level":"info","ts":"2022-05-23T20:50:36.938Z","caller":"mvcc/index.go:214","msg":"compact tree index","revision":151657421}
{"level":"info","ts":"2022-05-23T20:50:36.996Z","caller":"mvcc/kvstore_compaction.go:57","msg":"finished scheduled compaction","compact-revision":151657421,"took":"47.139237ms"}
{"level":"info","ts":"2022-05-23T20:55:36.976Z","caller":"mvcc/index.go:214","msg":"compact tree index","revision":151658954}
{"level":"info","ts":"2022-05-23T20:55:37.026Z","caller":"mvcc/kvstore_compaction.go:57","msg":"finished scheduled compaction","compact-revision":151658954,"took":"44.93328ms"}
{"level":"info","ts":"2022-05-23T21:00:36.987Z","caller":"mvcc/index.go:214","msg":"compact tree index","revision":151660487}
{"level":"info","ts":"2022-05-23T21:00:37.043Z","caller":"mvcc/kvstore_compaction.go:57","msg":"finished scheduled compaction","compact-revision":151660487,"took":"50.938414ms"}
{"level":"info","ts":"2022-05-23T21:02:16.017Z","caller":"etcdserver/server.go:1368","msg":"triggering snapshot","local-member-id":"e139a5aaf50ac130","local-member-applied-index":179034840,"local-member-snapshot-index":179024839,"local-member-snapshot-count":10000}
{"level":"info","ts":"2022-05-23T21:02:16.029Z","caller":"etcdserver/server.go:2363","msg":"saved snapshot","snapshot-index":179034840}
{"level":"info","ts":"2022-05-23T21:02:16.030Z","caller":"etcdserver/server.go:2393","msg":"compacted Raft logs","compact-index":179029840}
{"level":"info","ts":"2022-05-23T21:02:30.470Z","caller":"fileutil/purge.go:77","msg":"purged","path":"/var/lib/etcd/member/snap/0000000000000b6f-000000000aab1783.snap"}
{"level":"warn","ts":"2022-05-23T21:03:36.903Z","caller":"etcdserver/util.go:166","msg":"apply request took too long","took":"110.989203ms","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"/registry/secrets/\" range_end:\"/registry/secrets0\" limit:500 ","response":"range_response_count:236 size:10629157"}
{"level":"info","ts":"2022-05-23T21:03:36.905Z","caller":"traceutil/trace.go:171","msg":"trace[628444962] range","detail":"{range_begin:/registry/secrets/; range_end:/registry/secrets0; response_count:236; response_revision:151663021; }","duration":"111.485416ms","start":"2022-05-23T21:03:36.792Z","end":"2022-05-23T21:03:36.904Z","steps":["trace[628444962] 'range keys from bolt db' (duration: 106.670357ms)"],"step_count":1}
{"level":"info","ts":"2022-05-23T21:05:37.022Z","caller":"mvcc/index.go:214","msg":"compact tree index","revision":151662078}
{"level":"info","ts":"2022-05-23T21:05:37.074Z","caller":"mvcc/kvstore_compaction.go:57","msg":"finished scheduled compaction","compact-revision":151662078,"took":"48.220268ms"}
allidoiswin10 commented 2 years ago

Runbook KB - https://github.com/prometheus-operator/runbooks/blob/main/content/runbooks/kubernetes/KubeAPIErrorBudgetBurn.md

I understand this alert essentially indicates that our kube-apiserver is either returning errors or serving responses that take too long for some requests. The apiserver has a "budget" of errors it is allowed; the burn rate describes how quickly a service consumes that error budget (the higher the burn rate, the faster the budget is eaten). The idea comes from https://sre.google/workbook/alerting-on-slos/#recommended_time_windows_and_burn_rates_f.
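As a rough sketch of the idea (not the exact recording rules the kubernetes-mixin uses, and ignoring the slow-request component that the real burn-rate rules also fold in), the error-ratio part of the burn rate over the 3d window looks roughly like:

# fraction of apiserver requests returning 5xx over the long window
sum(rate(apiserver_request_total{job="apiserver",code=~"5.."}[3d]))
/
sum(rate(apiserver_request_total{job="apiserver"}[3d]))

Assuming the mixin's default 99% / 30-day SLO, the error budget is 1%, so a ratio sitting at 0.01 corresponds to a burn rate of 1, which as far as I can tell is the threshold for this long: 3d / short: 6h variant.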

The runbook suggests that this alert's window (long: 3d, short: 6h) is the least severe of the KubeAPIErrorBudgetBurn variants, but still worth investigating.

I followed the queries in the runbook and the only one that returns anything is the resource-scoped read-requests query. Removing the sum from the example PromQL gives:

sort(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="1",scope=~"resource|",verb=~"LIST|GET"}[3d]))

{component="apiserver", endpoint="https", group="coordination.k8s.io", instance="10.60.16.137:6443", job="apiserver", le="1", namespace="default", resource="leases", scope="resource", service="kubernetes", verb="GET", version="v1"}
{component="apiserver", endpoint="https", group="coordination.k8s.io", instance="10.60.32.137:6443", job="apiserver", le="1", namespace="default", resource="leases", scope="resource", service="kubernetes", verb="GET", version="v1"}
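To try to narrow down which resource and verb are actually driving the slow reads, I think a breakdown along these lines should work (untested sketch; the le="1" threshold just mirrors the runbook's example and 6h matches the short window):

# per-second rate of LIST/GET requests slower than 1s, ranked per resource/scope/verb
topk(10,
  sum by (resource, scope, verb) (rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"LIST|GET"}[6h]))
  -
  sum by (resource, scope, verb) (rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"LIST|GET",le="1"}[6h]))
)

That should surface the per-resource breakdown that the sum in the runbook query hides.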

On the etcd side, I've noticed quite a few "apply request took too long" warnings. I assume these slow applies could eventually translate into API error budget burn.

{"level":"warn","ts":"2022-05-26T07:13:20.661Z","caller":"etcdserver/util.go:166","msg":"apply request took too long","took":"115.948664ms","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"/registry/secrets/\" range_end:\"/registry/secrets0\" limit:500 ","response":"range_response_count:236 size:10629157"} 

According to https://etcd.io/docs/v3.3/faq/, this can be caused by network and/or disk IO. I analysed some of the network and disk IO; only the Azure node network IO looked alarming, and even that was somewhat expected.

Image - https://preview.redd.it/703nqn93jm291.png?width=929&format=png&auto=webp&s=c6e104d189c2d2a9ea44cd5a67eb67a4d606a83f
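On the disk side, these are the etcd latency metrics I plan to keep an eye on (a sketch; add a job/instance selector to match however the etcd pods are scraped, and the thresholds in the comments are only the rules of thumb from the etcd docs):

# p99 WAL fsync latency per member (etcd docs suggest keeping this below ~10ms)
histogram_quantile(0.99, sum by (le, instance) (rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])))

# p99 backend commit latency per member (guideline is roughly below 25ms)
histogram_quantile(0.99, sum by (le, instance) (rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])))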

Given how opaque the errors are, I'm starting to think it's a low-level hardware problem somewhere, but I haven't been able to pinpoint it.

allidoiswin10 commented 1 year ago

We've resolved this issue. It wasn't the runbook at fault.

If anyone's interested, the issue was an extension/aggregated API: the Calico API server. We had the Calico API server pod running in Azure and were using VXLAN cross-subnet encapsulation, which meant that when a control plane in the same VNet contacted the Calico API on a neighbouring Azure worker node, IP-in-IP was used. Azure's network does not support IP-in-IP traffic (see the Azure article on this). Once we changed the encapsulation to always use VXLAN, even within the subnet, the alert cleared. The problematic network setup led to failing kube API requests against the aggregated API, which in turn burned the error budget.
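In hindsight, a check along these lines might have pointed at the failing aggregated API sooner (a sketch; it assumes the apiserver's aggregator metrics are scraped under the same job label, and that requests proxied to the aggregated API are recorded under its API group):

# APIServices the aggregator currently considers unavailable
max by (name) (aggregator_unavailable_apiservice{job="apiserver"}) > 0

# request rate and result codes for the Calico aggregated API group
sum by (resource, code) (rate(apiserver_request_total{job="apiserver",group="projectcalico.org"}[6h]))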