Runbook KB - https://github.com/prometheus-operator/runbooks/blob/main/content/runbooks/kubernetes/KubeAPIErrorBudgetBurn.md
I understand this alert essentially indicates that our Kube APIServer either has some erroring requests or has responses that are taking too long. The Kube APIServer has a "budget" of errors that it is allowed; the concept behind the burn rate (how quickly a service consumes its error budget: the higher the burn rate, the faster the budget is eaten) comes from https://sre.google/workbook/alerting-on-slos/#recommended_time_windows_and_burn_rates_f.
The runbook KB suggests that this alert's window combination (long: 3d, short: 6h) is the least concerning variant of KubeAPIErrorBudgetBurn, but still worth investigating.
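For reference, the long: 3d / short: 6h variant is the multiwindow burn-rate check sketched below, in the style of kubernetes-mixin. The recording-rule names and factors here are from memory and may differ between mixin versions; the low factor is what makes this the least aggressive window pair.

# Sketch of the long: 3d / short: 6h (warning-severity) variant;
# exact factors depend on the configured SLO and mixin version.
sum(apiserver_request:burnrate3d) > (1.00 * 0.01000)
and
sum(apiserver_request:burnrate6h) > (1.00 * 0.01000)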
I followed the queries in the KB and the only one that returns anything is the resource-scoped read requests query. Removing the sum from the example PromQL shows:
sort(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="1",scope=~"resource|",verb=~"LIST|GET"}[3d]))
{component="apiserver", endpoint="https", group="coordination.k8s.io", instance="10.60.16.137:6443", job="apiserver", le="1", namespace="default", resource="leases", scope="resource", service="kubernetes", verb="GET", version="v1"}{component="apiserver", endpoint="https", group="coordination.k8s.io", instance="10.60.32.137:6443", job="apiserver", le="1", namespace="default", resource="leases", scope="resource", service="kubernetes", verb="GET", version="v1"}
On the etcd side, I've noticed quite a few "apply request took too long" warnings. I assume the long apply requests could lead to eventual API budget burn.
{"level":"warn","ts":"2022-05-26T07:13:20.661Z","caller":"etcdserver/util.go:166","msg":"apply request took too long","took":"115.948664ms","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"/registry/secrets/\" range_end:\"/registry/secrets0\" limit:500 ","response":"range_response_count:236 size:10629157"}
According to https://etcd.io/docs/v3.3/faq/, this could be related to network and/or disk IO. I analysed some of the network and disk IO; only the Azure node network IO looked alarming, and even that was somewhat expected.
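For anyone following along, etcd's own latency histograms are the usual way to check disk and network IO from Prometheus; something roughly like the following (illustrative, not necessarily exactly what I ran):

# p99 etcd disk latency (WAL fsync, backend commit) and peer round-trip time
histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))
histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m]))
histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket[5m]))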
Given how opaque the errors are, I'm starting to think it's a low-level hardware problem somewhere. However, I've not been able to pinpoint it.
We've resolved this issue. It wasn't the runbook at fault.
If anyone's interested, the issue was an extension/aggregated API: the Calico API. We had a Calico API pod running in Azure and were using VXLAN cross-subnet encapsulation, which means that when a control plane node within the same VNet contacts the Calico API on a neighbouring Azure worker node, IP-in-IP is used. Azure does not support that kind of Layer 2 setup (Azure article). Once we changed the encapsulation to always use VXLAN, even within a subnet, this alert was resolved. The problematic network setup led to bad kube API requests, which in turn led to budget burn.
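For context, the fix boils down to the Calico IPPool encapsulation settings; roughly the following (the pool name and CIDR are placeholders, not our actual values):

# Illustrative Calico IPPool; the relevant change is vxlanMode: CrossSubnet -> Always
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: default-ipv4-ippool
spec:
  cidr: 192.168.0.0/16
  natOutgoing: true
  vxlanMode: Always   # was CrossSubnet
  ipipMode: Never     # ensure IP-in-IP is never used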
Reading KubeAPIErrorBudgetBurn.md, it's not very clear what the end user should do once they've run the slow read requests queries.
For example, our cluster is alerting KubeAPIErrorBudgetBurn, and running the queries in the markdown shows results for the resource-scoped query. What's next?
Is there a query we can run to pinpoint what's causing the issue? The Grafana API dashboard isn't overly clear. I've checked Grafana for compute usage on all the kube-system pods and none are hitting their limits. Similarly, all the control plane hosts are fine.
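For example, something like the query below (my own rough attempt, not an official one) breaks failing requests down by API group, resource, verb, and response code; in hindsight, a breakdown like this might have surfaced the aggregated Calico API group much earlier:

# 5xx request rate by group/resource/verb/code - useful for spotting a
# misbehaving (aggregated) API
sum by (group, resource, verb, code) (
  rate(apiserver_request_total{job="apiserver",code=~"5.."}[30m])
)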
Logs from API server:
etcd pod logs: I used etcdctl to check the etcd health of the cluster (we're running a stacked etcd cluster) and all members reported healthy.
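The health check itself was just the usual etcdctl commands, roughly as below (endpoint and certificate flags omitted):

# run from an etcd pod, with ETCDCTL_API=3 and the appropriate
# --endpoints/--cacert/--cert/--key flags for your cluster
etcdctl endpoint health --cluster
etcdctl endpoint status --cluster -w table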