kubernetes-monitoring / kubernetes-mixin

A set of Grafana dashboards and Prometheus alerts for Kubernetes.
Apache License 2.0

KubeAPIErrorBudgetBurn Alert Reason #615

Open d-m opened 3 years ago

d-m commented 3 years ago

Hello all,

I was hoping that someone might be able to help me understand why the KubeAPIErrorBudgetBurn alert (long: 3d, short: 6h) was firing.
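For context, if I'm reading the mixin correctly, the 3d/6h flavour of the alert evaluates something roughly like the following (the recording-rule names and the error-budget factor may differ between mixin versions, so treat this as a sketch):

```promql
# Fires when both the long (3d) and short (6h) burn rates exceed the error budget.
# The 0.01 comes from a 99% availability SLO; the exact factor depends on the mixin version.
sum(apiserver_request:burnrate3d) > (1.00 * 0.01000)
and
sum(apiserver_request:burnrate6h) > (1.00 * 0.01000)
```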

I reviewed the API Server dashboard and noticed that there were large spikes for an entry with no resource label:

[Screenshot: API server dashboard showing large spikes for an entry with no resource label]

The dashboard uses the query `cluster_quantile:apiserver_request_duration_seconds:histogram_quantile{verb="read"}`.
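Grouping the raw latency histogram by resource should make the missing label visible, since the per-series labels are kept instead of being averaged away. Something along these lines (the `job="apiserver"` selector and the grouping labels are assumptions based on a kube-prometheus-style setup):

```promql
# p99 read latency broken down by resource and verb; series where resource=""
# should stand out if the spikes really come from requests without a resource label.
histogram_quantile(0.99,
  sum by (resource, verb, le) (
    rate(apiserver_request_duration_seconds_bucket{job="apiserver", verb=~"LIST|GET"}[5m])
  )
)
```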

I also read through https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/464 and the very helpful runbook mentioned in a comment in that ticket. The only example query in the runbook that returned any results was the resource-scoped slow read request query, but the result didn't have a resource name, either:

[Screenshot: result of the resource-scoped slow read request query, showing an empty resource label]
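For anyone else digging into this, the resource-scoped slow-read check can be approximated as "total read rate minus reads faster than the SLO threshold", broken down by resource. The 1s `le` bucket below is an assumption; the actual threshold depends on the mixin version:

```promql
# Rate of resource-scoped reads slower than ~1s, per resource.
sum by (resource) (
  rate(apiserver_request_duration_seconds_count{job="apiserver", verb=~"LIST|GET", scope="resource"}[6h])
)
-
sum by (resource) (
  rate(apiserver_request_duration_seconds_bucket{job="apiserver", verb=~"LIST|GET", scope="resource", le="1"}[6h])
)
```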

Any suggestions for next steps would be appreciated.

Thanks.

metalmatze commented 3 years ago

You should be able to figure out the slow resource by removing the sum() so that you only have the rate, which won't aggregate the labels away anymore.
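For example, something like this (illustrative; adjust the verb selector and window to match the dashboard panel you are looking at):

```promql
# Without sum(), each series keeps its resource/scope/subresource/verb labels,
# so the slow resource shows up directly in the results.
rate(apiserver_request_duration_seconds_count{job="apiserver", verb=~"LIST|GET"}[6h])
```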

paulfantom commented 3 years ago

We might want to link to https://github.com/prometheus-operator/kube-prometheus/wiki/KubeAPIErrorBudgetBurn somewhere and/or improve it.

mihail-velikov commented 3 years ago

Hello everyone,

Since last week I also started getting this alert and I am pretty much clueless on how to proceed.

Our cluster is deployed with kubespray:

- Kubernetes 1.19.2
- OS: Ubuntu 20.04
- 3 masters: 4 CPU / 16 GB RAM
- 20 workers: 8 CPU / 64 GB RAM

Everything is hosted on premise with VMware as the underlying hypervisor and Calico as the network plugin, with VXLAN and IP-in-IP disabled. The master nodes are disabled for scheduling and thus run only the cluster components + etcd.

Looking at the API dashboard, I noticed that we have slow write SLI queries:

[Screenshot: API dashboard showing slow write SLI queries]

The two slow queries seem to be related to "ingress" and "pods". I checked the API server logs and saw that some "Patch" events for ingresses take a very long time. Example:

`I0826 10:17:07.864302 1 trace.go:205] Trace[162001560]: "Patch" url:/apis/extensions/v1beta1/namespaces/ews-int/ingresses/ews-int-redis-commander-generic,user-agent:kubectl/v1.21.0 (linux/amd64) kubernetes/cb303e6,client:172.17.42.247 (26-Aug-2021 10:16:59.314) (total time: 8549ms)`
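A breakdown like the following might help confirm which write paths are actually slow (the p99 quantile, 5m window, and `job="apiserver"` selector are just my starting assumptions):

```promql
# p99 write latency per resource and verb; ingresses and pods should stand out
# if they really are the slow write paths.
histogram_quantile(0.99,
  sum by (resource, verb, le) (
    rate(apiserver_request_duration_seconds_bucket{job="apiserver", verb=~"POST|PUT|PATCH|DELETE"}[5m])
  )
)
```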

I suspect that this is related to the old API endpoint "apis/extensions/v1beta1/", and I will double-check that by removing these specific ingresses. I have already checked node CPU/RAM usage on the masters and it is very low. I have also checked the etcd logs and there are no obvious issues - no slow queries, disk sync warnings, etc.
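Since the cluster is on 1.19, the apiserver should also expose `apiserver_requested_deprecated_apis`, which makes it possible to cross-check the deprecated-endpoint theory without grepping logs (assuming the metric is being scraped):

```promql
# Non-zero series here mean clients are still using the deprecated Ingress API.
apiserver_requested_deprecated_apis{group="extensions", version="v1beta1", resource="ingresses"}
```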

Regarding the slow pod write queries - I have no idea how to investigate this further, besides enabling "profiling" for the API server.
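One way to narrow the pod writes down without profiling might be to split them by verb and subresource (e.g. binding, status, eviction); a sketch:

```promql
# p99 latency of pod write requests by verb and subresource.
histogram_quantile(0.99,
  sum by (verb, subresource, le) (
    rate(apiserver_request_duration_seconds_bucket{job="apiserver", resource="pods", verb=~"POST|PUT|PATCH|DELETE"}[5m])
  )
)
```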

Any hints will be greatly appreciated.

Kind Regards, Mihail Velikov

mihail-velikov commented 3 years ago

Update: it seems that my suspicion was incorrect. We updated all ingress endpoints to the latest version of the API, but the problem persists. Additionally, I tried enabling profiling of the api-server, but it didn't add much more information about the slow requests to the logs.

povilasv commented 3 years ago

One approach is to use tracing, if you are running a newer Kubernetes version: https://kubernetes.io/blog/2021/09/03/api-server-tracing/