Closed: Pokom closed this issue 1 week ago.
:wave: We're also seeing this on some very small clusters that we just upgraded to k8s 1.29 (3-4 nodes, fewer than 100 pods). For us, it's happening with 2.10.1 through 2.12.0. (Maybe we've just been getting lucky, until now?)
@tkent I've created a PR (#2412) that I believe resolves the root cause. By adding timeouts to the metrics server's requests, it ensures that clients which hang are closed in a timely fashion and don't block other writes.
@Pokom - thanks for both the write-up and addressing it in a PR!
I'm still puzzled about why we only started seeing this on clusters after upgrading them from 1.27 -> 1.29, and why it happens pretty much all the time (the pods are unusable for us in each cluster we've upgraded). But that's probably something specific to our configurations.
Anyway, just recording it here in case somebody else happens across it. Looking forward to your PR making it in!
Update
While our symptoms match up exactly with the issue described here, the frequency was much higher. We found that the cause of our hangs turned out to be different (a bit related, but different). During our 1.27 -> 1.29 k8s upgrades, we also upgraded our cilium installations, and a problem with the transparent IPsec feature in the new version caused larger transfers (those over about 200K) to frequently stall. The only thing regularly doing communication of that size in our small test clusters was the ksm pods, and that's how we ended up here. For now, we've disabled cilium's IPsec feature and we're back to working again.
@tkent Appreciate you confirming that you're experiencing the same symptoms I am! I'm running dozens of sharded ksm instances, so we have probably close to 100 pods running out in the wild, and it's very sporadic when it happens. Even then, it's usually only a few shards at once, never all of them. So I believe the problem has existed for quite some time; it's just rare enough that it appears transient.
/triage accepted
/help wanted
/assign @Pokom
What happened:
Occasionally in larger clusters, kube-state-metrics will fail to be scraped by both Prometheus and Grafana's Alloy. When attempting to curl the /metrics endpoint, we'll get to a certain point (usually pod disruption budgets) and then it just hangs. The only way to recover is to restart the affected kube-state-metrics pods. This is similar to #995 and #1028, but the difference is that we don't have high pod churn. Our ksm deployment is sharded, and not all of the shards fail at the same time.
What you expected to happen:
I would expect the server to eventually time out the connections rather than let hung clients block future scrapes.
How to reproduce it (as minimally and precisely as possible):
Run the tip of kube-state-metrics and have it access a decently large cluster. With kube-state-metrics up and running, launch the following Go program, which is meant to emulate a client failing to fully fetch metrics.
After a few clients fail to close their body, accessing the /metrics endpoint will stall and you'll get partial results from curl. If you look at /debug/pprof/goroutines, you'll notice that the number of goroutines keeps increasing, and you'll find a single goroutine blocking all of the write goroutines.
Anything else we need to know?:
Here are goroutine dumps showing that all goroutines are stuck, blocked on reading:
- ksm-groutine-dump-2024-06-04-regular.txt
- ksm-groutine-dump-2024-06-04-debug-2.txt
Environment:
- kube-state-metrics version: v2.12.0
- Kubernetes version (use kubectl version): v1.28.7-gke.1026000