kubernetes / kube-state-metrics

Add-on agent to generate and expose cluster-level metrics.
https://kubernetes.io/docs/concepts/cluster-administration/kube-state-metrics/
Apache License 2.0

Scrapes hang when deadlocked #2413

Closed Pokom closed 1 week ago

Pokom commented 3 weeks ago

What happened:

Occasionally, in larger clusters, kube-state-metrics will fail to be scraped by both Prometheus and Grafana's Alloy. When attempting to curl the /metrics endpoint, the response gets to a certain point (usually pod disruption budgets) and then just hangs. The only way to recover is to restart the impacted kube-state-metrics pods. This is similar to #995 and #1028, but the difference is that we don't have high pod churn. Our ksm deployment is sharded, and not all of the shards fail at the same time.

What you expected to happen: I would expect the server to eventually time out the connections, so that stalled clients don't block future scrapes.

How to reproduce it (as minimally and precisely as possible):

Run the tip of kube-state-metrics against a decently large cluster. With kube-state-metrics up and running, launch the following Go program, which emulates a client that fails to fully fetch the metrics:

package main

import (
    "log"
    "net/http"
    "sync"
    "time"
)

func main() {
    wg := sync.WaitGroup{}
    wg.Add(25)
    for i := 0; i < 5; i++ {
        for j := 0; j < 5; j++ {
            go func() {
                defer wg.Done()
                now := time.Now()
                c := http.Client{}
                req, err := http.NewRequest("GET", "http://localhost:8080/metrics", nil)
                if err != nil {
                    log.Printf("Error creating request: %s", err)
                    return
                }
                req.Header.Set("Range", "bytes=0-500")
                resp, err := c.Do(req)
                if err != nil {
                    log.Printf("Error sending request: %s", err)
                    return
                }
                // Intentionally never read or close resp.Body, e.g. via
                // _, _ = io.Copy(io.Discard, resp.Body)
                // so the server stays blocked writing the rest of the response.

                log.Printf("Downloaded %d bytes in %fs", resp.ContentLength, time.Since(now).Seconds())
            }()
        }
        time.Sleep(15 * time.Second)
    }
    wg.Wait()
    log.Println("Done, going to sleep for a bits")
    time.Sleep(1 * time.Minute)
}

After a few clients fail to close their response bodies, the /metrics endpoint stalls and curl returns only partial results. If you look at /debug/pprof/goroutine you'll notice that the number of goroutines keeps increasing, and you'll find a single goroutine blocking all the write goroutines; it looks like this:

goroutine 101 [sync.RWMutex.Lock, 17 minutes]:
sync.runtime_SemacquireRWMutex(0xc002454000?, 0xb0?, 0x0?)
    /usr/local/go/src/runtime/sema.go:87 +0x25
sync.(*RWMutex).Lock(0x1ab02c0?)
    /usr/local/go/src/sync/rwmutex.go:152 +0x6a
k8s.io/kube-state-metrics/v2/pkg/metrics_store.(*MetricsStore).Add(0xc00052e140, {0x1ab02c0, 0xc002454000})
    /go/src/k8s.io/kube-state-metrics/pkg/metrics_store/metrics_store.go:68 +0x74
k8s.io/kube-state-metrics/v2/pkg/metrics_store.(*MetricsStore).Update(0x1ab02c0?, {0x1ab02c0?, 0xc002454000?})
    /go/src/k8s.io/kube-state-metrics/pkg/metrics_store/metrics_store.go:86 +0x1d
k8s.io/client-go/tools/cache.watchHandler({0x0?, 0x0?, 0x2abb480?}, {0x1d6e550, 0xc004995b00}, {0x1d86780, 0xc00052e140}, {0x1d93688?, 0x1ab02c0}, 0x0, ...)
    /go/pkg/mod/k8s.io/client-go@v0.29.3/tools/cache/reflector.go:761 +0x91b
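
To make the chain of blocking easier to see, here is a minimal, self-contained sketch of the same sync.RWMutex pattern (the store, WriteAll, and Add names are illustrative stand-ins, not the actual kube-state-metrics code): a scrape holds the read lock while writing to a client that never reads, a reflector update then blocks on the write lock, and every later scrape queues behind that pending writer.

package main

import (
    "fmt"
    "sync"
    "time"
)

// store is a toy stand-in for the metrics store implied by the dump above:
// scrapes hold the read lock while streaming, updates take the write lock.
type store struct {
    mutex   sync.RWMutex
    metrics []string
}

// WriteAll emulates a scrape whose client never reads the response, so the
// read lock is held for the whole (stalled) write.
func (s *store) WriteAll(stall time.Duration) {
    s.mutex.RLock()
    defer s.mutex.RUnlock()
    time.Sleep(stall) // stands in for a write blocked on a slow client
}

// Add emulates the reflector's watch handler updating the store.
func (s *store) Add(m string) {
    s.mutex.Lock()
    defer s.mutex.Unlock()
    s.metrics = append(s.metrics, m)
}

func main() {
    s := &store{}

    go s.WriteAll(time.Hour) // a scrape stuck behind a client that never reads

    time.Sleep(100 * time.Millisecond)
    go s.Add("pod_info") // the writer now blocks on Lock behind the stalled reader

    time.Sleep(100 * time.Millisecond)
    done := make(chan struct{})
    go func() {
        s.WriteAll(0) // a fresh scrape: RLock queues behind the pending writer
        close(done)
    }()

    select {
    case <-done:
        fmt.Println("new scrape succeeded")
    case <-time.After(time.Second):
        fmt.Println("new scrape blocked: stalled reader -> pending writer -> all later readers")
    }
}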

Anything else we need to know?:

Here are goroutine dumps showing that all goroutines are stuck, blocked on reads: ksm-groutine-dump-2024-06-04-regular.txt ksm-groutine-dump-2024-06-04-debug-2.txt

Environment:

tkent commented 2 weeks ago

:wave: We're also seeing this on some very small clusters that we just upgraded to k8s 1.29 (3-4 nodes, less than 100 pods).

For us....

  1. It's happening on all versions we've tested (2.10.1 through 2.12.0)
  2. We didn't notice this happening until after the upgrade
  3. We don't have a clear cause for it yet, but the symptom is the same as described here and can be reproduced by just scraping in a for loop for a bit.

Maybe we've just been getting lucky, until now?

Pokom commented 2 weeks ago

@tkent I've created a PR (#2412) that I believe resolves the root cause. By adding timeouts to the metrics server's requests, it ensures that clients that hang are closed in a timely fashion and don't block other writes.
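
To illustrate the kind of change, here is a minimal sketch of server-side timeouts on a Go net/http server (the handler, address, and timeout values are placeholders, not the actual diff in #2412):

package main

import (
    "net/http"
    "time"
)

// metricsHandler is a placeholder for the real /metrics handler.
func metricsHandler() http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte("# placeholder metrics\n"))
    })
}

func main() {
    mux := http.NewServeMux()
    mux.Handle("/metrics", metricsHandler())

    srv := &http.Server{
        Addr:    ":8080",
        Handler: mux,
        // WriteTimeout bounds how long a single response can take, so a
        // client that stops reading is disconnected instead of pinning the
        // scrape goroutine (and the store's read lock) indefinitely.
        ReadHeaderTimeout: 5 * time.Second,
        WriteTimeout:      30 * time.Second,
        IdleTimeout:       60 * time.Second,
    }
    _ = srv.ListenAndServe()
}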

tkent commented 2 weeks ago

@Pokom - thanks for both the write up and addressing it in a PR!

I'm still puzzled about why we only started seeing this on clusters after upgrading them from 1.27 -> 1.29, and why it happens pretty much all the time (the pods are unusable for us in each cluster we've upgraded). But that's probably something ignorable about our configurations.

Anyway, just recording it here in case somebody else happens across it. Looking forward to your PR making it in!


Update

While our symptoms match up exactly with the issue described here, the frequency was much higher. It turned out that the cause of our hangs was different (a bit related, but different). During our 1.27 -> 1.29 k8s upgrades, we also upgraded our cilium installations, and a problem with the transparent IPsec feature in the new version caused larger transfers (those over about 200 KB) to frequently stall. The only thing regularly moving that much data in our small test clusters was the ksm pods, which is how we ended up here. For now, we've disabled cilium's IPsec feature and we're back to working again.

Pokom commented 2 weeks ago

@tkent Appreciate you confirming that you're experiencing the same symptoms I am! I'm running dozens of sharded ksm instances, so we probably have close to 100 pods running out in the wild, and it's very sporadic when it happens. Even when it occurs, it's usually only a few shards at once, never all of them. So I believe the problem has existed for quite some time; it's just rare enough that it appears transient.

dgrisonnet commented 2 weeks ago

/triage accepted
/help wanted

dgrisonnet commented 2 weeks ago

/assign @Pokom