kubernetes / kube-state-metrics

Add-on agent to generate and expose cluster-level metrics.
https://kubernetes.io/docs/concepts/cluster-administration/kube-state-metrics/
Apache License 2.0

Kube-state-metrics 20x spikes in memory usage at restart #2302

Closed zhoujoetan closed 4 months ago

zhoujoetan commented 5 months ago

What happened: A few of our kube-state-metrics instances (single-instance, no sharding) recently had OOM issues after restart. Memory usage spiked up to 2.5 GB (see attachment) for a few minutes before stabilizing at 131 MB. We tried increasing the CPU limit from the default 0.1 to 1 or even 5, but it did not seem to help much.

Here is the pprof profile I captured:

File: kube-state-metrics
Type: inuse_space
Time: Jan 11, 2024 at 4:12pm (PST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 1503.72MB, 99.10% of 1517.33MB total
Dropped 50 nodes (cum <= 7.59MB)
Showing top 10 nodes out of 29
      flat  flat%   sum%        cum   cum%
  753.11MB 49.63% 49.63%   753.11MB 49.63%  io.ReadAll
  748.12MB 49.30% 98.94%   748.12MB 49.30%  k8s.io/apimachinery/pkg/runtime.(*Unknown).Unmarshal
    2.49MB  0.16% 99.10%     8.99MB  0.59%  k8s.io/kube-state-metrics/v2/pkg/metrics_store.(*MetricsStore).Add
         0     0% 99.10%   753.11MB 49.63%  io/ioutil.ReadAll (inline)
         0     0% 99.10%   749.71MB 49.41%  k8s.io/apimachinery/pkg/runtime.WithoutVersionDecoder.Decode
         0     0% 99.10%   749.71MB 49.41%  k8s.io/apimachinery/pkg/runtime/serializer/protobuf.(*Serializer).Decode
         0     0% 99.10%     8.99MB  0.59%  k8s.io/apimachinery/pkg/util/wait.BackoffUntil
         0     0% 99.10%     8.99MB  0.59%  k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
         0     0% 99.10%  1502.32MB 99.01%  k8s.io/client-go/kubernetes/typed/core/v1.(*configMaps).List
         0     0% 99.10%   753.61MB 49.67%  k8s.io/client-go/rest.(*Request).Do

Looks like heap memory usage does not represent 100% of the container_memory_usage_bytes metric.
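
For context on where that memory goes: the hot frames above are client-go's initial LIST of ConfigMaps, which reads the entire response body and unmarshals every object at once. A minimal Go sketch of that call path (illustrative only, not kube-state-metrics' actual code) looks like this:

package main

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

func main() {
    // In-cluster config, as kube-state-metrics itself would use.
    cfg, err := rest.InClusterConfig()
    if err != nil {
        panic(err)
    }
    client, err := kubernetes.NewForConfig(cfg)
    if err != nil {
        panic(err)
    }

    // Listing across all namespaces pulls every ConfigMap, including any very
    // large ones, into memory in a single response; this is where the
    // io.ReadAll and runtime.(*Unknown).Unmarshal allocations come from.
    cms, err := client.CoreV1().ConfigMaps(metav1.NamespaceAll).List(context.TODO(), metav1.ListOptions{})
    if err != nil {
        panic(err)
    }
    fmt.Printf("decoded %d ConfigMaps in one response\n", len(cms.Items))
}

On a cluster with many or very large ConfigMaps, that single response is what drives the startup spike, which is consistent with usage settling back to 131 MB once the initial sync is done.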

What you expected to happen: memory usage not to spike 20x at restart.

How to reproduce it (as minimally and precisely as possible): Kill/restart the KSM pod.

Anything else we need to know?:

Environment:

kube-state-metrics version: v2.3.0

[attachment: memory usage screenshot]

dgrisonnet commented 5 months ago

/triage accepted
/assign @rexagod

mindw commented 4 months ago

@zhoujoetan try excluding ConfigMaps and Secrets from the list of exported resources (the --resources= command line option). At least for me it dropped initial memory usage from ~400 MiB to 24 MiB.

Both the CLI and the Helm chart include them by default.

In my case, Helm chart history (Helm stores release manifests in Secrets by default) was the main culprit for the high memory usage. I could not confirm whether pagination is used, which might mitigate this issue.
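
As a rough illustration of the pagination being referred to, a client can bound how much of the list is decoded at once by using Limit and Continue. This is only a sketch, not what kube-state-metrics currently does; the package and function names are made up, and the caller is assumed to already have a client-go clientset:

// Package paging is a hypothetical name for this illustrative snippet.
package paging

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// listConfigMapsPaged fetches ConfigMaps in pages of at most 500 objects,
// so only one page has to be buffered and decoded in memory at a time.
func listConfigMapsPaged(ctx context.Context, client kubernetes.Interface) (int, error) {
    total := 0
    opts := metav1.ListOptions{Limit: 500}
    for {
        page, err := client.CoreV1().ConfigMaps(metav1.NamespaceAll).List(ctx, opts)
        if err != nil {
            return total, err
        }
        total += len(page.Items)
        if page.Continue == "" {
            // No continue token: this was the last page.
            return total, nil
        }
        opts.Continue = page.Continue // resume from the server-side cursor
    }
}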

Hope this helps.

rexagod commented 4 months ago

kube-state-metrics version: v2.3.0

@zhoujoetan It seems you're on an outdated version that's no longer supported. Could you switch to one of the supported versions (preferably the latest release) and verify this issue still persists for you?

zhoujoetan commented 4 months ago

I have figured out the issue. We had a ton of ConfigMap objects that KSM read at startup. Trimming those objects brought the memory usage back down. I am closing the issue now.
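
For anyone else debugging this, a rough way to find which ConfigMaps are worth trimming is to sort them by payload size. The helper below is an illustrative sketch (package and function names are hypothetical, and the caller is assumed to provide a client-go clientset), not something from kube-state-metrics itself:

// Package inspect is a hypothetical name for this illustrative helper.
package inspect

import (
    "context"
    "fmt"
    "sort"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// dataSize approximates the payload size of a ConfigMap in bytes.
func dataSize(cm corev1.ConfigMap) int {
    n := 0
    for _, v := range cm.Data {
        n += len(v)
    }
    for _, v := range cm.BinaryData {
        n += len(v)
    }
    return n
}

// PrintLargestConfigMaps lists all ConfigMaps and prints the largest ones,
// which are the first candidates for trimming or excluding.
func PrintLargestConfigMaps(ctx context.Context, client kubernetes.Interface, top int) error {
    cms, err := client.CoreV1().ConfigMaps(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
    if err != nil {
        return err
    }
    sort.Slice(cms.Items, func(i, j int) bool {
        return dataSize(cms.Items[i]) > dataSize(cms.Items[j])
    })
    for i, cm := range cms.Items {
        if i >= top {
            break
        }
        fmt.Printf("%s/%s: %d bytes\n", cm.Namespace, cm.Name, dataSize(cm))
    }
    return nil
}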

nalshamaajc commented 3 months ago

@zhoujoetan When you say ConfigMap objects, do you mean cluster-wide or was it something specific? Also, when you say trimming, was it deleting unwanted ConfigMaps, or removing data from them?