kubernetes / kube-state-metrics

Add-on agent to generate and expose cluster-level metrics.
https://kubernetes.io/docs/concepts/cluster-administration/kube-state-metrics/
Apache License 2.0

kube-state-metrics consuming too much memory #257

Closed: jac-stripe closed this issue 5 years ago

jac-stripe commented 6 years ago

kube-state-metrics is using >400 MB of RAM. It is also very slow when I query /metrics. The Kubernetes cluster has 2700 Job objects. It seems surprising that this would consume 400 MB of RAM for metrics aggregation. Below is a pprof top trace. This is running the latest git revision (d316c013fae8965bfb75bafda9453ca2ef54c48f).

(pprof) top
Showing nodes accounting for 526.72MB, 86.90% of 606.14MB total
Dropped 148 nodes (cum <= 3.03MB)
Showing top 10 nodes out of 110
      flat  flat%   sum%        cum   cum%
  195.01MB 32.17% 32.17%   202.01MB 33.33%  github.com/prometheus/client_golang/prometheus.makeLabelPairs
  101.26MB 16.71% 48.88%   148.26MB 24.46%  github.com/prometheus/client_golang/prometheus.(*Registry).Gather
   74.28MB 12.26% 61.13%    74.81MB 12.34%  k8s.io/kube-state-metrics/collectors.RegisterJobCollector.func1
      47MB  7.75% 68.89%       47MB  7.75%  github.com/prometheus/client_golang/prometheus.populateMetric
   27.60MB  4.55% 73.44%    30.60MB  5.05%  k8s.io/client-go/pkg/api/v1.codecSelfer1234.decSliceVolume
   23.01MB  3.80% 77.24%    23.01MB  3.80%  runtime.rawstringtmp
   18.97MB  3.13% 80.37%    19.55MB  3.22%  github.com/golang/protobuf/proto.(*Buffer).EncodeStringBytes
   15.50MB  2.56% 82.92%   217.51MB 35.88%  github.com/prometheus/client_golang/prometheus.NewConstMetric
   13.50MB  2.23% 85.15%    14.02MB  2.31%  runtime.mapassign
   10.58MB  1.74% 86.90%    12.71MB  2.10%  compress/flate.NewWriter
brancz commented 6 years ago

What does "very" slow mean? Up to 10 seconds response time wouldn't be unusual for a huge request.

For every Job object there are at least 12 metrics being reported, plus 20 metrics for each of the pods created by those Job objects: 2700 * 12 = 32,400 and 2700 * 20 = 54,000, so at minimum 86,400 lines of metrics per HTTP request. Those numbers actually don't seem too unreasonable, although we are aware of some inefficiencies in the Prometheus Go implementation that primarily drive these numbers up.

It might be worth checking, though, that you're not running into the same problem as reported here: https://github.com/kubernetes/kube-state-metrics/issues/112#issuecomment-328302552

julia-stripe commented 6 years ago

thanks so much @brancz! "very slow" means that kube-state-metrics' /metrics endpoint often doesn't respond to HTTP requests at all (even after 10 minutes or so).

can you say offhand what some of the inefficiencies of the Prometheus Go implementation are? That could help us debug.

brancz commented 6 years ago

The inefficiencies are that there are a number of allocations that could be optimized, but that wouldn't explain why the HTTP requests don't respond at all. The scalability tests that Google ran (https://github.com/kubernetes/kube-state-metrics/issues/124#issuecomment-318394185) had 1000 nodes and 30000 pods and responded within 9s, using ~1 GB of memory and 0.3 cores. The number of metrics in those tests should be far higher than in this case, so I suspect it might actually be something in the Job collector. The memory usage probably just ends up showing in the Prometheus client code because that's where we create the metrics, so memory profiles of alloc_space and inuse_space would be helpful. Could you take those with go tool pprof and share the bundles it drops in $HOME/pprof? If you analyze them yourself, even better :wink:.
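
For reference, one way to take those profiles, assuming the kube-state-metrics pod exposes Go's /debug/pprof endpoints and can be port-forwarded (the pod name and port below are placeholders):

$ kubectl port-forward kube-state-metrics-<pod-id> 8080:8080
$ go tool pprof -inuse_space http://localhost:8080/debug/pprof/heap
$ go tool pprof -alloc_space http://localhost:8080/debug/pprof/heap

pprof writes the fetched profile bundles to $HOME/pprof, which is what could be shared here.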

matthiasr commented 6 years ago

Are there a lot of Completed pods in the cluster? Deployments usually don't leave those around but I think Job does, or did back when I experimented with it.

kube-state-metrics essentially holds the whole cluster state in memory, so I can imagine that while these pods don't "do" anything they bring it to its knees.


brancz commented 6 years ago

Yep that's what I linked to earlier, it's my suspicion as well.

julia-stripe commented 6 years ago

Are there a lot of Completed pods in the cluster?

No, I double-checked and there are only 200 completed pods in the cluster (we configure kube to garbage-collect terminated pods to avoid exactly this problem).

andyxning commented 6 years ago

@julia-stripe Can you paste the log showing how many Job and Pod objects were scraped, as implemented in #254? We need to confirm the actual number of objects kube-state-metrics scraped.

brancz commented 6 years ago

Yes, what @andyxning mentioned would indeed be helpful, and then the full memory profiles for further analysis :slightly_smiling_face:.

jac-stripe commented 6 years ago

I have attached profiles for alloc_space [1] and inuse_space [2]. Increasing the memory allocated to 2 GB seems to help, giving a metrics response in ~25 seconds; however, this still seems like a lot of RAM for our cluster size. Below are some statistics on our cluster:

$ kubectl get pods --all-namespaces -a | wc -l
273
$ kubectl get jobs --all-namespaces -a | wc -l
2028
$ kubectl get cronjobs --all-namespaces -a | wc -l
193
brancz commented 6 years ago

That's definitely a lot. Could you give us the number of lines in the metrics response? It will tell us the number of time series being produced, which should be interesting given the response time and memory usage.

andyxning commented 6 years ago

@jac-stripe Thanks for your valuable info.

BTW, could you please paste the log about the number of collected objects from kube-state-metrics, in addition to the kubectl output? That logging was added in #254 and is meant mainly for debugging problems like this one.

jac-stripe commented 6 years ago

@brancz the response is 16,172 lines, totalling 2,226 KB. @andyxning here are some logs on collected objects from within the last few seconds:

I0919 16:55:56.928875       1 deployment.go:155] collected 3 deployments
I0919 16:55:56.928954       1 replicaset.go:119] collected 4 replicasets
I0919 16:55:56.933851       1 cronjob.go:125] collected 216 cronjobs
I0919 16:55:57.090494       1 job.go:149] collected 665 jobs
I0919 16:55:57.180204       1 pod.go:238] collected 272 pods
I0919 16:55:57.378936       1 daemonset.go:113] collected 1 daemonsets
I0919 16:55:57.379160       1 deployment.go:155] collected 3 deployments
I0919 16:55:57.379251       1 replicaset.go:119] collected 4 replicasets
I0919 16:55:57.379274       1 limitrange.go:97] collected 0 limitranges
I0919 16:55:57.379655       1 namespace.go:110] collected 4 namespaces
I0919 16:55:57.379686       1 statefulset.go:121] collected 0 statefulsets
I0919 16:55:57.379855       1 replicationcontroller.go:125] collected 0 replicationcontrollers
I0919 16:55:57.380030       1 service.go:101] collected 4 services
I0919 16:55:57.380050       1 persistentvolumeclaim.go:107] collected 0 persistentvolumeclaims
I0919 16:55:57.380063       1 resourcequota.go:95] collected 0 resourcequotas
I0919 16:55:57.382270       1 node.go:181] collected 20 nodes
I0919 16:55:57.480281       1 job.go:149] collected 665 jobs
I0919 16:55:57.581042       1 cronjob.go:125] collected 216 cronjobs
I0919 16:55:57.979816       1 pod.go:238] collected 272 pods
smarterclayton commented 6 years ago

We've got about 10x that number of resources and are seeing total heap around 2.2GB:

[graph: memory for KSM]

KSM /metrics has 929k series and is 102MB uncompressed.

smarterclayton commented 6 years ago

A couple of obvious ways to improve:

  1. switch to protobuf in the client (if it isn't already) for resources that support it
  2. switch to a flyweight pattern in the informers - create a shim object to place in the informer cache that contains only the fields necessary (and probably avoids maps, which are very expensive).
    1. Alternatively turn each object into a set of samples and cache those each time the object changes
brancz commented 6 years ago

@smarterclayton thanks for sharing your data and information! What is the response time for the call to the /metric endpoint?

  1. We haven't done any intentional switch to protobuf, but that seems like an easy first step. From what I can tell it's just a change to the content type passed to the REST client. How can I tell which types do not have protobuf support?

  2. I haven't seen the "flyweight" pattern in regard to the informers, would this still allow the use of protobuf if we leave out fields?

  3. If I understand this correctly, you mean that we turn the pattern around and based on the events emitted from the Kubernetes API build exactly the metrics we need instead of creating them on every request? I think that's generally a good idea, just gets more complicated in regard to staleness of objects, but should still be manageable.

For the first two I can definitely see the returns; the third option seems like a nice-to-have, but a lot of work at this point with an unknown result.

brancz commented 6 years ago

@jac-stripe a big chunk of memory in use and allocations seem to be coming from the addConditionMetrics call from the job collector. I'm not too familiar with the Job object, what are typical entries in the Job.Status.Conditions array? It seems a large chunk is coming from this loop. It's odd because the Pod collector has much less memory in use and allocated overall and the ratio of Pod collector memory and Job collector memory doesn't match the object count ratio you shared.

smarterclayton commented 6 years ago

How can I tell, which types do not have protobuf support?

You can pass multiple Accept headers and everything should just work for types that don't have protobuf, like CRDs or custom API extensions, i.e. Accept: application/vnd.kubernetes.protobuf, application/json
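
As an illustration of that change (a sketch, not the exact patch that became #264), configuring a client-go rest.Config to prefer protobuf with a JSON fallback could look like this:

package ksmclient

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func newClient() (*kubernetes.Clientset, error) {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return nil, err
	}
	// Prefer protobuf, fall back to JSON for types that don't support it
	// (e.g. CRDs / custom API extensions).
	cfg.AcceptContentTypes = "application/vnd.kubernetes.protobuf,application/json"
	cfg.ContentType = "application/vnd.kubernetes.protobuf"
	return kubernetes.NewForConfig(cfg)
}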

I haven't seen the "flyweight" pattern in regard to the informers

Basically transform the object coming in on the watch call into something simpler in the cache.ListWatch. So you call upstream API and get a list of pods, then you transform it into something simpler. You can make the objects "fake" API objects in most cases.

https://github.com/openshift/origin/blob/master/pkg/image/controller/trigger/cache.go#L50 takes arbitrary kube objects and uses an adapter to turn them into a *trigger.cacheEntry which is a uniform value that then goes into the store. You can also do the conversion at cache.ListWatch time but you'll have to use api.List or so.
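
A minimal sketch of that flyweight idea for the Job collector (slimJob and toSlimJob are hypothetical names, using the current k8s.io/api import path, not part of kube-state-metrics):

package collectors

import (
	batchv1 "k8s.io/api/batch/v1"
)

// slimJob keeps only the fields the Job collector turns into metrics,
// instead of caching the full Job object with all its maps and metadata.
type slimJob struct {
	Name       string
	Namespace  string
	Active     int32
	Succeeded  int32
	Failed     int32
	Conditions []batchv1.JobCondition
}

// toSlimJob would run on every object coming in from the list/watch, so
// only the slim representation ends up in the store.
func toSlimJob(j *batchv1.Job) *slimJob {
	return &slimJob{
		Name:       j.Name,
		Namespace:  j.Namespace,
		Active:     j.Status.Active,
		Succeeded:  j.Status.Succeeded,
		Failed:     j.Status.Failed,
		Conditions: j.Status.Conditions,
	}
}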

If I understand this correctly, you mean that we turn the pattern around and based on the events

Right - I'd call this maintaining an index via the reflector rather than maintaining a cache store. So ListWatch, instead of returning the stripped down objects, actually returns a list of metrics for that object. You can wrap both the list and the watch.

We've talked about doing that transform lower down - so you'd be able to pass a cache.ListWatch a Transformer that takes arbitrary valid objects and turns them into what goes into the store - but we haven't done it yet.

brancz commented 6 years ago

I created #264 to start using protobuf. That should be a quick win, but eventually all of the above are improvements we probably want to make.

I'm still having difficulty understanding your second point though, probably because I don't know protobuf well enough or the way it's used in Kubernetes. I'm wondering how parsing a subset of the fields of a protobuf message works - what would the proto definitions look like? And that doesn't influence the data size transferred on the wire, right?

andyxning commented 6 years ago

Sorry for not following up on all the comments above, but a quick question: IIUC, protobuf can only decrease the data size between the apiserver and kube-state-metrics, so the sync time will decrease, but the actual memory used by kube-state-metrics should not decrease.

Correct me if I am wrong.

brancz commented 6 years ago

Agreed, nonetheless an improvement we should have.

andyxning commented 6 years ago

Yep, let's make the readily available improvements first.

andyxning commented 6 years ago

And that doesn't influence the data size transferred on wire right?

If protobuf supported functionality like graphsql, we could request only the fields we use, and the response size would definitely be largely reduced.

smarterclayton commented 6 years ago

You're using an informer, right? Informers are an in-memory cache of the full object. It's possible to transform what you get from the API server on your own into a smaller in-memory representation that can be queried later, or simply do the transformation for each event and then sum later.

Kubernetes does not support partial field retrieval like you are describing (and likely never will from the server). Use versioned apis and protobuf to fetch the object, then keep only the fields you need. We may in the future add protobuf "slice" decoding that lets you skip whole tracts of the object but not in the near term.


fejta-bot commented 6 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta. /lifecycle stale

brancz commented 6 years ago

This still has to be addressed in a better way than we do today.

/remove-lifecycle stale

stevenmccord commented 6 years ago

I have one that is taking upwards of 12Gi of memory.

$ kubectl get pods --all-namespaces -a | wc -l
     227
$ kubectl get jobs --all-namespaces -a | wc -l
     381
$ kubectl get cronjobs --all-namespaces -a | wc -l
     116

I am currently on this version: FROM quay.io/coreos/kube-state-metrics:v1.2.0

I0404 13:44:41.963104       1 persistentvolumeclaim.go:112] collected 13 persistentvolumeclaims
I0404 13:44:41.963142       1 replicationcontroller.go:130] collected 0 replicationcontrollers
I0404 13:44:41.963193       1 limitrange.go:101] collected 1 limitranges
I0404 13:44:42.062546       1 namespace.go:113] collected 3 namespaces
I0404 13:44:42.062874       1 daemonset.go:136] collected 2 daemonsets
I0404 13:44:42.162730       1 persistentvolume.go:99] collected 13 persistentvolumes
I0404 13:44:42.162850       1 resourcequota.go:99] collected 0 resourcequotas
I0404 13:44:35.766039       1 node.go:186] collected 11 nodes
I0404 13:44:42.262993       1 pod.go:246] collected 199 pods
I0404 13:44:42.462680       1 endpoint.go:120] collected 54 endpoints
I0404 13:44:42.462735       1 service.go:112] collected 51 services
I0404 13:44:42.462956       1 node.go:186] collected 11 nodes
I0404 13:44:42.464009       1 replicaset.go:124] collected 715 replicasets
I0404 13:44:42.464576       1 persistentvolume.go:99] collected 13 persistentvolumes
I0404 13:44:42.663174       1 replicationcontroller.go:130] collected 0 replicationcontrollers
I0404 13:44:42.862684       1 limitrange.go:101] collected 1 limitranges
I0404 13:44:42.862818       1 namespace.go:113] collected 3 namespaces
I0404 13:44:42.862963       1 daemonset.go:136] collected 2 daemonsets
I0404 13:44:43.063073       1 resourcequota.go:99] collected 0 resourcequotas
I0404 13:44:43.163197       1 daemonset.go:136] collected 2 daemonsets
I0404 13:44:43.262674       1 pod.go:246] collected 199 pods
I0404 13:44:43.263091       1 node.go:186] collected 11 nodes
I0404 13:44:43.263327       1 replicaset.go:124] collected 715 replicasets
I0404 13:44:43.264189       1 endpoint.go:120] collected 54 endpoints
I0404 13:44:43.562444       1 statefulset.go:147] collected 5 statefulsets
I0404 13:44:43.562569       1 limitrange.go:101] collected 1 limitranges
I0404 13:44:43.562645       1 namespace.go:113] collected 3 namespaces
I0404 13:44:43.562677       1 deployment.go:160] collected 43 deployments

It tends to spike overnight when I am running a lot of Kubernetes jobs (~100). Then it settles down after those jobs are done, but in the meantime it spikes to very high memory usage. I am probably doing something incorrectly here, but I just wanted to see if others are having this issue.

discordianfish commented 6 years ago

Unfortunately I don't have much to add besides confirming this issue. I realized there appear to be no metrics about kube-state-metrics itself, besides the process collector. I guess it would be useful to have metrics for the number of registered metrics and values, and durations for the collection.

brancz commented 6 years ago

There is a separate port where metrics about the state of kube-state-metrics itself are exposed; I'm more than happy to add more metrics there along the lines of what @discordianfish described.

stevenmccord commented 6 years ago

Is there anything we can do to reduce the memory consumption, or is there no known resolution yet? Thanks!

brancz commented 6 years ago

You can turn off the collectors you don't need, but generally kube-state-metrics needs to hold in memory the data it exposes; there's no way around that. kube-state-metrics is essentially an in-memory representation of etcd, so in larger clusters that easily gets large.
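
For example, something along these lines in the container args (flag names as of the 1.x releases; double-check the README for your version):

/kube-state-metrics --collectors=pods,jobs,cronjobs,nodes,deployments \
  --port=8080 --telemetry-port=8081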

discordianfish commented 6 years ago

@brancz but the 12Gi @stevenmccord reported are not expected I assume?

andyxning commented 6 years ago

@discordianfish As stated by @brancz , a --telemetry-port has been added to kube-state-metrics and it is available from v1.2.0-rc.0 release. The default value for it is 8081 if you use the quay.io or gcr.io image.

I guess it would be useful to have metrics for number of registered metrics+values and durations for the collection.

We are glad to learn about users' requirements for kube-state-metrics' self-metrics. If you have advice on this, feel free to tell us, or directly make a PR to enhance it.

andyxning commented 6 years ago

@stevenmccord

Then it settles down after those jobs are done, but it spikes to a very high memory usage.

Are you running kube-state-metrics v1.2.0-rc.0 or later? If so, could you please paste kube-state-metrics' own metrics alongside its memory usage?

discordianfish commented 6 years ago

ksm_resources_per_scrape_sum looks suspicious. It sounds like it should reflect the number of resources each scrape observed, but it's ever increasing until it runs out of memory, at which point it reports e.g. almost 25M replicasets.

discordianfish commented 6 years ago

Ah never mind, because it's a summary. But why is it a summary? Seems to me this should be a Gauge, no?

stevenmccord commented 6 years ago

Interestingly enough, I set the CPU to 1 and the memory went way down. I am still monitoring, but it looks like that solved my problem.

brancz commented 6 years ago

@stevenmccord this could be due to the prometheus client library being somewhat inefficient at rendering the metrics output.

@discordianfish a single time series would probably work similarly well; the summary is interesting because multiple Prometheus servers may be scraping the cluster-level metrics, and each scrape would set the gauge to a new value. With the summary we just don't miss any resource counts across consecutive scrapes. Might be overkill though.
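
For what it's worth, a rough sketch of the two options with the Prometheus Go client (the metric names here are placeholders, not the actual kube-state-metrics telemetry metrics):

package telemetry

import "github.com/prometheus/client_golang/prometheus"

var (
	// Option 1: a gauge overwritten on every collection; a scraping
	// Prometheus server only ever sees the latest value.
	resourcesGauge = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{Name: "ksm_resources_example", Help: "Resources seen in the last collection."},
		[]string{"resource"},
	)

	// Option 2: a summary that observes every collection; its _sum and
	// _count keep growing, so no collection is lost between scrapes.
	resourcesSummary = prometheus.NewSummaryVec(
		prometheus.SummaryOpts{Name: "ksm_resources_per_scrape_example", Help: "Resources observed per collection."},
		[]string{"resource"},
	)
)

// recordCollection would be called once per collection cycle.
func recordCollection(resource string, count int) {
	resourcesGauge.WithLabelValues(resource).Set(float64(count))
	resourcesSummary.WithLabelValues(resource).Observe(float64(count))
}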

discordianfish commented 6 years ago

@brancz I see. Well, the numbers look reasonable to me.

In general, kube-state-metrics replicates the etcd state and counts + returns it when scraped, right?

brancz commented 6 years ago

Yes, I've been thinking of something in that direction, but it's quite a large revamp. Instead of using the Kubernetes informers' data store, we would only use the Prometheus metrics registry as the "storage". Based on the Kubernetes events we can populate the metrics in the registry and even remove them when delete events are received. Similar to what informers do on connection loss, we can completely rebuild the metrics registry. It might even be worth rotating the metrics registry every interval to ensure that, even if events are missed, we resync regularly. The nice thing about this is that we can 1) build optimized structs to unmarshal into upon receiving events, so we don't perform unnecessary time parsing, and 2) minimize the total state held.
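
A rough sketch of that event-driven direction, using client-go informer event handlers to keep a Prometheus registry up to date (the metric and function names are placeholders, not the eventual kube-state-metrics implementation):

package store

import (
	"github.com/prometheus/client_golang/prometheus"
	batchv1 "k8s.io/api/batch/v1"
	"k8s.io/client-go/tools/cache"
)

var kubeJobInfo = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "kube_job_info_example", // placeholder metric name
		Help: "Example metric kept up to date from watch events.",
	},
	[]string{"namespace", "job_name"},
)

// registerJobHandlers wires informer events straight into the metrics
// registry, so the registry is the only state held instead of a full
// object cache. prometheus.MustRegister(kubeJobInfo) is assumed elsewhere.
func registerJobHandlers(informer cache.SharedInformer) {
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			j := obj.(*batchv1.Job)
			kubeJobInfo.WithLabelValues(j.Namespace, j.Name).Set(1)
		},
		UpdateFunc: func(_, newObj interface{}) {
			j := newObj.(*batchv1.Job)
			kubeJobInfo.WithLabelValues(j.Namespace, j.Name).Set(1)
		},
		DeleteFunc: func(obj interface{}) {
			// Real code would also handle cache.DeletedFinalStateUnknown here.
			if j, ok := obj.(*batchv1.Job); ok {
				kubeJobInfo.DeleteLabelValues(j.Namespace, j.Name)
			}
		},
	})
}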

discordianfish commented 6 years ago

@brancz Nice, yeah, that's how I'd imagine it. Sounds like a good plan.

discordianfish commented 6 years ago

More and more of my scrapes time out now too. I just increased the scrape timeout to 30s and still about half of the scrapes time out, though some finish in ~4 seconds.

andyxning commented 6 years ago

@caesarxuchao @smarterclayton Is there any plan for client-go to support something like what graphSQL does, i.e., respond with only the needed content?

smarterclayton commented 6 years ago

No, no plan for that. Calculations like that are expensive on the server side, and likely wouldn't give a speed up here where the metrics client needs to do its own processing afterwards (protobuf is something like 4-7x more efficient in Kube than JSON in general).


brancz commented 6 years ago

GraphQL is also unlikely to help a lot with memory consumption in the above architecture, as it only reduces what's sent, not what's stored.

andyxning commented 6 years ago

Sorry for misspelling GraphQL. :)

Using GraphQL should help decrease memory, as we could request and store only the needed fields for all objects instead of storing all the fields.

protobuf is something like 4-7x more efficient in Kube than JSON in general

IMO, this refers to the transport efficiency instead of memory consumption.

discordianfish commented 6 years ago

Should we maybe open another issue to track the response time and memory usage issues separately?

While the memory usage is unfortunate, it's IMO not very urgent; with a 1Gi limit it's stable here.

The timeouts on the other hand are a real production concern.

discordianfish commented 6 years ago

Interesting! I can't reproduce this locally:

Even if I pull the same Docker image I run on the cluster and pass in my admin kubeconfig, the request returns in 0.15-0.25s, returning the same results as on the cluster. This even goes via the public internet from Europe to the US, so it's actually surprisingly fast.

When running wget in my prometheus container though, it takes ~2-5s, sometimes it times out after 30s. Same result when running curl in a dedicated pod in the same cluster.

Since I limit cpu=1 and even have CPUManager enabled, I doubt it's a CPU performance issue.

Smells like some networking issue, but this is a production cluster which serves other traffic just fine. Though since the metrics response is quite big, maybe it triggers some corner case? Thinking MTU, etc.

Either way, I've enabled block profiles and deployed this. Maybe that helps isolate this.
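
For reference, enabling Go block profiles looks roughly like this (a sketch; the rate and port are examples):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof handlers on the default mux
	"runtime"
)

func main() {
	// Sample every blocking event; a higher rate (i.e. less sampling) is
	// usually preferable in production.
	runtime.SetBlockProfileRate(1)

	// The block profile is then available at /debug/pprof/block.
	log.Fatal(http.ListenAndServe(":6060", nil))
}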

discordianfish commented 6 years ago

Classic heisenbug. After deploying my changes to enable goroutine block profiles, all problems were gone. Maybe due to me building with go1.10.1 while ksm 1.3 was built with 1.10?

discordianfish commented 6 years ago

@jac-stripe / @julia-stripe: Can you try 1.3.1? I just deployed it and, as suspected, it looks like the Go version upgrade fixed it.

gades commented 6 years ago

@discordianfish Version 1.3.1 doesn't solve the issue; I still hit context deadline exceeded or OOMKilled and need to adjust the resource parameters.

[screenshot: 2018-05-23 11:08]
kubectl get pods --all-namespaces -a | wc -l
996

For me, it starts working with the following parameters: 2800Mi of memory and 2100m of CPU.

[screenshot: 2018-05-23 13:46]