Closed IvanUkhov closed 4 months ago
They shouldn't be hanging around in memory indefinitely. What's more likely is that the histograms aren't being drained often enough.
Are you running the Prometheus exporter in the default scrape mode? If so, how often are you scraping the endpoint? Doing a scrape drives the logic that moves histogram values from their AtomicBucket in the Registry into the condensed form in the exporter state.
This is something we've addressed in the latest (0.14.0) release where they'll be drained every five seconds by default.
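For reference, configuring that periodic drain might look roughly like the sketch below. The `upkeep_timeout` builder method and the five-second default are assumptions based on this thread; check the metrics-exporter-prometheus 0.14 docs before relying on them.

```rust
use std::time::Duration;

use metrics_exporter_prometheus::PrometheusBuilder;

fn main() {
    // Assumed API: `upkeep_timeout` sets how often the background upkeep task
    // drains histogram buckets into the exporter state (said to default to 5s).
    PrometheusBuilder::new()
        .upkeep_timeout(Duration::from_secs(5))
        .install()
        .expect("failed to install Prometheus exporter");
}
```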
Are you running the Prometheus exporter in the default scrape mode? If so, how often are you scraping the endpoint?
I am on Kubernetes Engine with Autopilot, and I have not done much customization. I simply make the exporter listen on a dedicated port via PrometheusBuilder
PrometheusBuilder::new().with_http_listener(address).install()
and have the endpoint scraped every 30 seconds:
---
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
name: monitoring
spec:
selector:
matchLabels:
instrumentation: service
endpoints:
- port: instrumentation
interval: 30s
So the scraping is supposed to happen every 30 seconds, but the memory-consumption curves keep on growing for hours and days until the service happens to be redeployed. And metrics do show up in Cloud Monitoring; it is not that they are not collected.
Doing a scrape drives the logic that moves histogram values from their AtomicBucket in the Registry, into the condensed form in the exporter state.
Yeah, I would have expected no raw-value accumulation inside, only direct bucketization. Now I know that this is not the case. The example I gave is then invalid, as I was running it locally without scraping. I thought it would not matter. I observed the same behavior as in the cluster and concluded that was it.
I will try to do some local scraping to confirm.
So, if scraping is happening in your real workload and memory is still increasing, then based on your code snippet, it sounds like maybe you're just accumulating idle metrics over time?
This is common with metrics that have unbounded cardinality, such as tags based on request IDs or user IDs or any sort of value that has a high cardinality in general.
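To make the cardinality point concrete, here is a toy, stdlib-only stand-in for a registry (not the metrics-rs implementation): a static label set stays at a single entry no matter how many requests arrive, while a per-request-ID label creates one entry per request, so the registry grows without bound.

```rust
use std::collections::HashMap;

// Toy stand-in for an exporter registry (NOT the metrics-rs implementation):
// one entry per unique (metric name, label set) pair.
fn simulate(requests: usize) -> (usize, usize) {
    let mut bounded: HashMap<(String, String), u64> = HashMap::new();
    let mut unbounded: HashMap<(String, String), u64> = HashMap::new();
    for request_id in 0..requests {
        // Static label set: the same entry is updated on every request.
        *bounded
            .entry(("http_requests_total".into(), "method=GET".into()))
            .or_insert(0) += 1;
        // Per-request label: every request creates a brand-new entry that the
        // exporter holds on to until an idle timeout (if any) evicts it.
        *unbounded
            .entry(("http_requests_total".into(), format!("request_id={request_id}")))
            .or_insert(0) += 1;
    }
    (bounded.len(), unbounded.len())
}

fn main() {
    let (bounded, unbounded) = simulate(10_000);
    println!("bounded entries: {bounded}, unbounded entries: {unbounded}");
}
```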
If those metrics are emitted sparsely, the exporter is going to hold on to them effectively forever, so the exporter state will just keep growing over time. This can be controlled by configuring the idle timeout (PrometheusBuilder::idle_timeout), which controls how long a given metric can go without being updated before it is removed from the registry.
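A minimal sketch of that configuration, assuming the builder API described above; `MetricKindMask` comes from metrics-util, and the exact signature should be confirmed against the crate docs:

```rust
use std::time::Duration;

use metrics_exporter_prometheus::PrometheusBuilder;
use metrics_util::MetricKindMask;

fn main() {
    // Evict any kind of metric that has not been updated for five minutes;
    // the five-minute value is an arbitrary choice for illustration.
    PrometheusBuilder::new()
        .idle_timeout(MetricKindMask::ALL, Some(Duration::from_secs(300)))
        .install()
        .expect("failed to install Prometheus exporter");
}
```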
it sounds like maybe you're just accumulating idle metrics over time? This is common with metrics that have unbounded cardinality, such as tags based on request IDs or user IDs or any sort of value that has a high cardinality in general.
If you mean that, for instance, the name of the metric includes something that varies from request to request and makes them unique, then no. It is just around five or so metrics with static names.
If those metrics are emitted sparsely, the exporter is going to hold on to them effectively forever, so the exporter state will just keep growing over time. This can be controlled by configuring the idle timeout (PrometheusBuilder::idle_timeout), which controls how long a given metric can go without being updated before it is removed from the registry.
In my setup, it is those five metrics that are being recorded upon each request. So in my mind, before you mentioned the need for scraping, it was five histograms under the hood (that is, a fixed memory requirement).
Well, having only five unique metrics certainly squashes the theory of it being unbounded cardinality.
Does the memory growth subside if you don't install a global recorder?
Now running a Prometheus process locally as well. The memory usage is fluctuating all the time. Hard to tell, but there does not seem to be any upward trend anymore. I guess I just stumbled upon the accumulation you mentioned, which will be addressed in 0.14, and misinterpreted it as a reproduction of the behavior I see in the cloud.
Does the memory growth subside if you don't install a global recorder?
Then the memory usage is as flat as it can be. I will try doing it in production to rule out metrics.
About the scraping: is it that only a few values are pushed to the registry upon each scrape, or all pending ones? In other words, does one have to match the frequency of the scraping with the frequency of other requests to prevent atomic blocks from accumulating?
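For what it's worth, the drain-on-scrape semantics can be sketched with a toy stdlib bucket. The real AtomicBucket is lock-free via crossbeam-epoch; this Mutex-over-Vec stand-in only models the behavior: a scrape takes all pending values at once, so pending memory is bounded by how much is recorded between scrapes, with no frequency matching required.

```rust
use std::mem;
use std::sync::Mutex;

// Toy model of a histogram bucket: writers push raw values; a scrape drains
// *all* pending values in one go.
struct Bucket(Mutex<Vec<f64>>);

impl Bucket {
    fn new() -> Self {
        Bucket(Mutex::new(Vec::new()))
    }

    fn push(&self, value: f64) {
        self.0.lock().unwrap().push(value);
    }

    // Swap the pending values out, leaving an empty Vec behind.
    fn drain(&self) -> Vec<f64> {
        mem::take(&mut *self.0.lock().unwrap())
    }
}

fn main() {
    let bucket = Bucket::new();
    for i in 0..1000 {
        bucket.push(i as f64);
    }
    let scraped = bucket.drain(); // one scrape takes everything pending
    assert_eq!(scraped.len(), 1000);
    assert!(bucket.drain().is_empty()); // nothing is left behind afterwards
    println!("drained {} values", scraped.len());
}
```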
Removed the global recorder in the cluster. The readings disappeared as expected, but the memory consumption still keeps on growing. So it is something else; not the recorder, at least. But I suppose that, without a recorder, the metrics are no-ops, so it is then some other code.
I apologize for the hassle and thank you for the help!
No worries at all, and good luck tracking down your leak! 🔍
Hello,
I have a service whose memory consumption is slightly but steadily growing during runtime. After profiling, I came to the conclusion that it originates from metrics. More specifically, it seems to be due to AtomicBucket::push in metrics-util creating Owned from crossbeam-epoch: https://github.com/metrics-rs/metrics/blob/a143ef60b5354770fb292d29acc1f7bdef2c4aed/metrics-util/src/bucket.rs#L224
The values are allocated but seem to hang around in memory indefinitely. Is this the expected behavior, or is there something I am missing?