I too have had issues with this: I've found that Prometheus very quickly eats up a lot of RAM, and it can't easily be managed.
@sammcj The problem here is that there is no standard way to get a Go program's actual memory consumption. The heap sizes reported in http://golang.org/pkg/runtime/#MemStats are usually off by a factor of 3 or so from the actual resident memory. This can be due to a number of things: memory fragmentation, Go metadata overhead, or the Go runtime not returning pages to the OS very eagerly. A proper solution has yet to be found.
One thing you can tune right now is how many sample chunks to keep in RAM. See this flag:
-storage.local.memory-chunks=1048576: How many chunks to keep in memory. While the size of a chunk is 1kiB, the total memory usage will be significantly higher than this value * 1kiB. Furthermore, for various reasons, more chunks might have to be kept in memory temporarily.
Keep in mind this is only one (albeit major) factor in RAM usage. Other factors are:
Also the various queues (Prometheus's own ones, like the sample ingestion queue, but also Go- and OS-internal ones, like queued-up network queries or whatever). I have this idea of implementing a kind of memory chaperon that will not only evict evictable chunks but also throttle/reject queries and sample ingestion to keep total memory usage (or the amount of free memory on the machine) within limits. But that's all highly non-trivial stuff...
There are by now many things that may take memory, and there are many knobs to turn to tweak it. I changed the name of the issue to something more generic.
Some good news already: the ingestion queue is gone, so there will not be wild RAM usage jumps anymore if scrapes pile up during ingestion.
I am running with a retention of 4 hours and the default "storage.local.memory-chunks", on version 0.13.1-fb3b464. While @juliusv said that I should expect the memory used to be more than 1GB, I am seeing it run out with the Docker container limit set at 2.5GB. Basically it looks like a memory leak, because on restart all of the memory goes back down and then slowly, over time, creeps back up. Is there any formula that could give me a good idea of what to set the memory limit to? Is there any way I can figure out if there is a leak somewhere, or if it is just because more and more data is coming in?
@a86c6f7964 One thing to start out with: if you configure Prometheus to monitor itself (I'd always recommend it), does the metric prometheus_local_storage_memory_chunks go up at the same rate as the memory usage you're seeing? Or does it plateau at the configured maximum while the memory usage continues to go up? Checking prometheus_local_storage_memory_series would also be interesting, to see how many series are current (not archived) in memory. If those are plateauing, and the memory usage is still going up, we'll have to dig deeper.
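A quick way to spot-check those two gauges is to scrape Prometheus's own /metrics endpoint; a minimal sketch, assuming the server listens on localhost:9090:

```
# Show the current number of in-memory chunks and series.
curl -s http://localhost:9090/metrics \
  | grep -E '^prometheus_local_storage_memory_(chunks|series) '
```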
Ya, it was going up. It got almost to 1 million, so maybe it just needs a little more memory.
@a86c6f7964 Retention is fundamentally a bad way to get memory usage under control. It will only affect memory usage if all your chunks fit into memory. Retention is meant to limit disk usage.
Please refer to http://prometheus.io/docs/operating/storage/#memory-usage for a starter. Applying the rule of thumb given there, you should set -storage.local.memory-chunks to 800,000 at most if you have only 2.5GiB available. The default is 1M, which will almost definitely make your Prometheus use more than 2.5GiB in steady state. I recommend starting with -storage.local.memory-chunks=500000 and a retention tailored to your disk size (possibly many days or weeks).
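As a rough sketch of what that could look like on the command line (the binary path and the retention value are placeholders, not recommendations from this thread):

```
./prometheus \
  -config.file=prometheus.yml \
  -storage.local.memory-chunks=500000 \
  -storage.local.retention=360h0m0s   # tailor retention to your disk size
```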
The problem here is that "what's my memory usage?" or "how much memory is free on the system?" are highly non-trivial questions. See http://www.redhat.com/advice/tips/meminfo.html/ as a starter...
I'm currently running prometheus (0.15.1) on a bare metal server with 64GB memory, default settings (except retention, one week) and around 750 compute servers to be scraped every 30s. The server is dedicated to prometheus.
We have observed that memory consumption goes up until the machine is not responding anymore. It takes around two days to reach this point, and killing the prometheus process does not free all memory immediately. As suggested by @juliusv, I monitored prometheus_local_storage_memory_chunks. It started at 1.242443e+06 and ended up in a plateau around 1.86631e+06, please see below. My question is: what should I look at to get more information about this growth and where it is coming from?
Mem used 34549836 KiB
prometheus_local_storage_memory_chunks 1.964022e+06
prometheus_local_storage_memory_series 1.822449e+06
Mem used 38098228 KiB
prometheus_local_storage_memory_chunks 2.013611e+06
prometheus_local_storage_memory_series 1.648374e+06
Mem used 41139708 KiB
prometheus_local_storage_memory_chunks 2.062455e+06
prometheus_local_storage_memory_series 1.472947e+06
Mem used 53843712 KiB
top: 1431 prometh+ 20 0 21.967g 0.015t 7968 S 99.4 23.9 2189:20 prometheus
prometheus_local_storage_memory_chunks 1.899084e+06
prometheus_local_storage_memory_series 1.653677e+06
Mem used 56187240 KiB
top: 1431 prometh+ 20 0 22.384g 0.015t 7968 S 86.8 25.0 2441:44 prometheus
prometheus_local_storage_memory_chunks 1.969578e+06
prometheus_local_storage_memory_series 1.518563e+06
Mem used 63289448 KiB
top: 1431 prometh+ 20 0 23.886g 0.017t 7972 S 88.5 27.4 3461:40 prometheus
prometheus_local_storage_memory_chunks 1.86631e+06
prometheus_local_storage_memory_series 1.518586e+06
@pousa Yeah, sounds like that kind of server should normally not use that much RAM.
Some things to dig into:
- sum(rate(http_request_duration_microseconds_count{handler=~"query|metrics"}[5m]))
- rate(prometheus_local_storage_ingested_samples_total[5m]). With 1.5 million series and a 30s scrape rate, I'd expect roughly 50k samples per second.
- go tool pprof http://prometheus-host:9090/debug/pprof/heap could be interesting to see what section of memory is growing over time (web in the resulting pprof shell will open an SVG graph in the browser).

I would say that the machine does get that much traffic. I had to reboot it in the afternoon (again memory problems) and right now it has:
sum(rate(http_request_duration_microseconds_count{handler=~"query|metrics"}[5m])) = 0.029629190678656617
rate(prometheus_local_storage_ingested_samples_total[5m]) = 12478.366666666667
They are node exporter and a collectd plugin developed by us that basically aggregates data from /proc/PID/stat for each job. In our data center, a job is basically a parallel application from the HPC domain, composed of processes/threads.
Great! Thanks for the pprof tip, I will do it and post it here later.
@pousa That sounds like a low number of queries and a very reasonable ingestion rate for that kind of server. Though that makes me wonder: if you have 1.5 million series active in memory, I would have expected an ingestion rate of ~50k samples per second (at a 30s scrape interval) instead of 12k/s, unless your series are frequently changing, so that only a small subset of active memory series gets an update on every scrape. If you are monitoring jobs via /proc/PID/stat, and these jobs are labeled by their PID, I wonder if that is what's leading to frequent churn in series (by PIDs changing all the time?). Still not sure how exactly that would lead to your memory woes though.
Your memory usage is shooting up, while memory chunks and series are staying pretty constant. Weird!
@juliusv the ingestion rate stays around this, and the number of monitored jobs is around 7k. I do not monitor PIDs explicitly, but JOBIDs (sets of PIDs). However, they also change quite a lot. Jobs have a maximum duration of 4h, 24h or 120h, and it is common to have jobs that run for a few minutes.
Yep, that is why I posted here. I could not understand this either. I still have to run pprof, will do this today.
@juliusv I ran pprof and saw only 3GB being used. Need to investigate more...
@pousa The Go-reported heap size is always smaller than what the OS sees in terms of resident memory usage (due to internal metadata overhead and memory fragmentation), but that's usually a factor of 2 or so. I don't see how it would report 3GB but then fill up 64GB in reality. Odd!
@juliusv Thanks for the information. Indeed odd. I replicated the service on a different server today, and I'm monitoring both servers/prometheus. I want to see if this could somehow be related to the server and not Prometheus itself, since top reports only half of the memory being used by prometheus.
I have a similar issue as @pousa: the EC2 instance repeatedly runs out of memory after about 2-4 days. I do have much less memory than @pousa, but I am wondering what the minimum/recommended memory capacity is for running Prometheus long-term. Is it possible for Prometheus to control its memory usage automatically instead of exhausting all the memory on the instance until it dies?
@killercentury I still have the problem. I tried to build it with a newer version of Go, but no luck. I also looked into Go runtime environment variables but could not find anything.
Hi everybody, please read http://prometheus.io/docs/operating/storage/ . @killercentury if you have very little memory (less than ~4GiB), you want to reduce -storage.local.memory-chunks from its default value. @pousa with 1.5M time series, you want to increase -storage.local.memory-chunks to something like 5M to make Prometheus work properly. -storage.local.max-chunks-to-persist should be increased then, too. Each time series should be able to keep at least 3 chunks in memory, ideally more... Also, if -storage.local.max-chunks-to-persist is not high enough, Prometheus will attempt a lot of little disk ops, which will slow everything else down and might increase RAM usage a lot because queues fill up and everything. That's especially true with 7k targets. If everything slows down, this might easily result in a spiral of death... Once you have tweaked the two flags mentioned above (perhaps to even higher settings), I would next increase the scraping interval to something very high (like 3min or so) to check if things improve. Then you can incrementally reduce the interval until you see problems arising. (In different news: 7k is a very high number of targets. Sharding of some kind might be required. But that's a different story.)
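For a setup in that size range, the flags mentioned above might be set along these lines; a sketch only, with the max-chunks-to-persist value being an illustrative guess rather than a figure from this thread:

```
./prometheus \
  -config.file=prometheus.yml \
  -storage.local.memory-chunks=5000000 \
  -storage.local.max-chunks-to-persist=2500000   # illustrative; increase alongside memory-chunks
```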
And yes, ideally all of these values would auto-tweak themselves. However, that's highly non-trivial and not a priority to implement right now.
@beorn7 I will try tweaking those flags and if needed the scrape interval. Thanks!
@beorn7 Changing those flags and increasing the scraping interval a bit allowed our Prometheus instances to run without running out of memory. Thanks! However, I still see long times to get results back from expressions... sometimes I even get timeouts. Concerning sharding, we were already doing that.
Expensive queries can be pre-computed with recording rules: http://prometheus.io/docs/querying/rules/
To try out very expensive queries, you can increase the query timeout via the -query.timeout flag.
(Obviously, that's all now off-topic and has nothing to do with memory usage anymore. ;)
@beorn7 Thanks, I will try the flag. We already have rules in place, and I was talking about very simple queries (e.g. single metrics). But, as you said, this is a different topic ;)
If a single time series takes a long time to query, then we are kind of on-topic again. Because the time is most likely needed to load chunks from disk. Tweaking the flags discussed here, you can maximize the number of chunks Prometheus keeps in memory, and thereby avoid loading in chunks from disk. But in different news, loading a single series from disk should be very fast (because all the data is in a single file, one seek only). So I guess your server is very busy and overloaded anyway, so that everything is slow.
Are we really talking about single series and not single metric names, but with multiple series?
@juliusv and @beorn7 I'm talking about single metric names, e.g. virtual memory of jobs or processor load on servers. But this should not take that much time, right? My timeout is set to 3m, and this only happens when I see with top that all memory on the server is being used.
@pousa If the server is swapping, all bets are off. If it's just using a lot of memory and not swapping, it really depends on how many series the metric consists of, and what your exact query is (graph or tabular?). A single metric may still consist of tens of thousands of time series or more, in which case a very busy server might become too slow to even do a tabular query for all series matching that metric name (though tens of thousands normally still works fast and fine in the tabular view). Some more details about the query would be interesting...
@juliusv I will have to check if the server is swapping when this happens. I don't see any timeouts now; they usually start when all memory is being used... and this takes around 2-3 days after I start Prometheus on the server.
Concerning the query, we only use tabular queries. And what we use most here is something very simple like:
collectd_jobs_vm{encl="45"}
which shows the virtual memory for all jobs running on chassis 45. And this is enough to get a timeout when the server has all of its memory in use.
Ok, assuming there are not a hundred thousand jobs on chassis 45, I'd expect this query to always complete in a reasonable amount of time unless the machine is swapping (or the server is otherwise so incredibly overloaded that nothing really works anymore). So yeah, check what the si and so columns of vmstat 1 say next time that happens :)
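For completeness, a sketch of that check (standard procps vmstat, nothing Prometheus-specific):

```
# Print memory and swap statistics once per second.
# Sustained non-zero values in the "si" (swap-in) and "so" (swap-out)
# columns while queries are slow mean the box is actively swapping.
vmstat 1
```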
I'm experiencing almost the same problem as @pousa, on 0.20.0. Much smaller setup though, only about 50 nodes with 30s interval.
prometheus_local_storage_memory_chunks = 1048996
prometheus_local_storage_memory_series = 897363
rate(prometheus_local_storage_ingested_samples_total[5m]) = 6463
sum(rate(http_request_duration_microseconds_count{handler=~"query|metrics"}[5m])) = 0.0169
pprof.prometheus.inuse_objects.inuse_space.001.pb.gz prom-metrics.txt
Looks reasonable, yet my prometheus instance is using upwards of 26Gi of rss. When I do a heap dump with debug/pprof, it only shows 4Gi in usage.
I haven't tried the solution of increasing max chunks, although it seems counter-intuitive to me, as the docs say it increases memory usage. Although it may fix things, I'd really like to understand what is consuming the ~22Gi of out-of-heap memory, so I can plan my instance size properly and know how to tune things. Any ideas where to start looking?
Do you by any chance either have only a handful of targets that are really large, or are federating or otherwise querying large amounts of data?
Biggest targets are kubelet which exports cadvisor metrics: a simple wc -l shows 2258 lines (includes comments). No federation, although we do pull from a pushgateway but that has only about 1k lines.
That's quite odd then. Can you repeatedly take heap snapshots and find what's taking up the space? Go only gives up memory every 5m or so.
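A minimal sketch of how such repeated snapshots could be collected (assuming the server is reachable at localhost:9090 and the Go toolchain is installed; file names are arbitrary):

```
# Fetch a heap profile every 5 minutes; each file can later be
# inspected with `go tool pprof <file>` (then `top` or `web`).
while true; do
  curl -s -o "heap-$(date +%Y%m%d-%H%M%S).pb.gz" \
    http://localhost:9090/debug/pprof/heap
  sleep 300
done
```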
I'd like to second this issue. I'm having much the same issues even with a small-scale environment of one EC2 node running 15 Docker containers, scraped via cAdvisor plus host metrics.
rate(prometheus_local_storage_ingested_samples_total[5m]) = 371
prometheus_local_storage_memory_chunks = 524140
prometheus_local_storage_memory_series = 2409
I set -storage.local.memory-chunks=524288, and the Prometheus container used ~220MB of RAM after recreation. Over the period of one week, RAM usage constantly climbed, up to 1.3GB as of now.
To me it seems like Prometheus is eating up memory for as long as it can. If I had to make an educated guess, I'd say this is some sort of memory leak.
Here is a graph of Prometheus's own memory usage. The purple graph is the old container with the default storage.local.memory-chunks, and the yellow graph is the new one with that setting halved to 524288.
My next idea would be to limit the Prometheus Docker container's max memory. However, I'm not sure how Prometheus will react. If it's really a memory leak, it could deadlock :scream:
@philicious 1.3GiB RAM usage is very reasonable with 500k chunks in memory. The memory usage will climb until Prometheus has maxed out the configured number of memory chunks. Then it should stabilize, with spikes while running queries, scrapes, or whatever else requires RAM temporarily. If you limit RAM usage hard, Prometheus will simply crash, i.e. Prometheus does not detect available memory in any way. The only lever you have is the storage.local.memory-chunks flag, which is not rocket science: RAM in KiB divided by 5 as the flag value should work out fine in most cases.
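A worked example of that rule of thumb (the 4GiB figure is just for illustration):

```
# Machine with 4 GiB of RAM available for Prometheus:
#   4 GiB = 4 * 1024 * 1024 KiB = 4194304 KiB
#   4194304 / 5 = 838860 (integer division)
echo $(( 4 * 1024 * 1024 / 5 ))   # -> 838860, a reasonable -storage.local.memory-chunks value
```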
@jsravn The tens of GiB used is really weird. Something irregular is going on here. A run-away scrape, as Brian suspected, for example. Or some of the more exotic features going wild. Are you using remote storage or a special kind of service discovery? It's unlikely this has to do with storing ingested samples in RAM.
@beorn7 oh ok. I read https://prometheus.io/docs/operating/storage/ but that's vague about the relation between RAM usage and that flag.
So if I got you right, one can approximately expect (storage.local.memory-chunks / 1024) * 5 ~= max RAM used in MiB?
Strictly speaking, each chunk will only take 1k of RAM. But then there is some overhead of managing the chunks, and then there is a whole lot of other things the server is doing, most notably serving queries. Each time series in memory has a footprint, too, which becomes relevant if you have a lot of time series with relatively few chunks in memory. The x5 multiplier just turned out as a threshold that is rarely crossed. In most cases, it will be clearly below that. In extreme cases, it will be above that.
Having something smarter than that is the whole point of this issue. However, in practice, tweaking for your available memory is rarely a problem (speaking as somebody who is in charge of 50+ production Prometheus servers with very different loads).
@beorn7 ye ok, understood. Usually I either don't care and just give it more RAM, or even tweak that flag upwards. Unfortunately, for the first time, I have to squeeze out some memory. Now, thanks to you, I have a better understanding of that flag. Maybe you could add this x5 multiplier rule of thumb to the aforementioned docs page.
The doc states "As a rule of thumb, you should have at least three times more RAM available than needed by the memory chunks alone."
Adding something like "4x is a safer bet, and 5x is almost certainly but not always safe" sounds a bit like Monty Python... ;)
@beorn7
The tens of GiB used is really weird. Something irregular is going on here. A run-away scrape, as Brian suspected, for example. Or some of the more exotic features going wild. Are you using remote storage or a special kind of service discovery? It's unlikely this has to do with storing ingested samples in RAM.
We're using the AWS node discovery. I could try turning that off and see what happens. I'm away for a couple of days, but I'll give it a shot when I get back. We're writing to a provisioned-IOPS EBS volume in AWS.
Here's a couple more screenshots. Here's the container memory usage:
Here's disk i/o. It seems high to me (70MB/s sustained writes), but not sure. We provisioned 2000 iops on the EBS, so there's headroom left, and queue sizes are low, so I don't think it's a bottleneck.
I'll try a few things:
- get heap dumps as requested every 5 minutes
Thanks for the help so far.
I'd like you to take heap dumps as often as you can. When you get one that shows the 26GiB of usage in Go, please send it on.
Had a crash, and noticed this in the logs:
time="2016-07-26T15:20:28Z" level=info msg="File scan complete. 10092420 series found." source="crashrecovery.go:81"
It's taking a long time to do crash recovery as well, >15 minutes. For 50 nodes, 10 million time series sounds quite wrong.
edit: Actually, it may make sense for us, since cadvisor (kubelet) generates new metrics for new containers, and we're constantly deploying new versions.
Ok, I took continuous heap dumps, enabled gctrace=1, and restarted prometheus.
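For context, the gc/scvg lines further down come from the Go runtime's GC trace, which is enabled roughly like this (binary path and flags are placeholders); in each gc line, the MB triple is the heap size at GC start, the heap size at GC end, and the live heap after the collection:

```
# Enable GC trace output on stderr when starting Prometheus.
GODEBUG=gctrace=1 ./prometheus -config.file=prometheus.yml
```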
Here's first 5 minutes rss usage:
I did notice cadvisor seems to be reporting a larger container memory usage than the system though, but it's still high (21.8Gi):
# ps aux | grep prom
root 14613 316 70.1 22824660 22806488 ? Ssl 19:09 37:31 /bin/prometheus -config.file=/etc/prometheus/prometheus.yml -alertmanager.url=http://alertmanager -log.level=debug -web.external-url=http://prometheus-default-tools.k8s.api.bskyb.com -storage.local.retention=360h0m0s
Largest collection shows 14Gi, and target heap hovers around 8-10:
gc 41 @148.186s 6%: 74+1032+0.93 ms clock, 599+2167/2065/2951+7.4 ms cpu, 9404->9426->5986 MB, 9431 MB goal, 8 P
scvg0: inuse: 6363, idle: 7996, sys: 14359, released: 0, consumed: 14359 (MB)
gc 42 @152.086s 6%: 7.0+1165+0.65 ms clock, 42+5814/2330/183+3.9 ms cpu, 12369->12370->8385 MB, 12370 MB goal, 8 P
gc 43 @166.613s 6%: 25+2241+1.0 ms clock, 103+2721/4482/303+4.3 ms cpu, 14867->16218->6073 MB, 16771 MB goal, 8 P
gc 44 @170.886s 6%: 1.1+1023+0.72 ms clock, 9.5+307/1941/4909+5.7 ms cpu, 8375->8772->6420 MB, 9448 MB goal, 8 P
time="2016-07-26T19:12:20Z" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:539"
gc 45 @181.074s 6%: 2.4+1137+1.0 ms clock, 12+4520/2258/173+5.0 ms cpu, 11755->12011->4152 MB, 12049 MB goal, 8 P
gc 46 @184.955s 7%: 4.7+1187+0.72 ms clock, 37+2644/2238/2261+5.7 ms cpu, 7601->7823->4221 MB, 7796 MB goal, 8 P
2016-07-26-prom-logs-startup.txt
Largest heap dump was 7380MiB:
Fetching profile from http://prom/debug/pprof/heap
Saved profile in /home/me/pprof/pprof.prom.inuse_objects.inuse_space.353.pb.gz
0 0% 99.39% 7380.74MB 99.67% runtime.goexit
pprof.prom.inuse_space.353.pb.gz
I also did continuous heap dumps on an earlier, longer running prometheus and got a 9081MiB heap as the largest (with it usually being about 5-6): pprof.prom.inuse_space.088.pb.gz
Ok, the reason cadvisor shows higher memory usage is that it includes file caches for the process in addition to rss. So about 6Gi of file cache:
# cat /sys/fs/cgroup/memory/docker/f9574512ad67ab6303907c33b522794cbe151ce6b77ad873854e9c6a433c8a50/memory.stat
cache 6561488896
rss 24723570688
rss_huge 16739467264
mapped_file 7610368
swap 0
pgpgin 57215766
pgpgout 53660702
pgfault 2006917
pgmajfault 50
inactive_anon 0
active_anon 24723570688
inactive_file 5335183360
active_file 1226301440
@jsravn With almost a million time series in your checkpoint and 10M file series altogether, you should definitely configure more than the default 1M memory chunks, more like 3M. Otherwise, you will have a lot of eviction and reload. Perhaps that causes the GC to not keep up. The GC amount looks pretty big to me. Please read https://prometheus.io/docs/operating/storage/ and apply the tweaks recommended there.
@beorn7 I tried doubling it to 2M memory chunks, but it didn't seem to make a difference. Looking at the metrics, the process hits >20Gi RSS long before it even reaches 1M memory chunks; it's usually at about 700K-800K chunks when it hits 20G+, and then it slowly increases up to the chunk limit.
I'm at a bit of a loss now; I've spent a lot of time trying all sorts of combinations. Disabling all scrape configs + alerts brings memory usage down to 10Gi. Turning on either the alerts or the scrape configs seems to put it back. I've also raised the scrape interval to 60s and the retention time to 180 hours, but that hasn't seemed to make much difference at all.
Ok, finally got somewhere. I was able to reduce memory usage by about 30-40% by disabling huge pages, e.g. echo never > /sys/kernel/mm/transparent_hugepage/enabled. Prometheus RSS has dropped to a steady 11-13Gi vs 18-22Gi+. This sounds like the hugepage problems golang 1.5 had (https://github.com/golang/go/issues/8832). I'm not sure why it happens in my case though - maybe due to the large number of new time series being created because of containers spinning up.
This seems to have finally got prometheus's memory usage under control for me:
echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
echo madvise > /sys/kernel/mm/transparent_hugepage/defrag
This seems to let the Go runtime's MADV_HUGEPAGE/MADV_NOHUGEPAGE hints operate correctly. With these set to always (the default on RHEL7), it doesn't seem to play nicely with Go's GC, and I get that large RSS usage after a while.
I'll keep prometheus running for a few days with these settings and see how it goes. So far, this has dropped my mem usage to 7-10Gi, and cpu usage has dropped down to 1-2 core usage from 3-4.
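As a side note, a quick way to confirm which transparent-hugepage mode is active (the bracketed entry in the output is the current setting):

```
cat /sys/kernel/mm/transparent_hugepage/enabled   # e.g. "always [madvise] never"
cat /sys/kernel/mm/transparent_hugepage/defrag
```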
Currently, Prometheus simply limits the chunks in memory to a fixed number.
However, this number doesn't directly imply the total memory usage as many other things take memory as well.
Prometheus could measure its own memory consumption and (optionally) evict chunks early if it needs too much memory.
It's non-trivial to measure "actual" memory consumption in a platform-independent way.