prometheus-junkyard / tsdb

The Prometheus time series database layer.

Memory issues on short block sizes #480

Closed radeklesniewski closed 5 years ago

radeklesniewski commented 5 years ago

Bug Report

We are testing Prometheus with Thanos to gather metrics from multiple OpenShift clusters. On each cluster we have a significant number of ephemeral pods, which are scraped by Prometheus. We encountered a strange memory consumption pattern during a test that resembles our live metrics almost 1:1. We are able to process millions of metrics when a given label changes only every ~2 hours, while for labels that change more often we see a constant increase in memory consumption. We tried lowering the retention time to 15 minutes and the block size down to 5 minutes, but this didn't help much, as we were still seeing memory consumption grow even after the retention period. Further investigation showed that short block sizes together with short retention degraded the performance of writing blocks to disk: after 30 minutes of testing, a 5-minute block took longer than 10 minutes to flush, +/- 3 minutes for head GC.

Additionally, we are observing unexpected behaviour of the retention policy execution. As far as we can tell after debugging Prometheus, there are two conditions for a block to be deleted:

  1. The block is beyond the retention period,
  2. The block is not open while old blocks are being dropped from memory.

Example of the unexpected behaviour. Config: retention=15m, min-block-duration=5m, max-block-duration=5m. Behaviour: old blocks are deleted not after ~20 minutes (retention + newest block duration), but after ~60 minutes (the Prometheus instance is not queried while the test runs, so there should be no reason for a block to be loaded into memory). The test runs for up to ~2 hours, and during that time we observe a constant increase in memory consumption, even though after the first hour the number of blocks and data series is constant.

After pinning the label so that it changes only every 2 hours, we could easily use longer blocks and retention periods, as memory consumption was stable: it peaked at 16 GB, with an average of 11 GB, for 3M metrics per minute. With labels changing every minute, for 1.5M metrics per minute the instance crashed at ~100 GB after ~30 minutes.

In both cases the ratio of distinct metrics to the number of blocks was constant, which raises the question of why Prometheus behaves like this and how we can tune it to perform better with high-cardinality labels (e.g. cron job id / pod name). We are able to access and aggregate over such metrics in Thanos, so Prometheus would only need to scrape and keep the data for as long as it takes to flush a block to disk.

What did you expect to see? We expected a roughly proportional relation between block size, number of distinct metrics (cardinality / data points), retention, and memory consumption.

What did you see instead? Under which circumstances? A significant increase in memory consumption for metrics with high-cardinality / variable labels.

Environment: OpenShift v3.9 with live data; RHEL 6 VM with artificial data (Linux 2.6.32-696.18.7.el6.x86_64 x86_64)

brian-brazil commented 5 years ago

> With labels changing every minute, for 1.5M metrics per minute the instance crashed at ~100 GB after ~30 minutes.

Prometheus is a time series monitoring system; attempting to use it as a general event-logging TSDB is going to eat a lot of RAM, given that we're optimised for time series.

It's also not recommended to change the block settings; those flags are only there for our own benchmarking.

I'd suggest trying 2.6.0 (it has memory improvements for extreme cardinality cases), and giving it as much RAM as it needs. I make it at least 80GB if you've got total churn every 15m.
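
For a rough sense of where a number like that comes from, here is a back-of-the-envelope sketch. The inputs are assumptions for illustration only: the ~1.5M new series per minute is taken from the report above, and the ~3.5 KiB of head memory per series is an assumed ballpark, not a figure from this thread:

```go
package main

import "fmt"

func main() {
	// Assumptions for illustration only (not measured values from this issue):
	// - roughly 1.5M new series appear per minute (total churn),
	// - every series created in the last 15 minutes is still held in the head,
	// - each head series costs roughly 3.5 KiB of memory.
	const (
		newSeriesPerMinute = 1.5e6
		churnWindowMinutes = 15
		bytesPerSeries     = 3.5 * 1024
	)
	liveSeries := newSeriesPerMinute * churnWindowMinutes
	gib := liveSeries * bytesPerSeries / (1 << 30)
	fmt.Printf("~%.1fM live series -> ~%.0f GiB of head memory\n", liveSeries/1e6, gib)
	// Output: ~22.5M live series -> ~75 GiB of head memory
}
```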

radeklesniewski commented 5 years ago

Thanks for the clarification. We are continuing tests with version 2.6.0-rc1; however, the results are similar to previous versions.

I'm aware that changing the block settings is not recommended, but we have quite a specific use case for the Prometheus & Thanos integration, with high-cardinality labels (e.g. cron job id / pod name), that does not allow for stable Prometheus instances with the block duration set to ~2h.

My main concern is that with shorter blocks Prometheus drops old blocks only after roughly four times the configured retention time, and the time needed to write a single block to disk grows proportionally with the uptime of the Prometheus instance. Have you observed similar behaviour during benchmarking? Is there any option to mitigate these issues?

krasi-georgiev commented 5 years ago

I am sorry but this sounds like a very specific use case and I don't think we would be able to support it.

krasi-georgiev commented 5 years ago

Can you maybe add some more specifics about the setup? How many pods are you trying to monitor, and how often do they change completely?

One suggestion: maybe drop some labels that are unique to every pod and thus reduce the cardinality.

radeklesniewski commented 5 years ago

We will be starting with monitoring 100 pods (eventually we will need to monitor 1000 pods). Each pod can produce up to ~20k metrics at peak traffic (~50% of the metrics can be sparse and are reported only every couple of minutes). Only 5% of the metrics reported by a single pod will have labels related to the name of the pod/container etc. This is why we would like to keep blocks as short as possible: in our setup with Thanos we want to use Prometheus only as a scraper of metrics.

brian-brazil commented 5 years ago

That sounds more like event logging than time series monitoring, as time series should not be sparse. 20k metrics from each pod also sounds on the excessive side.

krasi-georgiev commented 5 years ago

@radeklesniewski yeah, it sounds like you need to rethink the design somewhere.

radeklesniewski commented 5 years ago

Unfortunately it's not easy to redesign our solution - it's an existing workflow in our company and we need to migrate it to OpenShift the way it's used today. There are a couple of reasons why we are not able to replace it with event logging:

We've been working with Thanos and testing it against this volume of data - it works correctly and is more cost-effective than event logging. I think Prometheus would work fine for our use case, because we are only interested in keeping the latest blocks; the problem is that Prometheus consumes more and more memory, because it is not able to delete blocks older than 15 minutes. I know this is generally not a recommended configuration if someone is using Prometheus for monitoring and querying data series, but I think there are no obstacles to using it only for scraping data and building blocks for Thanos? Have you observed such issues with dropping older blocks during benchmarking?

krasi-georgiev commented 5 years ago

I just double-checked the code, and retention doesn't account for the data that is not yet persisted in a block and currently stays in the WAL and the memory/head. This is what is causing the delayed deletion of blocks by retention.

I am thinking of opening a PR to calculate the retention cut-off more accurately by accounting for the data in the head as well. Would that solve your problem?
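
To illustrate the idea, here is a minimal sketch of computing the retention cut-off from the newest data anywhere in the DB, including the head. It is not the actual tsdb code; blockMeta and beyondRetention are simplified stand-ins:

```go
package main

import "fmt"

// blockMeta is a simplified stand-in for a block's time range.
type blockMeta struct {
	MinTime, MaxTime int64 // milliseconds
}

// beyondRetention sketches the proposed behaviour: measure the retention
// window from the newest sample anywhere in the DB -- including the in-memory
// head -- so old blocks become deletable even before the head is persisted.
func beyondRetention(blocks []blockMeta, headMaxTime, retention int64) []blockMeta {
	maxt := headMaxTime
	for _, b := range blocks {
		if b.MaxTime > maxt {
			maxt = b.MaxTime
		}
	}
	var deletable []blockMeta
	for _, b := range blocks {
		if b.MaxTime < maxt-retention {
			deletable = append(deletable, b)
		}
	}
	return deletable
}

func main() {
	const m = int64(60 * 1000) // one minute in milliseconds
	blocks := []blockMeta{
		{0, 5 * m}, {5 * m, 10 * m}, {10 * m, 15 * m},
	}
	// The head currently covers data up to the 21st minute; with retention=15m
	// the first block (0m-5m) already falls outside the window and is deletable.
	fmt.Println(beyondRetention(blocks, 21*m, 15*m))
}
```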

radeklesniewski commented 5 years ago

Let's assume this situation: Prometheus is configured with a 15m retention policy and 5m blocks. After 20m of running there are 3 blocks written to disk and the data from the last 5 minutes is in the WAL. So, as far as I understand, can data series from already-persisted blocks still be in the head/memory, and can that prevent retention from deleting those blocks?

If that's the root cause of my problem, such a PR would be perfect :) Probably not only for me - it could be a common case for more people and would make the Prometheus & Thanos integration more flexible.

krasi-georgiev commented 5 years ago

> Let's assume this situation: Prometheus is configured with a 15m retention policy and 5m blocks. After 20m of running there are 3 blocks written to disk and the data from the last 5 minutes is in the WAL. So, as far as I understand, can data series from already-persisted blocks still be in the head/memory, and can that prevent retention from deleting those blocks?

After persisting the data from the head into a block, that data is removed from the head: https://github.com/prometheus/tsdb/blob/6d489a1004dc3f7fa1ebffb66f6cebc2a0278fe7/db.go#L595

I am almost certain that blocks are loaded only when querying and during compaction.

The retention waits for all opened blocks to be closed and deletes them after that. So an open block still gets deleted when it is outside the retention settings.

https://github.com/prometheus/tsdb/blob/6d489a1004dc3f7fa1ebffb66f6cebc2a0278fe7/db.go#L572-L586

radeklesniewski commented 5 years ago

So it does not explain my issue...

Just in case you would like to reproduce this on your side: https://github.com/SabreOSS/tsdb/tree/feature/additional_logging_for_compaction - a link to the forked repo with added logging in the compact.go write operation; https://drive.google.com/open?id=1-6PTQ9PfR1Xhb2xtL2GLTO23gMBT4wdA - the metric generator we've been using for tests.

krasi-georgiev commented 5 years ago

Each series occupies some amount of memory and there is no way around that, so with high cardinality the total memory needed will depend on how often the time series change, how often you scrape the targets, and how often the head is persisted to disk. If the memory is still not enough even after lowering the block size, there is nothing else I can think of that would help decrease the memory usage.

krasi-georgiev commented 5 years ago

min-block-duration=5m, max-block-duration=5m - this should also disable compaction, which is another memory-hungry operation. So again, I think you have done all the necessary tweaking and I can't think of anything else that might help.

krasi-georgiev commented 5 years ago

@radeklesniewski is there anything else you think we should discuss or test here?

radeklesniewski commented 5 years ago

@krasi-georgiev sorry for the late reply. We are continuing with memory & CPU profiling to find the root cause of the issues. At the current stage, we've spotted two things that may be interesting for you:

  1. Most of the CPU time is consumed by GC-related operations:

[image: CPU profile]

  2. Prometheus reports an alloc_space roughly five times larger than inuse_space:

mgraph_alloc_space.pdf

mgraph_inuse_space.pdf

We will continue investigating resource utilization issues and I will get back to you with findings on Friday or Monday.

krasi-georgiev commented 5 years ago

thanks for the update. That indeed looks interesting. If you find a way to optimise any of the big blocks I am happy to review a PR.

radeklesniewski commented 5 years ago

@krasi-georgiev we've done a more detailed analysis of resource utilization and we have a couple of questions & findings we would like to discuss:

  1. We've seen the changes to the SortedPostings function committed on 3rd of January (https://github.com/prometheus/tsdb/commit/296f943ec4aad56409f35e3916ceabf9f8c0b61c and https://github.com/prometheus/tsdb/commit/b2d7bbd6b11d8c9a94c0d23da0af2a60650174c3). We've continued the test with 8M non-sparse metrics scraped every minute to compare the time needed to persist a single 5m block: the previous version of the sorting averaged 6m50s vs. 4m10s for the new version.
  2. We've spotted that most of the CPU time during a block write is spent on sorting:

[image: CPU profile of the block write]

So we've tested a custom version with the sorting section removed, and the average time needed to write a block is 2m27s. But the block size on disk is 1.2 GB, so it looks like Prometheus is writing the block at approximately 8.16 MB/s. So I have two questions:

  1. What is the role of sorting postings? Is it used to optimize block compaction, or queries with a regexp pattern on a particular label? If we don't need compaction, because our blocks are of equal size, is it safe to remove the sorting operation in our custom version of Prometheus to optimize it for sparse metrics?
  2. What is Prometheus doing with data/metadata, besides sorting, when writing a block? What makes this operation relatively slow?

Additionally, using profiling tools we've seen that runtime.* functions consume a lot of CPU time on conditional statements, and it really bothers me. Is this something acceptable in Go applications?

source_code_times.zip

krasi-georgiev commented 5 years ago

> What is the role of sorting postings? Is it used to optimize block compaction, or queries with a regexp pattern on a particular label? If we don't need compaction, because our blocks are of equal size, is it safe to remove the sorting operation in our custom version of Prometheus to optimize it for sparse metrics?

I asked the same question a few weeks ago as part of the PR you are mentioning, and IIRC the querying expects sorted postings: see newMergedSeriesSet.
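
To illustrate why sorted postings matter for querying: the merged series set is essentially a k-way merge, and a merge can only produce ordered, de-duplicated output cheaply if every input list is already sorted. A minimal two-list sketch (not the actual newMergedSeriesSet code):

```go
package main

import "fmt"

// mergeSorted merges two sorted lists of series references into one sorted,
// de-duplicated list -- the same shape of work a merged series set does at
// query time. If the inputs were unsorted, this would need extra sorting or
// repeated scans instead of a single linear pass.
func mergeSorted(a, b []uint64) []uint64 {
	out := make([]uint64, 0, len(a)+len(b))
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] < b[j]:
			out = append(out, a[i])
			i++
		case b[j] < a[i]:
			out = append(out, b[j])
			j++
		default: // same series reference in both lists: emit it once
			out = append(out, a[i])
			i++
			j++
		}
	}
	out = append(out, a[i:]...)
	out = append(out, b[j:]...)
	return out
}

func main() {
	fmt.Println(mergeSorted([]uint64{1, 3, 7}, []uint64{2, 3, 9})) // [1 2 3 7 9]
}
```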

> What is Prometheus doing with data/metadata, besides sorting, when writing a block? What makes this operation relatively slow?

I don't fully understand this question.

> Using profiling tools we've seen that runtime.* functions consume a lot of CPU time on conditional statements, and it really bothers me. Is this something acceptable in Go applications?

I'm not very familiar with that area, but I think runtime.* covers the GC, goroutine scheduling, etc., so I think it is normal to see some high activity there in the case of Prometheus.

radeklesniewski commented 5 years ago

> What is Prometheus doing with data/metadata, besides sorting, when writing a block? What makes this operation relatively slow?

> I don't fully understand this question.

For test purposes, to improve the speed of writing a single block to disk, we removed the sorting operation. Even with sorting removed, the average speed of writing a block to disk was 8.16 MB/s, which is quite disappointing and is not limited by disk I/O speed. So my question is: what takes Prometheus so long to persist a single block? Below is the flame graph for tsdb.(*LeveledCompactor).write with the postings sorting removed:

[image: flame graph of tsdb.(*LeveledCompactor).write]

None of the visible operations/functions seems to be particularly time-costly.
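
For context, a rough sketch of the per-series work a block write has to do besides sorting, based on the on-disk block layout (chunk segments, index with postings, meta.json); the types below are simplified stand-ins, not the actual LeveledCompactor code:

```go
package main

import "fmt"

// series is a simplified stand-in for one time series that the compactor
// streams into a new block (the real code works with label sets and chunk
// references, not plain strings and ints).
type series struct {
	labels string
	chunks []int
}

// writeBlock sketches what persisting a block involves: every series is
// streamed once, its chunks are written to the chunk segment files, the
// series entry is added to the index, and finally the postings lists and
// metadata are written out.
func writeBlock(sortedSeries []series) {
	for _, s := range sortedSeries {
		// 1. write the raw chunk data for this series (chunks/ segment files)
		fmt.Printf("write %d chunks for %s\n", len(s.chunks), s.labels)
		// 2. add the series entry (labels + chunk references) to the index
		fmt.Printf("index series %s\n", s.labels)
	}
	// 3. write label value indices and postings lists (built while streaming)
	fmt.Println("write postings")
	// 4. write meta.json and the tombstones file, then sync and rename the dir
	fmt.Println("write meta.json, tombstones")
}

func main() {
	writeBlock([]series{{`pod="a"`, []int{1, 2}}, {`pod="b"`, []int{3}}})
}
```

Because every series is touched individually by both the chunk and the index writer, the cost scales with the number of series more than with raw bytes, so a low MB/s figure for a high-cardinality block may reflect per-series overhead rather than disk throughput.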

> Using profiling tools we've seen that runtime.* functions consume a lot of CPU time on conditional statements, and it really bothers me. Is this something acceptable in Go applications?

> I'm not very familiar with that area, but I think runtime.* covers the GC, goroutine scheduling, etc., so I think it is normal to see some high activity there in the case of Prometheus.

If you check the file I've attached, you can see that, for example, the findObject function spends 91% of its time on the conditional statement "if s == nil || p < s.base() || p >= s.limit || s.state != mSpanInUse". That seems quite strange to me :)

[image: profile of runtime.findObject]

krasi-georgiev commented 5 years ago

yeah, that looks strange. Could it be the s.base() func call?

Maybe when it is a big map it takes time to find the key, but it is really strange for it to take more than 3 minutes.

Maybe post an issue in the golang github or ask in the IRC channel. I would be interested to know the answer as well.

radeklesniewski commented 5 years ago

@krasi-georgiev I think we can close the issue. The new implementation of the sorting operation improved our performance a bit and we are close to achieving our main goal with sparse metrics.

Additionally, I've created a PR (https://github.com/prometheus/tsdb/pull/503) that adds logging of the duration of the func (c *LeveledCompactor) write(...) execution in compact.go - it was very useful during our performance tests and may be useful for others as well.
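
As an illustration of that kind of instrumentation, a minimal sketch of timing a block write and logging its duration (writeBlock below is a hypothetical stand-in, not the function touched by the PR):

```go
package main

import (
	"log"
	"time"
)

// writeBlock is a hypothetical stand-in for the compactor's block-write step,
// used only to show the shape of the timing/logging described above.
func writeBlock() error {
	time.Sleep(50 * time.Millisecond) // simulate the expensive persist step
	return nil
}

func main() {
	start := time.Now()
	if err := writeBlock(); err != nil {
		log.Fatalf("write block failed: %v", err)
	}
	// Report how long the write took, similar in spirit to the added logging.
	log.Printf("msg=\"wrote block\" duration=%s", time.Since(start))
}
```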

krasi-georgiev commented 5 years ago

ok, thanks for the update. Glad you are close to a solution.