quickwit-oss / quickwit

Cloud-native search engine for observability. An open-source alternative to Datadog, Elasticsearch, Loki, and Tempo.
https://quickwit.io

OOMs after repeated queries on larger amounts of data. #1679

Closed: kstaken closed this issue 2 years ago

kstaken commented 2 years ago

Describe the bug

We've now loaded Quickwit with 16.4 billion records and have started to trigger some out-of-memory (OOM) failures. In addition, we've also seen clustering issues, so to isolate the OOMs we scaled down to a single search node.

The test query matches 67 million records, but there's no sorting, timestamps, or anything complicated in the query, just a single criterion, i.e. field:value with max_hits=1.

On a single searcher node this query will run successfully in about 38 seconds on the first run, 25 seconds on the second run, and then consistently OOMs on the third. Queries are not concurrent and no other queries are submitted between subsequent runs.

In this case it's the kernel killing Quickwit since it's exceeding the memory limit allocated. The searcher is running in Kubernetes with 32 GB of RAM allocated.

Configuration:

This index currently has 1,340 splits, with a 10M doc target per split.

# quickwit --version
Quickwit 0.3.0 (commit-hash: 6d07599)

Memory and cache settings are the defaults.

searcher:
  fast_field_cache_capacity: 10G
  split_footer_cache_capacity:  1G
  max_num_concurrent_split_streams: 100

honiewelle commented 2 years ago

Hi @kstaken. Just curious, may I know your current index size total?

fulmicoton commented 2 years ago

Ah that might be an easy/silly one.

Flushing some information here. Contrary to tantivy, in Quickwit all of the data needs to be in anonymous memory in order to run a query on a given individual split. This can be a lot of short-lived memory usage. Searching does not cache any data on disk at the moment.

We mitigated this problem on the stream search side by introducing a semaphore that just makes sure we don't run search on all splits at the same time.

We however forgot to add such a semaphore on normal search. Let me push a fix today.
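
For anyone reading along, the fix is conceptually a counting semaphore wrapped around each per-split search. Here is a minimal sketch of the idea, not the actual Quickwit code; tokio and the placeholder split/result types below are assumptions made for illustration:

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

// Placeholder types standing in for the real split metadata / leaf search results.
struct SplitMetadata;
struct LeafSearchResult;

async fn search_single_split(_split: SplitMetadata) -> LeafSearchResult {
    // ... fetch the split's hotcache/footer and run the query on that one split ...
    LeafSearchResult
}

async fn search_all_splits(
    splits: Vec<SplitMetadata>,
    max_concurrent_split_searches: usize,
) -> Vec<LeafSearchResult> {
    // The semaphore caps how many splits are searched (and therefore pulled
    // into anonymous memory) at any one time.
    let semaphore = Arc::new(Semaphore::new(max_concurrent_split_searches));
    let mut handles = Vec::new();
    for split in splits {
        let semaphore = Arc::clone(&semaphore);
        handles.push(tokio::spawn(async move {
            // The permit is held for the duration of the split search and
            // released when it is dropped.
            let _permit = semaphore.acquire_owned().await.expect("semaphore closed");
            search_single_split(split).await
        }));
    }
    let mut results = Vec::new();
    for handle in handles {
        results.push(handle.await.expect("split search task panicked"));
    }
    results
}
```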

fulmicoton commented 2 years ago

@kstaken could you give this PR a try? https://github.com/quickwit-oss/quickwit/pull/1680

This fix is a bit naive. It limits the number of split searches that can run concurrently. Ideally we probably want to use a memory permit for this, but it is a bit tricky to do it without causing a deadlock.
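
To illustrate what a memory permit could look like, here is a rough sketch; the budget, the per-split estimate, and the helper function are all made up for illustration. The deadlock risk shows up as soon as one split's estimate exceeds the total budget, hence the clamp:

```rust
use std::future::Future;
use std::sync::Arc;
use tokio::sync::Semaphore;

/// One permit == one megabyte of the searcher's memory budget (illustrative unit).
const MEMORY_BUDGET_MB: u32 = 4_096;

/// Runs `search` only once `estimated_mb` worth of permits has been acquired.
async fn with_memory_permit<T>(
    budget: Arc<Semaphore>, // created with Semaphore::new(MEMORY_BUDGET_MB as usize)
    estimated_mb: u32,
    search: impl Future<Output = T>,
) -> T {
    // Clamp the request: if a single split asked for more permits than the
    // semaphore holds in total, acquire_many_owned would never succeed and
    // every later search would starve behind it.
    let permits = estimated_mb.min(MEMORY_BUDGET_MB);
    let _permit = budget
        .acquire_many_owned(permits)
        .await
        .expect("semaphore closed");
    search.await
}
```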

kstaken commented 2 years ago

@fulmicoton We'll try that out.

FYI, we had also updated to 0.3.1 and data is now up to 20.7B records, which is roughly 9TB of indexed data. At this size, non-timestamp-constrained queries now fail to complete. We'll see how it behaves on the branch.

fulmicoton commented 2 years ago

The PR link was apparently broken. I fixed it. The PR introduces a new param in the searcher called max_num_concurrent_split_searchs.

It should be sufficient to at least work... but the default was chosen with Amazon S3 in mind. On MinIO you can probably use a lower figure (e.g. max_num_concurrent_split_searchs: 10).

It might yield better performance and a lower memory footprint.
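
For reference, a searcher section with the new parameter added to the settings quoted earlier might look like the following. The values are just examples, and the exact spelling of the parameter depends on the build, as noted further down the thread:

```yaml
searcher:
  fast_field_cache_capacity: 10G
  split_footer_cache_capacity: 1G
  max_num_concurrent_split_streams: 100
  # Added by the PR above: caps how many splits are searched concurrently.
  # A lower value (e.g. 10) is probably a better fit for MinIO than the
  # S3-oriented default.
  max_num_concurrent_split_searchs: 10
```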

kstaken commented 2 years ago

That PR definitely does improve the situation but we're still seeing intermittent OOMs. Clearly there will be a practical single node limit on searches but I wouldn't expect that to manifest when only retrieving a single record until processing a much larger number of splits. We currently have just over 2000 splits.

Queries will complete right after the node restarts and then some number of queries later it will fail. Outside of the caches is there any other state that is expected to build up as queries are executed?

You can see in this graph how the memory usage steps up. Each step corresponds with a new execution of the same query. As it gets near 32GB it does seem to free something but still eventually gets killed.

[graph: memory usage stepping up with each execution of the same query, approaching the 32 GB limit]

kstaken commented 2 years ago

This issue should be reopened.

Here's another graph from a little more controlled test. This started with a fresh node, ran 10 identical queries serially, paused for 2 minutes then attempted to run 10 more of the same query. It ran out of memory on the 5th query in the second batch. The pause was somewhere around 00:40 to about 00:42 and it seems memory usage did go down slightly in that period.

[graph: memory usage during the controlled 10+10 query test]

fulmicoton commented 2 years ago

This is very abnormal indeed. I'll investigate today.

PSeitz commented 2 years ago

I did a small test on hdfs (4 splits), and the memory profile was flat. The small spike is the query, which hits 6M results. Maybe it's related to a high number of splits.

[bytehound memory profile: flat, with a small spike at the query]

PSeitz commented 2 years ago

I could reproduce a leak (800 splits, hdfs), which seems to be coming from tracing. But the leak is an order of magnitude smaller.

[bytehound leak graph and stack trace pointing at tracing]

fulmicoton commented 2 years ago

@PSeitz Good job, that could be it indeed. BTW I'm indexing 20B mini-docs on s3://quickwit-dev/oom. Right now it has 513 splits.

fulmicoton commented 2 years ago

I have not been able to reproduce so far. I tried with a 6 billion doc index, querying a term query that selects 0.4% of the docs (uniform distribution in the doc id space). I resumed indexing to go to 20 billion docs. That's on Amazon S3. My resident memory never exceeds 700MB and does not show an obvious drift.

PSeitz commented 2 years ago

@kstaken Can you share your index config?

fulmicoton commented 2 years ago

@kstaken @PSeitz Is your memory usage virtual memory or resident memory?

PSeitz commented 2 years ago

The tool I used is bytehound: LD_PRELOAD=./libbytehound.so quickwit run --config quickwit.yaml. It tracks allocations and deallocations. Given these are many small allocations, it's probably resident memory in my case.

fulmicoton commented 2 years ago

I tried to reproduce with Amazon S3 on the nightly build.

[memory usage plot from the reproduction attempt on Amazon S3]

fulmicoton commented 2 years ago

Some details: 1,588 splits, ~16 billion docs.

I run a query in a loop, matching 5% of the docs and returning only one document.

The query returns in <10s and matches 796,560,771 documents.

The one key difference is that because my docs are tiny, the hotcache is tiny too. I'll look into that.

fulmicoton commented 2 years ago

@kstaken Can you confirm you do not have a fast field, sorting, or start-end range being enforced? I don't quite grasp where the 28GB can be coming from.

fulmicoton commented 2 years ago

If there is a timestamp, by default we sort by timestamp... That probably explains why this is so slow and why it takes that much memory.

@kstaken. Can you try a lower value (2 times the number of threads on your CPU, for instance) for max_num_concurrent_split_streams and let us know if the results look a little bit better?

We can probably do a lot of small things to improve the situation.

kstaken commented 2 years ago

@fulmicoton unfortunately since we're testing with customer data I can't share the index config directly but it does have fast fields and a timestamp field.

The query that is being run is very simple, /api/v1/index/search?query=field:value&max_hits=1, and the field being searched is a raw text field.

The memory usage here is resident usage and is constrained by K8s memory limits on the pod. There's no swap on any of our nodes.

If there is default sorting on timestamp then that does likely account for the elevated usage.

We have 8 cores allocated to the searchers so we'll try setting max_num_concurrent_split_streams to 16 and run the same test.

kstaken commented 2 years ago

Here's a test with max_num_concurrent_split_streams set to 16, which is 2x the provisioned cores. In this case the queries started failing after 8 attempts, but the data size had increased to 30.5B records and 2,575 splits.

I also included the CPU graph, as there are some questionable things there as well. In particular, we never see usage beyond 4 cores when the pods have 8 cores available.

The point where CPU hits 0 at the end is when the node was killed. The memory not going down afterwards is some kind of metrics glitch, as the pod was automatically restarted.

[graph: memory climbs until the node is killed; CPU usage never exceeds 4 cores]

fulmicoton commented 2 years ago

max_num_concurrent_split_streams has no impact on your query. Can you confirm you modified max_num_concurrent_split_searchs?

Extra disclaimer: it was called max_num_concurrent_split_searchs for a bit in the unmerged PR. If your binary was not built from main there, it could be max_num_concurrent_split_searches. Also, we used to not spot misspelled parameters and just ignore them. That was fixed very recently.

fulmicoton commented 2 years ago

I think I understand the discrepancy... The culprit is.... malloc.

I created a 1,180-split index with a timestamp field. The timestamp field is heavily compressed (second resolution, and it assumes events spread over a few days and documents that are sorted).

I am searching and sorting by the timestamp field, using a single term query matching 5% of the docs. I pause 2 seconds after each search to check whether we properly release memory.

The fast field cache is 200MB. The hotcache is 10MB. The binary is around 58MB.

The problem was malloc...

Here is the RSS. Malloc is on the left; on the right I switched to jemalloc. I will split my work into several PRs and ship it rapidly.

[RSS graph: climbing with malloc on the left, flat with jemalloc on the right]
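
For context, the switch itself is tiny on the Rust side. Here is a sketch of the usual global-allocator pattern, assuming the tikv-jemallocator crate; the actual PR may wire it up differently:

```rust
// Cargo.toml (illustrative):
// [target.'cfg(not(target_env = "msvc"))'.dependencies]
// tikv-jemallocator = "0.5"

#[cfg(not(target_env = "msvc"))]
use tikv_jemallocator::Jemalloc;

// Route every heap allocation in the binary through jemalloc instead of the
// system allocator, which fragments badly under many short-lived allocations
// of varying sizes.
#[cfg(not(target_env = "msvc"))]
#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

fn main() {
    // ... start the searcher as usual; no other code changes are needed ...
}
```

This kind of fragmentation-driven growth would also explain why the RSS kept ratcheting up even though bytehound only found a much smaller leak above.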

fulmicoton commented 2 years ago

@kstaken jemalloc is now in main. If you have time, can you confirm that it solves your problem?

kstaken commented 2 years ago

Nice find. We'll do a new build and test it today.

kstaken commented 2 years ago

Some good news, some bad.

Memory usage looks much more sane now and aligns with what we would expect for the cache settings in use. That's a big win! 👍

The bad news is that query times are 69-72 seconds with no evidence of caching making any difference. Previously the first query would be in the 60-second range, but subsequent queries dropped to around 30 seconds.

We also tried doubling the cache sizes to confirm we weren't simply churning the cache and it made no difference.

It also never uses much beyond 2 CPUs but maybe that makes sense since it's using async concurrency for the split searches. Previously it was using 4 CPUs.

[graph: memory and CPU usage after the jemalloc change; memory stays within expected bounds]

fulmicoton commented 2 years ago

@kstaken Closing this ticket and opening another one.