kstaken closed this issue 2 years ago.
Hi @kstaken. Just curious, may I know your current index size total?
Ah that might be an easy/silly one.
Dumping some information here. Contrary to tantivy, in quickwit all of the data needs to be in anonymous memory in order to run a query on a given individual split. This can mean a lot of short-lived memory usage. Search does not cache any data on disk at the moment.
We mitigated this problem on the stream search side by introducing a semaphore that just makes sure we don't run search on all splits at the same time.
We however forgot to add such a semaphore on normal search. Let me push a fix today.
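To illustrate the approach (a hypothetical Python sketch, not Quickwit's actual Rust code): a semaphore bounds how many split searches can hold their short-lived memory at the same time, while the rest wait.

```python
# Hypothetical sketch of semaphore-bounded split searches.
# MAX_CONCURRENT_SPLIT_SEARCHES is an illustrative limit, not Quickwit's default.
import asyncio

MAX_CONCURRENT_SPLIT_SEARCHES = 3

async def search_split(split_id: int, sem: asyncio.Semaphore, stats: dict) -> str:
    async with sem:  # only MAX_CONCURRENT_SPLIT_SEARCHES tasks get past this at once
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.01)  # stand-in for loading the split and searching it
        stats["active"] -= 1
        return f"split-{split_id}: done"

async def search_all(num_splits: int) -> dict:
    sem = asyncio.Semaphore(MAX_CONCURRENT_SPLIT_SEARCHES)
    stats = {"active": 0, "peak": 0}
    await asyncio.gather(*(search_split(i, sem, stats) for i in range(num_splits)))
    return stats

stats = asyncio.run(search_all(20))
print(stats["peak"])  # never exceeds the semaphore limit
```

The peak concurrency (and hence peak anonymous memory) stays bounded no matter how many splits the query touches.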
@kstaken could you give this PR a try? https://github.com/quickwit-oss/quickwit/pull/1680
This fix is a bit naive. It limits the number of split searches that can run concurrently. Ideally we probably want to use a memory permit for this, but it is a bit tricky to do without causing a deadlock.
@fulmicoton We'll try that out.
FYI, we also updated to 0.3.1, and the data is now up to 20.7B records, which is roughly 9TB of indexed data. At this size, non-timestamp-constrained queries now fail to complete. We'll see how it behaves on the branch.
The PR link was apparently broken. I fixed it.
The PR introduces a new searcher param called `max_num_concurrent_split_searchs`.
It should be sufficient to at least work, but the default was chosen with Amazon S3 in mind.
On MinIO you can probably use a lower value (e.g. `max_num_concurrent_split_searchs: 10`).
It might yield better performance and a lower memory footprint.
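For reference, a sketch of what that override might look like in `quickwit.yaml` (the exact section and parameter name depend on the build you are running, since the parameter was renamed at some point):

```yaml
# Hypothetical searcher section; parameter name follows the PR above.
searcher:
  max_num_concurrent_split_searchs: 10
```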
That PR definitely does improve the situation but we're still seeing intermittent OOMs. Clearly there will be a practical single node limit on searches but I wouldn't expect that to manifest when only retrieving a single record until processing a much larger number of splits. We currently have just over 2000 splits.
Queries will complete right after the node restarts and then some number of queries later it will fail. Outside of the caches is there any other state that is expected to build up as queries are executed?
You can see in this graph how the memory usage steps up. Each step corresponds with a new execution of the same query. As it gets near 32GB it does seem to free something but still eventually gets killed.
This issue should be reopened.
Here's another graph from a little more controlled test. This started with a fresh node, ran 10 identical queries serially, paused for 2 minutes then attempted to run 10 more of the same query. It ran out of memory on the 5th query in the second batch. The pause was somewhere around 00:40 to about 00:42 and it seems memory usage did go down slightly in that period.
This is very abnormal indeed. I'll investigate today.
I did a small test on HDFS (4 splits), and the memory profile was flat. The small spike is the query, which hits 6M results. Maybe it's related to a high number of splits.
I could reproduce a leak (800 splits, HDFS), which seems to be coming from tracing. But that leak is an order of magnitude smaller.
@PSeitz Good job, that could be it indeed. BTW I'm indexing 20B mini-docs on s3://quickwit-dev/oom. Right now it has 513 splits.
I have not been able to reproduce so far. I tried with a 6 billion doc index, querying a term query that selects 0.4% of the docs (uniform distribution in the doc id space). I resumed indexing to go to 20 billion docs. That's on Amazon S3. My resident memory never exceeds 700MB and does not show an obvious drift.
@kstaken Can you share your index config?
@kstaken @PSeitz Is your memory usage virtual memory or resident memory?
The tool I used is bytehound (`LD_PRELOAD=./libbytehound.so quickwit run --config quickwit.yaml`). It tracks allocations and deallocations. Given these are many small allocations, it's probably resident memory in my case.
I tried to reproduce with Amazon S3 on the nightly build.
Some details: 1,588 splits, ~16 billion docs.
I ran in a loop a query matching 5% of the docs and returning only one document.
The query returns in <10s and matches 796,560,771 documents.
The one key difference is that because my docs are tiny, the hotcache is tiny too. I'll look into that.
@kstaken Can you confirm you do not have a fast field, sorting, or start-end range being enforced? I don't quite grasp where the 28GB can be coming from.
If there is a timestamp, by default we sort by timestamp... That probably explains why this is so slow and why it takes that much memory.
@kstaken Can you try a lower value (2 times the number of threads on your CPU, for instance) for `max_num_concurrent_split_streams` and let us know if the results look a little bit better?
We can probably do a lot of small things to improve the situation.
@fulmicoton unfortunately since we're testing with customer data I can't share the index config directly but it does have fast fields and a timestamp field.
The query that is being run is very simple, `/api/v1/index/search?query=field:value&max_hits=1`, and the field being searched is a raw text field.
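For anyone reproducing this, here is a hypothetical helper (the index id and field are placeholders, not the actual customer index) that builds that search URL with proper encoding:

```python
# Hypothetical reproduction helper: build the search URL used above.
# "my-index" and "field:value" are placeholders.
from urllib.parse import urlencode

def search_url(host: str, index_id: str, query: str, max_hits: int = 1) -> str:
    params = urlencode({"query": query, "max_hits": max_hits})
    return f"http://{host}/api/v1/{index_id}/search?{params}"

url = search_url("127.0.0.1:7280", "my-index", "field:value")
print(url)
```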
The memory usage here is resident usage and is constrained by K8s memory limits on the pod. There's no swap on any of our nodes.
If there is default sorting on timestamp then that does likely account for the elevated usage.
We have 8 cores allocated to the searchers, so we'll try setting `max_num_concurrent_split_streams` to 16 and run the same test.
Here's a test with `max_num_concurrent_split_streams` set to 16, which is 2 x provisioned cores. In this case the queries started failing after 8 attempts, but data size had increased to 30.5B records and 2,575 splits.
I also included the CPU graph as there are some questionable things there as well. In particular, we never see usage beyond 4 cores when the pods have 8 cores available.
When CPU hits 0 at the end is when the node was killed. The memory not going down is some kind of metrics glitch as the pod was automatically restarted.
`max_num_concurrent_split_streams` has no impact on your query. Can you confirm you modified `max_num_concurrent_split_searchs`?
Extra disclaimer: it was called `max_num_concurrent_split_searchs` for a bit in the unmerged PR. If your binary was not built from main there, it could be `max_num_concurrent_split_searches`.
Also, we used to not spot misspelled parameters and just ignore them. That was fixed super recently.
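The fix for misspelled parameters presumably amounts to validating the config against the known keys instead of silently dropping unknowns; a minimal Python sketch (the key names are illustrative, not Quickwit's actual list):

```python
# Hypothetical sketch of "reject unknown parameters" config validation.
KNOWN_SEARCHER_KEYS = {
    "max_num_concurrent_split_searches",  # illustrative; exact names vary by version
    "fast_field_cache_capacity",
}

def validate_searcher_config(cfg: dict) -> None:
    # Raise instead of silently ignoring misspelled keys.
    unknown = sorted(set(cfg) - KNOWN_SEARCHER_KEYS)
    if unknown:
        raise ValueError(f"unknown searcher parameter(s): {unknown}")

validate_searcher_config({"max_num_concurrent_split_searches": 10})  # passes
```

With this in place, a typo like `max_num_concurrent_split_searchs` fails fast at startup rather than leaving the default silently in effect.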
I think I understand the discrepancy... The culprit is.... malloc.
I created a 1,180-split index with a timestamp field. The timestamp field is heavily compressed (second resolution, and it assumes events spread over a few days and documents that are sorted).
I am searching and sorting by the timestamp field, with a single term query matching 5% of the docs. I pause 2 seconds after each search to spot whether we properly release memory.
The fast field cache is 200MB. The hotcache is 10MB. The binary is around 58MB.
The problem was malloc...
Here is the RSS. Malloc is on the left; on the right I switched to jemalloc. I will split my work into several PRs and ship it rapidly.
@kstaken jemalloc is now in main. If you have time, can you confirm that it solves your problem?
Nice find. We'll do a new build and test it today.
Some good news, some bad.
Memory usage looks much more sane now and aligns with what we would expect for the cache settings in use. That's a big win! 👍
The bad news is that query times are 69-72 seconds with no evidence of caching making any difference. Previously the first query would be in the 60 second range but subsequent queries dropped to around 30 seconds.
We also tried doubling the cache sizes to confirm we weren't simply churning the cache and it made no difference.
It also never uses much beyond 2 CPUs but maybe that makes sense since it's using async concurrency for the split searches. Previously it was using 4 CPUs.
@kstaken Closing this ticket and opening another one.
Describe the bug
We've now loaded Quickwit with 16.4 billion records and have started to trigger some out of memory (OOM) failures. In addition, we've also seen clustering issues, so to isolate the OOMs we scaled down to a single search node.
The test query matches 67 million records, but there's no sorting, timestamps, or anything complicated on the query, just a single criterion, i.e. `field:value` with `max_hits=1`. On a single searcher node this query will run successfully in about 38 seconds on the first run, 25 seconds on the second run, and then consistently OOM on the third. Queries are not concurrent and no other queries are submitted between subsequent runs.
In this case it's the kernel killing Quickwit since it's exceeding the memory limit allocated. The searcher is running in Kubernetes with 32 GB of RAM allocated.
Configuration:
This index currently has 1,340 splits, with 10M doc target per split.
Memory and cache settings are the defaults.