olivernn / lunr.js

A bit like Solr, but much smaller and not as bright
http://lunrjs.com
MIT License

Lunr index building is killed with OOM error #306

Open emirotin opened 6 years ago

emirotin commented 6 years ago

Original report is here https://github.com/olivernn/lunr.js/issues/305#issuecomment-336872069

Link to repro repo: https://github.com/emirotin/lunr-failure. You need to untar the JSON dump file before running the code.

The process is using around 1.5GB RAM when it crashes. I'm on a 2017 MBP with 16GB RAM, and the overall memory pressure when running the tests is pretty low (combined App Memory is about 8GB, and the chart is all green).

olivernn commented 6 years ago

I get similar results: it gets to about 54K entries and then slows down massively. I got bored waiting for it to crash...

I re-ran the command again, but this time increased the amount of space available to node to 4GB:

$ node --max-old-space-size=4096 build-index.js

It got to 156K entries before slowing down again, so I think you just need to increase the amount of memory you give node.

My guess is that it slows down as the inverted index gets large and starts to exhaust the available heap space: v8 has to do a lot of heavy lifting to make room for the inverted index, and its memory space is probably getting more and more fragmented.

If your data contains multiple languages, my guess is that using the appropriate stemmers and stop words significantly reduces the amount of data that needs to enter the index, hence it manages to get further before node runs out of space.
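
For reference, wiring that up looks roughly like this (a sketch assuming the separate lunr-languages plugin, which isn't bundled with lunr itself):

```js
// Sketch only: assumes the third-party lunr-languages plugin is installed
const lunr = require('lunr');
require('lunr-languages/lunr.stemmer.support')(lunr);
require('lunr-languages/lunr.ru')(lunr);
require('lunr-languages/lunr.multi')(lunr);

const idx = lunr(function () {
  // run both the English and Russian pipelines (stemming + stop words)
  this.use(lunr.multiLanguage('en', 'ru'));
  this.ref('id');
  this.field('text');

  documents.forEach(doc => this.add(doc)); // `documents` is your own array of docs
});
```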

Out of interest, what is your use case for this index? I'm interested to know what kind of query performance you manage to get, if it ever manages to finish building the index... At this scale you might be pushing the limits of what Lunr is capable of; the index it might eventually generate is certainly going to be very large, probably too large to send to a browser?

emirotin commented 6 years ago

Thanks for your thoughtful investigation!

It's quite possible indeed that I've hit the limits of Lunr and have to use another solution, though Lunr is really easy to work with and I had a good experience using it before (with a much smaller number of docs, though), so it was my first candidate.

Here's my use case. There's a remote DB of docs of some sort (essentially they're questions similar to those used in pub quizzes). The authors often use it to check whether new questions repeat facts or ideas previously used by other authors. It has several issues:

The questions are mostly in Russian, with occasional words in English and, more rarely, Latin, French, Italian, German, etc.

So I want to create fully local apps to run searches against this DB. For that I'm (going to be):

1. creating a replica of this DB on my server (this already works; I have a SQLite DB with all the questions);
2. indexing this replica and building a search index suitable for full-text search with specific features (being able to search only specific fields, powerful language-aware stemming, and, optionally but nice to have, fuzzy search with an acceptable word distance);
3. creating the local apps (first desktop, then potentially mobile) that will periodically download the search index (and maybe the SQLite DB as well) and run all searches locally, offline, against the local index / DB.

The app can download updates in the background, so it's not that bad if it has to fetch 500M every now and then (for the desktop app; I haven't thought about the mobile strategy yet).
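
Roughly what I have in mind for steps 2 and 3, as a sketch (the field names and file names here are made up):

```js
// Sketch only: field names and file names are made up for illustration
const fs = require('fs');
const lunr = require('lunr');

// Step 2 (on the server): build the index from the replica and write it to disk
const docs = JSON.parse(fs.readFileSync('questions.json', 'utf8'));

const idx = lunr(function () {
  this.ref('id');
  this.field('question');
  this.field('answer');
  docs.forEach(doc => this.add(doc));
});

fs.writeFileSync('search-index.json', JSON.stringify(idx));

// Step 3 (in the app): load the downloaded index and search it locally, offline
const loaded = lunr.Index.load(JSON.parse(fs.readFileSync('search-index.json', 'utf8')));
console.log(loaded.search('some query'));
```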

As JS is my language of choice, I'd like to have the server in Node.js, the desktop app in Electron / React, and the mobile apps in React Native, so a JS-first solution is really nice to have :)

The provided dump is the current state of the production data, and it's slowly growing over time, so I need it to scale to at least 1M docs.

olivernn commented 6 years ago

Interesting, thanks for the context. I think this might be pushing Lunr too far; I've not heard of people with such large indexes. That said, I'm now interested to see how far we can push the current implementation!

From the testing that I've done so far it looks like the bottleneck is the GC and allocation size. I reckon getting indexing to complete at all, and then not take hours, is going to be an exercise in tuning v8 and its GC. I'm slightly concerned that it slows down so much while just adding the documents; it hasn't even got to the part where it constructs the TokenSet, which might also put a lot of extra pressure on the GC.
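
To see where the pressure builds up, something like this is enough to watch the heap while documents are being added (a sketch using node's built-in v8 module; where to call it is up to you):

```js
// Sketch: log heap usage periodically to see where allocation pressure builds up
const v8 = require('v8');

function logHeap(label) {
  const stats = v8.getHeapStatistics();
  const mb = bytes => Math.round(bytes / 1024 / 1024) + 'MB';
  console.log(label + ':', mb(stats.used_heap_size), 'used of',
              mb(stats.total_heap_size), 'allocated, limit', mb(stats.heap_size_limit));
}

// e.g. call logHeap(count + ' docs') every 10K documents inside the build loop
```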

My next step is going to be launching an instance with a decent amount of memory and giving it all to node. How many documents are in the dump that is part of that repo?

Even if we don't manage to get a working solution, I'm sure we'll find a bunch of optimisations that might benefit more moderate indexes.

emirotin commented 6 years ago

:100: I managed to finish the indexing today with an 8GB or 10GB max memory limit for Node (it takes about 6-7 minutes), and the final index construction from the builder takes negligible time. But serializing the index to JSON unfortunately takes too much time. I was able to make it work with bfj, but streaming it to disk was taking so long that I had to abort it. I also expect reading such a big index will take ages, though I may be wrong.
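
For reference, the bfj attempt is roughly this (a sketch of how I'm calling it; the output file name is just an example):

```js
// Sketch: stream the serialised index to disk instead of building one huge JSON string
const bfj = require('bfj');

bfj.write('search-index.json', idx.toJSON())
  .then(() => console.log('index written'))
  .catch(err => console.error('failed to write index', err));
```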

ATM I'm looking into an alternative approach (SQLite, which I already have as the intermediary storage, + FTS5 + a Snowball stemmer), but I'd be happy to get Lunr working, or at least help improve it. It's definitely the easiest and friendliest FTS in the JS world.

The provided dump has all the 300K+ docs that exist in the DB as of now.

olivernn commented 6 years ago

I noticed while adding some more logging to the script that the inverted index was very large: over 400K keys. This implies that there are over 400K unique tokens in your documents. Apparently there are ~200K English words in the OED:

The Second Edition of the 20-volume Oxford English Dictionary contains full entries for 171,476 words in current use, and 47,156 obsolete words. To this may be added around 9,500 derivative words included as subentries.

Let's say there's a similar number for Russian (I don't know); that gets us to the ~400K mark. But does your data actually contain most of the words in the dictionary? I don't know what kind of data it is, but does that number of unique terms sound feasible?
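
The extra logging I mentioned is roughly this (a sketch; invertedIndex is an internal builder property, so don't rely on it):

```js
// Sketch: count unique tokens seen so far by peeking at the builder's inverted index
const lunr = require('lunr');
let count = 0;

const idx = lunr(function () {
  this.ref('id');
  this.field('text'); // field names depend on your documents

  docs.forEach(doc => { // `docs` is the array loaded from the JSON dump
    this.add(doc);
    count += 1;
    if (count % 10000 === 0) {
      console.log(count, 'docs,', Object.keys(this.invertedIndex).length, 'unique tokens');
    }
  });
});
```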

Running locally I managed to get all the documents added to the index, but then trying to serialise the index just hangs. I suspect that JSON.stringify isn't really cut out for such large objects; I think this is what you found too, based on your previous comment...

emirotin commented 6 years ago

Russian dictionaries may indeed contain up to 200K words (that's the case for Dahl's dictionary), but that includes lots of obsolete, regional, and rarely used words, as well as derivatives.

My docs are written in modern natural language and are usually short (1-3 sentences). They also contain sources, which are bibliographical references and URLs. They sometimes contain non-Russian words, but those are individual terms or titles that shouldn't significantly contribute to the number of unique tokens. So overall it's very unlikely that my corpus really contains 400K unique tokens.

And yes, regarding serialization: as I wrote in the previous comment, I found a way to serialize it with the bfj lib, but it's unfeasibly slow.

mitra42 commented 4 years ago

Yes - lunr looked really good at first glance: simple to integrate and with the right kinds of features. But someone else noticed that it holds the index in memory, so it looks like a no-go unless someone manages to integrate it with LevelDB or something similar.

Did any of you who commented above find an alternative (JavaScript) search engine to integrate?

gavinr commented 4 years ago

These errors (FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory) are happening to me too. The error happens for me around the ~160th call to this.add(doc);. Each of my documents has a large "body" (think long blog post). Is this expected?