olivernn / lunr.js

A bit like Solr, but much smaller and not as bright
http://lunrjs.com
MIT License

Serialize big indexes (get rid of TokenStore?) #85

Closed · Pym closed this 7 years ago

Pym commented 10 years ago

Hello there,

I'm building a browser based application (without a webserver). I'll have multiple documents to index (in my case, about 700) and the cumulated weight of those documents is about 5 MB.

For now, my application rebuilds the index every time the webpage is displayed, but it takes about 40 seconds to build the full index, which is very long.

I want to store the index in the browser but I can't even serialize it. Here is what I get when I try to JSON.stringify() my index:

Uncaught RangeError: Maximum call stack size exceeded

After a short investigation, the problem seems to come from JSON.stringify(), which can't handle such a big object. In fact, a heap snapshot shows that my TokenStore alone is about 70 MB!

[Screenshot, 2014-04-30: heap snapshot showing the TokenStore at ~70 MB]

I won't even try to store such a big object in a browser, but maybe I'm looking at this the wrong way. Maybe there is a way to store the index without the TokenStore and rebuild it somehow?
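
For context, this is roughly the caching flow I'm after (a sketch; loadOrBuildIndex and CACHE_KEY are placeholder names for my own code, with toJSON() and lunr.Index.load() being lunr's serialisation round-trip):

var CACHE_KEY = 'lunr-index';

function loadOrBuildIndex(buildIndex) {
    var cached = localStorage.getItem(CACHE_KEY);
    if (cached) {
        // Fast path: restore a previously serialized index.
        return lunr.Index.load(JSON.parse(cached));
    }
    // Slow path: the ~40 second full rebuild.
    var idx = buildIndex();
    // This stringify is the call that currently throws the RangeError:
    localStorage.setItem(CACHE_KEY, JSON.stringify(idx.toJSON()));
    return idx;
}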

Thank you!

olivernn commented 10 years ago

70 MB is certainly quite large for the token store. I have a feeling that the error you are seeing when serialising is more down to the structure of the TokenStore than its size, but I could be wrong.

The token store is fairly fundamental to how lunr works; it effectively is the index of lunr, mapping tokens to documents. It might be possible to selectively serialise a lunr index, leaving out the token store, and then rebuild it from the contents of the document store when loading. I have a feeling that this might be more effort than it is worth though.
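
Something like this rough sketch is what I have in mind (hypothetical; lunr has no built-in support for this, and the rebuild step is the hard part):

function serializeWithoutTokenStore(idx) {
    var json = idx.toJSON();
    delete json.tokenStore; // drop the trie, by far the largest part
    return JSON.stringify(json);
}

// On load you would then have to walk the document store and re-add every
// token to a fresh lunr.TokenStore, which may cost more than it saves.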

lunr.TokenStore is currently implemented as a trie, which allows lunr to support wildcard or prefix searches, e.g. a search for 'hel' would find documents with 'help', 'hello', 'held', etc.

The way this is currently implemented in JavaScript is as a massively nested object, e.g.

{
    'h': {
        'e': {
            'l': {
                'l': {
                    'o': {
                        docs: doc_id_references
                    }
                },
                'p': {
                    docs: doc_id_references
                },
                'd': {
                    docs: doc_id_references
                }
            }
        }
    }
}

This is a fairly simple implementation of a trie and doesn't try to do anything clever to reduce its size. The benefit is simplicity; it also potentially allows different TokenStores to be merged with relative ease.
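
As a minimal sketch (not lunr's actual code), a prefix lookup just walks this character-per-level structure, e.g. for the 'hel' search above:

function getSubtree(root, prefix) {
    var node = root;
    for (var i = 0; i < prefix.length; i++) {
        node = node[prefix.charAt(i)];
        if (!node) return null; // no indexed token starts with this prefix
    }
    return node; // the subtree containing 'hello', 'help', 'held', ...
}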

It is certainly possible to use a cleverer trie implementation, one that collapses redundant nodes, perhaps something like this:

{
    'hel': {
        'lo': {
            docs: doc_id_references
        },
        'p': {
            docs: doc_id_references
        },
        'd': {
            docs: doc_id_references
        }
    }
}

It'd be really interesting to get some actual statistics about this, but I have a feeling that an implementation like this would significantly reduce the size of the TokenStore, as well as the stack depth used when serialising.
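
For illustration, collapsing could look something like this (a rough sketch over the plain-object trie above, not lunr code):

function compress(node) {
    var out = {};
    if (node.docs) out.docs = node.docs;
    Object.keys(node).forEach(function (key) {
        if (key === 'docs') return;
        var edge = key;
        var child = node[key];
        var childKeys = Object.keys(child).filter(function (k) { return k !== 'docs'; });
        // Absorb chains like 'h' -> 'e' -> 'l' into a single 'hel' edge
        // while the child has exactly one branch and no docs of its own.
        while (childKeys.length === 1 && !child.docs) {
            edge += childKeys[0];
            child = child[childKeys[0]];
            childKeys = Object.keys(child).filter(function (k) { return k !== 'docs'; });
        }
        out[edge] = compress(child);
    });
    return out;
}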

I'm currently working on changes to the token store to better support wildcard searches, which could double its size: storing each token reversed as well as forwards allows more advanced searches such as "*oo", "f*o" and "fo*".

If you wanted to help, a great contribution would be an implementation of a more efficient trie; take a look at lunr.Trie and its tests for the kind of interface it would need to support.

For now, a simple workaround would be to reduce the number of tokens being indexed by adding words to lunr.stopWordFilter. You can probably find good candidates by looking for very common words in your index; the snippet below lists the 100 most common tokens:

idx.corpusTokens
    .map(function (t) {
        // [token, fraction of the documents in the index that contain it]
        return [t, Object.keys(idx.tokenStore.get(t)).length / idx.documentStore.length];
    })
    .sort(function (a, b) { return b[1] - a[1]; }) // most common first
    .slice(0, 100);

Very common tokens will not contribute much to document search scores and can be added to the stop word list. It's worth experimenting with this for large indexes; be sure to keep track of how it affects search results.
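
Applied before building the index, that could look something like this (a sketch; the word list and function name are made up):

var extraStopWords = { 'lorem': true, 'ipsum': true }; // candidates found with the snippet above

var extraStopWordFilter = function (token) {
    if (!extraStopWords[token]) return token; // returning undefined drops the token
};

lunr.Pipeline.registerFunction(extraStopWordFilter, 'extraStopWordFilter');
idx.pipeline.before(lunr.stemmer, extraStopWordFilter); // filter before stemming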

benpickles commented 10 years ago

John Resig looked at different trie implementations: http://ejohn.org/blog/javascript-trie-performance-analysis/

olivernn commented 10 years ago

I've been working on trying to reduce the size of the token store. I've tried using a radix tree instead of the current trie-based implementation.

In my testing so far it reduces the in-memory size of the index by ~20%. Lookup performance doesn't seem to have degraded much (I still need to do more testing around this though). The downside is that the implementation is significantly more complex, and I need to spend more time simplifying and understanding it, as well as making sure the increased complexity is worth it for the reduced size. It would be a great help if you could provide the data you are trying to index, so I can test with it.

I also experimented with ways to make the serialised index smaller. I'm trying to cut down the number of parts of an index that actually need to be serialised, and re-building the rest when loading the built index. This hasn't gone quite as well so far. For example, re-building the token trees when loading makes the load time about 100x slower. The example index goes from loading in ~2ms to around ~200ms.

JVillella commented 10 years ago

Has there been any follow up to this? I'm having an identical issue trying to index web pages by first stripping the inner text out of the body. 10 small sites is enough to cause the call stack size exceeded error.

olivernn commented 10 years ago

I haven't spent much time recently on lunr so haven't really managed to progress much on this.

Out of interest though, roughly what size is the index when you start seeing the call stack errors?

kig commented 8 years ago

I've been hacking on making the serialized index smaller as I'm trying to do client-side search of the v8 codebase (which is around 120 MB). I've got a small prototype of a packed index up at https://github.com/kig/lunr.js

The unpacked index is 180 MB (20 MB gzipped). The packed index is 20 MB (6.5 MB gzipped). And there's still a good chunk of compression possible: I played with quantizing the termFreqs, and that alone dropped the gzipped file size to 1.4 MB.
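
The quantization idea, as a sketch (illustrative, not the code in my branch): snapping termFreq values to a small set of buckets makes the serialized numbers repeat, and gzip compresses repeated substrings very well.

function quantizeTermFreq(tf) {
    return Math.round(tf * 64) / 64; // 64 buckets; tune for quality vs. size
}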

I have a simple unit test on the serialiser, but haven't tested it on any real datasets.

vladimiry commented 8 years ago

@kig what about using client-side compression, something like lz-string (http://pieroxy.net/blog/pages/lz-string/index.html)? Also, I noticed this claim on the https://github.com/weixsong/elasticlunr.js page: "The index size is about half size of lunr.js index file."
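
For example (a sketch using lz-string's documented API; localStorage stores UTF-16 strings, hence compressToUTF16):

var packed = LZString.compressToUTF16(JSON.stringify(idx.toJSON()));
localStorage.setItem('search-index', packed);

// Later:
var json = JSON.parse(LZString.decompressFromUTF16(localStorage.getItem('search-index')));
var idx2 = lunr.Index.load(json);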

kig commented 8 years ago

You can play with a live demo at https://tabletree.io - I ended up rolling a simplified search optimized for speed and index wire size. The index is 1MB gzipped and the searches take <20ms, so you get nice live updates as you type.

Client-side archiving might get you a 10-20% wire-size improvement over gzip, but you end up spending more time decompressing the data, so I figured I'd just make the source data more gzip-friendly. Building the search index on the client side is a non-starter as the corpus is a hundred megabytes; I need to build a compressed index off-line and transfer just that to the client.

olivernn commented 8 years ago

@kig that is an awesome live demo, reminds me of this scene in Jurassic Park!

Prinzhorn commented 7 years ago

I found that the size of the JSON is not the issue; it's the depth that causes the call stack size to be exceeded. In my case I had a few very long words that were indexed, caused by blindly using $('body').text(), which does not add whitespace between text nodes and can produce very long "words" that never get tokenized. These result in a very deeply nested object.
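
A quick way to see that it's depth rather than size (illustrative; the exact limit depends on the engine's stack size):

var root = {}, node = root;
for (var i = 0; i < 100000; i++) {
    node = node.a = {}; // one nesting level per character of a giant "word"
}
JSON.stringify(root); // RangeError: Maximum call stack size exceeded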

I went with something like the following in our indexing script (which uses cheerio to index static HTML files):

$('body *').map(function() {
    // Take each element's text separately, then join with spaces so
    // adjacent text nodes can't fuse into one giant untokenized "word".
    return $(this).text();
}).get().join(' ');

olivernn commented 7 years ago

Three years later there is a major new version of Lunr, 2.x, which has a much simpler serialised structure. Upgrading to this latest version should hopefully alleviate any issues with serialising the large, deeply nested data structures that Lunr 0.x and 1.x use.
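
The 2.x round-trip, as documented in the lunr guides:

var serialized = JSON.stringify(idx);
var idx2 = lunr.Index.load(JSON.parse(serialized));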