70MB is certainly quite large for the token store. I have a feeling that the error you are seeing when serialising is possibly more down to the structure of the TokenStore rather than its size, but I could be wrong.
The token store is fairly fundamental to how lunr works; it effectively is the index of lunr, mapping tokens to documents. It might be possible to selectively serialise a lunr index without the token store, and then rebuild it from the contents of the document store on load. I have a feeling that this might be more effort than it is worth, though.
lunr.TokenStore is currently implemented as a trie, which allows lunr to support wildcard or prefix searches, e.g. a search for 'hel' would find documents with 'help', 'hello', 'held' etc.
The way this is currently implemented in JavaScript is as a massively nested object, e.g.
{
  'h': {
    'e': {
      'l': {
        'l': {
          'o': {
            docs: doc_id_references
          }
        },
        'p': {
          docs: doc_id_references
        },
        'd': {
          docs: doc_id_references
        }
      }
    }
  }
}
This is a fairly simple implementation of a trie, and doesn't try to do anything clever to reduce its size. The benefit of this is simplicity; it also potentially allows different TokenStores to be merged with relative ease.
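To make the prefix search concrete, here is a minimal sketch (illustrative, not lunr's actual code) of a lookup walking such a nested object:

// Walk the nested-object trie to the node for `prefix`, then
// collect every docs reference in the subtree below it.
function prefixLookup (trie, prefix) {
  var node = trie

  // Descend one character at a time; bail out if the prefix is absent.
  for (var i = 0; i < prefix.length; i++) {
    node = node[prefix.charAt(i)]
    if (!node) return []
  }

  // Depth-first collection of all docs under this node.
  var results = []
  var stack = [node]

  while (stack.length) {
    var current = stack.pop()
    if (current.docs) results.push(current.docs)

    for (var key in current) {
      if (key !== 'docs') stack.push(current[key])
    }
  }

  return results
}

// prefixLookup(tokenStore, 'hel') gathers the docs for
// 'hello', 'help' and 'held' from the example above.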
It is certainly possible to use a cleverer implementation of a trie, one that collapses down redundant nodes in the trie, perhaps something like this:
{
  'hel': {
    'lo': {
      docs: doc_id_references
    },
    'p': {
      docs: doc_id_references
    },
    'd': {
      docs: doc_id_references
    }
  }
}
It'd be really interesting to get some actual statistics on this, but I have a feeling that an implementation like this would significantly reduce the size of the TokenStore, as well as reducing the stack depth needed when serialising.
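To give an idea of what collapsing redundant nodes involves, here is a rough sketch of inserting a token into such a compressed trie; the fiddly part is splitting an existing edge when a new token only partially matches it (none of these names come from lunr):

// Insert `token` into a compressed trie where each key is a string
// fragment rather than a single character. Splits existing keys
// when the new token only shares part of a fragment with them.
function radixInsert (node, token, docRef) {
  if (token.length === 0) {
    node.docs = node.docs || []
    node.docs.push(docRef)
    return
  }

  for (var key in node) {
    if (key === 'docs') continue

    // Find the length of the common prefix of key and token.
    var i = 0
    while (i < key.length && i < token.length && key[i] === token[i]) i++

    if (i === 0) continue // no overlap, try the next child

    if (i < key.length) {
      // Partial overlap: split the existing edge into two.
      var child = node[key]
      delete node[key]
      var mid = {}
      mid[key.slice(i)] = child
      node[key.slice(0, i)] = mid
      radixInsert(mid, token.slice(i), docRef)
    } else {
      // `key` fully matches the start of token: recurse below it.
      radixInsert(node[key], token.slice(i), docRef)
    }
    return
  }

  // No child shares a prefix: add a new edge for the whole token.
  node[token] = { docs: [docRef] }
}

Inserting 'hello', 'help' and 'held' into an empty object with this function produces exactly the collapsed shape shown above:

var store = {}
radixInsert(store, 'hello', 1)
radixInsert(store, 'help', 2)
radixInsert(store, 'held', 3)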
I'm currently working on changes to the token store to better support wildcard searches, which could double the size of the token store: storing each token reversed as well as non-reversed allows for more advanced searches such as "*oo", "f*o" and "fo*".
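The reversed-token trick is easier to see with a sketch; this assumes a prefixLookup like the one above, plus hypothetical add and intersect helpers:

// Reverse a token so that suffix queries become prefix queries.
function reverseToken (token) {
  return token.split('').reverse().join('')
}

// Index time: each token goes into a forward and a reversed store.
add(forwardStore, token, docRef)
add(reversedStore, reverseToken(token), docRef)

// "fo*"  -> a plain prefix lookup in the forward store.
var trailing = prefixLookup(forwardStore, 'fo')

// "*oo"  -> reverse the literal part, prefix-search the reversed store.
var leading = prefixLookup(reversedStore, reverseToken('oo'))

// "f*o"  -> intersect matches from both stores.
var middle = intersect(
  prefixLookup(forwardStore, 'f'),
  prefixLookup(reversedStore, reverseToken('o'))
)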
If you wanted to help, a great contribution would be an implementation of a more efficient trie; take a look at lunr.Trie and its tests for the kind of interface it would need to support.
For now a simple workaround would be to reduce the number of tokens that are being indexed; this can be done by adding words to lunr.stopWordFilter. You can probably find good candidates by looking for very common words in your index. The snippet below will get you the 100 most common tokens in your index:
idx.corpusTokens
  .map(function (t) {
    // fraction of documents that contain this token
    return [t, Object.keys(idx.tokenStore.get(t)).length / idx.documentStore.length]
  })
  .sort(function (a, b) { return b[1] - a[1] })
  .slice(0, 100)
Very common tokens will not contribute much to document search scores and can be added to the stop word list. It's worth experimenting with adding words to the stop word list for large indexes; be sure to keep track of how it affects search results.
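If modifying lunr.stopWordFilter directly is awkward, a custom pipeline function is another option; this sketch assumes lunr's pipeline convention that returning undefined drops a token:

// Extra high-frequency words found with the snippet above.
var extraStopWords = ['example', 'another', 'word']

// Pipeline functions return the token to keep it,
// or undefined to drop it from the index.
var extraStopWordFilter = function (token) {
  if (extraStopWords.indexOf(token) === -1) return token
}

// Run after the built-in stop word filter so both apply.
idx.pipeline.after(lunr.stopWordFilter, extraStopWordFilter)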
John Resig looked at different trie implementations: http://ejohn.org/blog/javascript-trie-performance-analysis/
I've been working on trying to reduce the size of the token store. I've tried using a Radix Tree instead of the current Trie based implementation.
With my testing so far it reduces the in-memory size of the index by ~20%. Lookup performance doesn't seem to have degraded by much (I still need to do some more testing around this, though). The downside is that the implementation is significantly more complex, and I need to spend some more time trying to simplify and understand it, as well as making sure the increased complexity is worth it for the reduced size. It would be a great help if you could provide the data you are trying to index for me to test with.
I also experimented with ways to make the serialised index smaller. I'm trying to cut down the number of parts of an index that actually need to be serialised, and re-building the rest when loading the built index. This hasn't gone quite as well so far. For example, re-building the token trees when loading makes the load time about 100x slower. The example index goes from loading in ~2ms to around ~200ms.
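A sketch of the shape of that experiment (not the actual patch): serialise the trie as flat token/docs pairs, and re-insert them into a fresh trie on load.

// Serialise: flatten the trie into [token, docs] pairs instead of
// writing out the deeply nested object itself.
function flattenStore (node, prefix, out) {
  if (node.docs) out.push([prefix, node.docs])
  for (var key in node) {
    if (key !== 'docs') flattenStore(node[key], prefix + key, out)
  }
  return out
}

// Load: rebuild the nested trie by re-inserting every pair. This
// trades a much smaller serialised form for the extra load time
// mentioned above.
function rebuildStore (pairs) {
  var trie = {}
  pairs.forEach(function (pair) {
    var node = trie
    var token = pair[0]
    for (var i = 0; i < token.length; i++) {
      node = node[token.charAt(i)] = node[token.charAt(i)] || {}
    }
    node.docs = pair[1]
  })
  return trie
}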
Has there been any follow up to this? I'm having an identical issue trying to index web pages by first stripping the inner text out of the body. 10 small sites is enough to cause the call stack size exceeded error.
I haven't spent much time recently on lunr so haven't really managed to progress much on this.
Out of interest though, roughly what size is the index when you start seeing the call stack errors?
I've been hacking on making the serialized index smaller as I'm trying to do client-side search of the v8 codebase (which is around 120 MB). I've got a small prototype of a packed index up at https://github.com/kig/lunr.js
The unpacked index is 180 MB (20 MB gzipped). The packed index is 20 MB (6.5 MB gzipped). And there's still a good chunk of compression possible. I played with quantizing the termFreqs and that alone dropped the gzipped file size to 1.4 megs.
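To illustrate the quantizing: termFreq values are long floats, and snapping them to a handful of buckets means the serialised file contains far fewer distinct strings, which gzip then compresses much better. A sketch (the bucket boundaries here are arbitrary):

// Snap a raw term frequency to one of a few buckets. Fewer distinct
// values means much better gzip compression, at the cost of slightly
// coarser ranking scores.
function quantizeTf (tf) {
  if (tf <= 0.25) return 0.25
  if (tf <= 0.5) return 0.5
  if (tf <= 1) return 1
  return Math.ceil(tf) // whole numbers above 1
}

// e.g. 0.7071067811865476 -> 1, instead of an 18-character float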
I have a simple unit test on the serialiser, but haven't tested it on any real datasets.
@kig what about using client-side archiving, something like this: http://pieroxy.net/blog/pages/lz-string/index.html? Also, I noticed this on the https://github.com/weixsong/elasticlunr.js page: "The index size is about half size of lunr.js index file."
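lz-string's API makes that straightforward to try; a sketch of caching a serialised index in localStorage (buildIndex and the 'lunr-index' key are placeholders):

var cached = localStorage.getItem('lunr-index')
var idx

if (cached) {
  // Decompress and load instead of rebuilding from scratch.
  idx = lunr.Index.load(JSON.parse(LZString.decompressFromUTF16(cached)))
} else {
  idx = buildIndex() // however the index is normally built

  // compressToUTF16 produces a string that is safe for localStorage.
  localStorage.setItem('lunr-index',
    LZString.compressToUTF16(JSON.stringify(idx)))
}

Note that localStorage quotas are typically around 5-10 MB per origin, so this only helps once the compressed index fits within that.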
You can play with a live demo at https://tabletree.io - I ended up rolling a simplified search optimized for speed and index wire size. The index is 1MB gzipped and the searches take <20ms, so you get nice live updates as you type.
Client-side archiving might get you a 10-20% wire size improvement compared to gzip, but you end up spending more time decompressing the data, so I figured I'd just make the source data more gzip-friendly. Building the search index on the client side is a non-starter as the corpus is a hundred megs in size; I need to build a compressed index off-line and transfer just that to the client.
@kig that is an awesome live demo, reminds me of this scene in Jurassic Park!
I found that the size of the JSON is not the issue, but rather the depth, which causes the call stack size to be exceeded. In my case I had a few very long words that were indexed. It was caused by blindly using $('body').text(), which does not add a white space between text nodes and can produce very long "words" that never get tokenized. These result in a very deeply nested object.
I went with something like the following in our indexing script (which uses cheerio to index static html files):
$('body *').map(function () {
  return $(this).text();
}).get().join(' ');
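A complementary guard is to drop pathological tokens before they reach the store at all; assuming the pipeline convention that returning undefined discards a token, a small filter like this would work (the length limit is arbitrary):

// Discard any "word" longer than a sensible maximum, so a missing
// space between text nodes can't create an extremely deep branch.
var maxTokenLength = 30

var longTokenFilter = function (token) {
  if (token.length <= maxTokenLength) return token
}

idx.pipeline.add(longTokenFilter)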
Three years later there is a major new version of Lunr, 2.x, which has a much simpler serialised structure. Upgrading to this latest version should hopefully alleviate any issues with trying to serialise the large, deeply nested data structure that Lunr 0.x and 1.x use.
Hello there,
I'm building a browser-based application (without a webserver). I'll have multiple documents to index (in my case, about 700) and their combined size is about 5 MB.
For now, my application rebuilds the index every time the webpage is displayed, but it takes about 40 seconds to build the full index, which is very long.
I want to store the index in the browser, but I can't even serialize it: when I try to JSON.stringify() my index I get "RangeError: Maximum call stack size exceeded". After a short investigation, the problem seems to come from JSON.stringify(), which can't handle such a big object. In fact, after a heap snapshot, it turns out that my TokenStore is about ~70 MB! I will not even try to store such a big object in a browser, but maybe I'm looking at this the wrong way. Maybe there is a way to store the index without the TokenStore and rebuild it somehow?
Thank you!