Since implementing the updated Japanese tokenizer (see https://github.com/riboseinc/lunr-repro/ & the accepted PR) and setting the max n-gram size to 6, the search index JSON is quite large (300 MB). A max n-gram of 6 is not enough either, because it limits search terms to 6 characters.
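For illustration only (this is not the tokenizer from the PR, just a minimal character n-gram sketch), the size/length trade-off comes from emitting every substring up to the maximum length:

```js
// Minimal character n-gram sketch (illustration only, not the actual tokenizer).
function ngrams(text, maxN) {
  const tokens = [];
  for (let i = 0; i < text.length; i++) {
    for (let n = 1; n <= maxN && i + n <= text.length; n++) {
      tokens.push(text.slice(i, i + n));
    }
  }
  return tokens;
}

// 6 characters of source text already produce 21 tokens at maxN = 6,
// and a 7-character query can never equal any indexed token, which is
// why search terms end up capped at maxN characters.
console.log(ngrams('日本語処理系', 6).length); // 21
```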
The following options are to be investigated:
Implementing multi-language Lunr search. If this works, we could perhaps combine tokenizers, so that in a user’s search query Japanese words would have a lower maximum length, while English words (which are typically longer) could stay long. That way we wouldn’t need a high max n-gram size in the Japanese tokenizer, and could save on index size.
I am going to try this myself and see whether it works out of the box & what it does to the search index JSON size.
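As a starting point, a minimal sketch of the out-of-the-box multi-language setup based on the lunr-languages README (document shape and field names here are made up; note that the stock lunr.ja plugin uses TinySegmenter, so our n-gram tokenizer would still need to be swapped in on the Japanese side):

```js
const lunr = require('lunr');
require('lunr-languages/lunr.stemmer.support')(lunr);
require('lunr-languages/tinyseg')(lunr);   // segmenter used by lunr.ja
require('lunr-languages/lunr.ja')(lunr);
require('lunr-languages/lunr.multi')(lunr);

const idx = lunr(function () {
  this.use(lunr.multiLanguage('en', 'ja'));  // combined en + ja pipelines
  this.ref('id');
  this.field('title');
  this.field('body');
  // Field names and document shape are placeholders.
  this.add({ id: '1', title: 'Search index size', body: '検索インデックスのサイズ' });
});

// Queries in either language go through the combined pipeline, e.g.:
// idx.search('検索');
// idx.search('index');
```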
Compressing the search index JSON. This should yield a massive benefit.
I am going to try this myself. (GitHub Pages doesn’t seem to support serving pre-gzipped files. However, we can still serve the index compressed and unpack it when loading it into Lunr.)
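A rough sketch of the client side, assuming the index is published pre-gzipped at a made-up path and using pako for decompression:

```js
const lunr = require('lunr');
const pako = require('pako');   // assumed to be bundled for the browser

async function loadIndex(url) {
  const resp = await fetch(url);
  const compressed = new Uint8Array(await resp.arrayBuffer());
  const json = pako.ungzip(compressed, { to: 'string' });  // gunzip to a JSON string
  return lunr.Index.load(JSON.parse(json));                // rehydrate the prebuilt index
}

// '/search-index.json.gz' is a placeholder path.
loadIndex('/search-index.json.gz').then((idx) => {
  console.log(idx.search('検索'));
});
```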
Splitting the search index: first loading a smaller high-priority index that searches high-priority content only, and then loading another index that searches the full contents. Using multiple Lunr indices under the hood while providing a single unified search should be easy (rough sketch below). The user wouldn’t even notice, except that for a while after the initial load search results would not be complete (we can show “Loading…” to make it clear that more results could be found if you wait).
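The unified-search part could be as simple as the following (index loading order and field scheme are assumptions; scores from separately built indices are not strictly comparable, but should be close enough for a first pass):

```js
const lunr = require('lunr');

const indices = [];   // filled as each serialized index finishes loading

function addIndex(serialized) {
  indices.push(lunr.Index.load(serialized));
}

function unifiedSearch(query) {
  const best = new Map();   // ref -> best-scoring result seen so far
  for (const idx of indices) {
    for (const result of idx.search(query)) {
      const prev = best.get(result.ref);
      if (!prev || result.score > prev.score) {
        best.set(result.ref, result);
      }
    }
  }
  return [...best.values()].sort((a, b) => b.score - a.score);
}
```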
A sub-option here is to initialize the search index incrementally.
We could use line-based JSON, read search index data line by line as it is received from the server (I believe nothing special is needed server-side), and continuously refresh the Lunr instance with a more complete index on the client side until the data is fully fetched (sketched below).
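A sketch of how the streaming could look with fetch, assuming one JSON document per line and made-up paths/field names; the Lunr index is simply rebuilt from scratch every few hundred documents, since a prebuilt index can’t be extended in place:

```js
const lunr = require('lunr');   // assumed to be bundled for the browser

// Rebuild a fresh index from everything received so far (field names are placeholders).
function buildIndex(docs) {
  return lunr(function () {
    this.ref('id');
    this.field('title');
    this.field('body');
    docs.forEach((doc) => this.add(doc));
  });
}

async function streamIndex(url, onIndexReady) {
  const resp = await fetch(url);
  const reader = resp.body.getReader();
  const decoder = new TextDecoder();
  const docs = [];
  let buffer = '';
  let lastBuilt = 0;

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split('\n');
    buffer = lines.pop();   // keep any trailing partial line for the next chunk
    for (const line of lines) {
      if (line.trim()) docs.push(JSON.parse(line));
    }
    if (docs.length - lastBuilt >= 500) {   // refresh the index every ~500 docs
      onIndexReady(buildIndex(docs));
      lastBuilt = docs.length;
    }
  }
  onIndexReady(buildIndex(docs));   // final, complete index
}

// '/search-docs.ndjson' is a placeholder path.
// streamIndex('/search-docs.ndjson', (idx) => { currentIndex = idx; });
```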
A more thorough solution along these lines is to not pre-build the index on the server at all, but instead load the entire site content (compressed) on the client and build the index in the browser incrementally (done with IndexedDB & a service worker, this wouldn’t slow down the browser), possibly with the help of, say, https://www.npmjs.com/package/lunr-mutable-indexes (since Lunr out of the box does not allow modifying a built index). Since we want to load site content this way regardless (to provide a faster browsing experience), I imagine we’ll use this approach eventually.
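If lunr-mutable-indexes works the way its README suggests (mirroring Lunr’s builder API but allowing add/update/remove after the build), the incremental client-side build could look roughly like this; the IndexedDB/service-worker caching layer is omitted:

```js
const lunrMutable = require('lunr-mutable-indexes');   // assumption: README-style API

const index = lunrMutable(function () {
  this.ref('id');
  this.field('title');
  this.field('body');
});

// Called as batches of site content arrive (e.g. from the streaming fetch above).
function addBatch(docs) {
  for (const doc of docs) {
    index.add(doc);   // mutable indices allow adding documents after the build
  }
}

// The index is searchable at any point; results just become more complete over time.
// index.search('検索');
```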
I am going to try this myself.
Some other search index optimization at the tokenizer level that I don’t know about.
This is where a bit of external help could get us faster results (cc @ronaldtse)