Lunr 2.1 update & configurable indexed and template fields

nkuehn commented 7 years ago

Hi @slashdotdash, starting from my issue #117 I found that migrating to lunr 2.0.x can reduce the index size far more substantially, so I started developing and testing a bit.
Here's the result. It required some reshuffling in the indexer: the new lunr index is immutable, so we need to go through the Builder class if we don't want to hold the complete documents in Ruby memory in parallel.

The index size for the site I test with dropped from 1.2 MB to 490 KB (which means we can now consider using lunr at all).
No client JS changes are necessary as far as I can see, but projects that keep a hard copy of lunr.js will break because the index format is completely incompatible. So it should be worth bumping to a higher version number and adding some more documentation. That is a project release policy topic, so I leave it up to you; for that reason I haven't changed the gem description etc.

To everyone: please test. This project has no built-in tests, so we need plenty of feedback from real-world sites.

slashdotdash commented 7 years ago

Awesome work @nkuehn. Thanks for taking the time to get this done. I'll test it out locally and get it merged in and released.

nkuehn commented 7 years ago

Better stop testing in depth: while researching the weird behavior of the results I stumbled over https://github.com/olivernn/lunr.js/issues/263 and learned there that field boosting was moved to query time in the new index structure.

It's an improvement, but it means no field is boosted at all now; in particular, the title no longer plays any role.

It looks like I'll have to touch the client code, too. Alternatively, we could wait for lunr.js 2.1, which introduces per-field vectors in the index and behaves pretty well without any boosting at all.
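
For readers unfamiliar with query-time boosting, here is a minimal sketch of what a client-side change could look like. It only builds a query string in the lunr 2.x search syntax (`field:term^boost`). The helper name `buildBoostedQuery`, the field names, and the boost values are assumptions for illustration and are not taken from this pull request.

```javascript
// Hypothetical helper (not part of the plugin): turns free-text input into a
// lunr 2.x query string that weights some fields more than others.
function buildBoostedQuery(input, boosts) {
  // Replace characters that have a special meaning in the lunr query syntax,
  // then split into terms. Lowercasing is a harmless normalization here.
  const terms = input
    .toLowerCase()
    .replace(/[:^~*+-]/g, " ")
    .split(/\s+/)
    .filter(Boolean);

  const clauses = [];
  for (const term of terms) {
    for (const [field, boost] of Object.entries(boosts)) {
      // "field:term" restricts the term to one field, "^n" applies the boost.
      clauses.push(boost === 1 ? `${field}:${term}` : `${field}:${term}^${boost}`);
    }
  }
  return clauses.join(" ");
}

// Illustrative field names and boost values (assumptions).
const query = buildBoostedQuery("Jekyll search", { title: 10, body: 1 });
console.log(query);
// prints "title:jekyll^10 body:jekyll title:search^10 body:search"
```

The resulting string would be passed to the `search` method of a loaded lunr index. Whether boosting is needed at all depends on the content, as the later comments in this thread suggest.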

nkuehn commented 7 years ago

@slashdotdash Lunr.js has released 2.1 to production now ( https://github.com/olivernn/lunr.js/commit/cf96052b82426eb84302b64797e498aabb681e59 ) and I am pretty happy with the results I see, especially in comparison to the 2.0.x series.

So I'm skipping 2.0.x altogether for this upgrade. I have it in use on our site and am happy with the stability, but haven't actively played with other configurations (lack of available sites to test with).

The key changes here are:

(caused by lunr 2.1): Index-time field boosting is no longer available. Boosting can be done at query time in the search expression language, but I have not felt any need for it in my index, since the fields are balanced well automatically in the new index structure.
(indirectly caused by lunr 2.1): Since lunr 2.1 keeps a term vector per field, the number and choice of indexed fields becomes more important than other factors. So I introduced the ability to configure which of the built-in fields or any other custom front matter fields are indexed at all. The defaults should be backwards-compatible.
If a user has overridden the field boosting in _config.yml (possible but not configured), that _config.yml is no longer compatible, since the structure is now an array instead of an object (to reflect the lunr.js 2.1 config).
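
To illustrate the object-to-array change described above, here is a rough sketch of how an old-style field setting could be converted. The function name `migrateFields`, the field names, and the boost values are hypothetical; the actual key names and defaults used by the plugin are not shown in this thread.

```javascript
// Hypothetical migration helper: the old setting is assumed to map field
// names to boost values, the new one is assumed to be a plain list of names.
function migrateFields(fields) {
  if (Array.isArray(fields)) {
    // Already in the new array form: return a copy unchanged.
    return fields.slice();
  }
  if (fields && typeof fields === "object") {
    // Old object form: keep the field names, drop the boost values.
    return Object.keys(fields);
  }
  throw new TypeError("fields must be an array or an object");
}

// Illustrative old-style setting (names and values are assumptions).
const oldStyle = { title: 10, categories: 20, tags: 20, body: 1 };
console.log(migrateFields(oldStyle));
// prints [ 'title', 'categories', 'tags', 'body' ]
```

The boost values are discarded on purpose in this sketch, because index-time boosting no longer exists in lunr 2.1.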

The client side was deliberately kept compatible, although it would be nice to support more of the query language features and to make it easier to integrate into a bigger site as a JS dependency.

slashdotdash commented 7 years ago

@nkuehn Sorry I haven't made time to merge your pull request.

Would you be interested in becoming a contributor to this project so that you can merge PRs yourself?

nkuehn commented 6 years ago

Hi @slashdotdash, it's pretty certain now that I won't be contributing any more, so that won't help. I have not been able to tune the underlying lunr.js well enough to match the use case and content size of the site that's driving my motivation here. We have now switched to a SaaS search offering.

slashdotdash commented 6 years ago

@nkuehn No problem, thanks for letting me know.

slashdotdash / jekyll-lunr-js-search

Lunr 2.1 update & configurable indexed and template fields #118