Pipeline performance improvement for large fields

olivernn / lunr.js

A bit like Solr, but much smaller and not as bright

http://lunrjs.com

MIT License

8.96k stars 548 forks source link

Pipeline performance improvement for large fields #329

Closed andymcm closed 6 years ago

andymcm commented 6 years ago

First, thanks for this library, I'm finding it extremely useful.

I was seeing poor performance adding very large fields to an index, which I tracked down to an array concat operation in the pipeline. Here are some sample numbers from my environment:

Standard Lunr:

Add field with 4000 tokens: 250 ms
Add field with 30000 tokens: 8000 ms

With this fix:

Add field with 4000 tokens: 20 ms
Add field with 30000 tokens: 200 ms

Regards

Andy

olivernn commented 6 years ago

Interesting find! You could probably squeeze a bit more performance out of this method by replacing the Array#reduce with a plain for loop too.

Adding a simple performance test for this one case would be a great help too. You can take a look at the lunr.Vector performance test for an example. To run the performance test run make perf/pipeline_perf.js.

andymcm commented 6 years ago

Thanks for the feedback. I've changed to a for loop and added a pipeline performance test. Feel free to edit further while merging.

olivernn commented 6 years ago

Thanks for updating this, I've pulled these changes into master, will put together a release shortly.

olivernn commented 6 years ago

2.1.6 contains your performance fix.

MarcLoupias commented 6 years ago

Hello !

Does this improvement have an impact on the search speed or only on the index building speed ?

I use GitbookIO/plugin-lunr in a gitbook used as a reminder. I have a lot of markdown files there and some of them have lots of tokens. I wonder if the gitbook team merge GitbookIO/plugin-lunr#6 with lunr@2.1.6 i will see only a performance on the gitbook building process or if i will get also search execution performance boost.

andymcm commented 6 years ago

I think technically the change affects both search and indexing pipelines, but in practice it will only make a difference to indexing, not search (unless you have very unusual search behaviour)

olivernn commented 6 years ago

It certainly won't make search any slower! Although a lunr.Pipeline is part of the search path it usually processes many fewer tokens than when building indexes. Any performance improvements from the pipeline are still likely to be dwarfed by the time spent actually performing the search. As ever though, running benchmarks relevant to your use case/data set is the only way to know for sure.