olivernn / lunr.js

A bit like Solr, but much smaller and not as bright
http://lunrjs.com
MIT License
8.96k stars 548 forks source link

Improve performance of stop-words filter #169

Closed bvaughn closed 9 years ago

bvaughn commented 9 years ago

The stop words filter is currently implemented using a SortedSet and indexOf method. A faster method would be to use an Object and its keys directly.

Here is a JS perf demonstrating the performance gains: http://jsperf.com/lunr-stop-words-object

bvaughn commented 9 years ago

I think additional gains could be had by converting other SortedSet instances to objects (e.g. lunr.Index's tokens). I'd be happy to submit a PR with this broader change, along with some JS perf benchmarks, if you'd like.

olivernn commented 9 years ago

Agreed, there is no real need for the stopWordFilter to be using a SortedSet. I think the reason it has lasted here so long is that, in my testing, it doesn't play that large of a part in the overall performance.

That said, I mostly test in Chrome and, according to your benchmarks, there isn't a massive difference there, Firefox on the other hand has such a crazy difference its definitely worth doing.

I assume you mean the idx.corpusTokens when you mention the other places using SortedSet? In this case we really do need the sorted property of SortedSet as this is important in building the vectors used for the TF-IDF scoring. However, if it is possible to achieve the same thing with a data structure that gives O(1) access etc that would be awesome.

bvaughn commented 9 years ago

Ah, good point on the sorting. I'm just starting to implement tf-idf in Js Search so maybe once I finish I'll have a clearer idea of how best to approach it. I think it should still be possible though!

I posted a PR with the stop word changes. If I find a working solution for the token set change I'll post a follow-up CL. :)

On Monday, August 3, 2015, Oliver Nightingale notifications@github.com wrote:

Agreed, there is no real need for the stopWordFilter to be using a SortedSet. I think the reason it has lasted here so long is that, in my testing, it doesn't play that large of a part in the overall performance.

That said, I mostly test in Chrome and, according to your benchmarks, there isn't a massive difference there, Firefox on the other hand has such a crazy difference its definitely worth doing.

I assume you mean the idx.corpusTokens when you mention the other places using SortedSet? In this case we really do need the sorted property of SortedSet as this is important in building the vectors used for the TF-IDF scoring. However, if it is possible to achieve the same thing with a data structure that gives O(1) access etc that would be awesome.

— Reply to this email directly or view it on GitHub https://github.com/olivernn/lunr.js/issues/169#issuecomment-127365699.

bvaughn commented 9 years ago

Gentle ping :)

olivernn commented 9 years ago

Sorry for the delay, hectic week at work. I've pushed your changes now in the latest release.