Closed bvaughn closed 9 years ago
I think additional gains could be had by converting other SortedSet
instances to objects (e.g. lunr.Index
's tokens). I'd be happy to submit a PR with this broader change, along with some JS perf benchmarks, if you'd like.
Agreed, there is no real need for the stopWordFilter
to be using a SortedSet
. I think the reason it has lasted here so long is that, in my testing, it doesn't play that large of a part in the overall performance.
That said, I mostly test in Chrome and, according to your benchmarks, there isn't a massive difference there, Firefox on the other hand has such a crazy difference its definitely worth doing.
I assume you mean the idx.corpusTokens
when you mention the other places using SortedSet
? In this case we really do need the sorted property of SortedSet
as this is important in building the vectors used for the TF-IDF scoring. However, if it is possible to achieve the same thing with a data structure that gives O(1) access etc that would be awesome.
Ah, good point on the sorting. I'm just starting to implement tf-idf in Js Search so maybe once I finish I'll have a clearer idea of how best to approach it. I think it should still be possible though!
I posted a PR with the stop word changes. If I find a working solution for the token set change I'll post a follow-up CL. :)
On Monday, August 3, 2015, Oliver Nightingale notifications@github.com wrote:
Agreed, there is no real need for the stopWordFilter to be using a SortedSet. I think the reason it has lasted here so long is that, in my testing, it doesn't play that large of a part in the overall performance.
That said, I mostly test in Chrome and, according to your benchmarks, there isn't a massive difference there, Firefox on the other hand has such a crazy difference its definitely worth doing.
I assume you mean the idx.corpusTokens when you mention the other places using SortedSet? In this case we really do need the sorted property of SortedSet as this is important in building the vectors used for the TF-IDF scoring. However, if it is possible to achieve the same thing with a data structure that gives O(1) access etc that would be awesome.
— Reply to this email directly or view it on GitHub https://github.com/olivernn/lunr.js/issues/169#issuecomment-127365699.
Gentle ping :)
Sorry for the delay, hectic week at work. I've pushed your changes now in the latest release.
The stop words filter is currently implemented using a
SortedSet
andindexOf
method. A faster method would be to use anObject
and its keys directly.Here is a JS perf demonstrating the performance gains: http://jsperf.com/lunr-stop-words-object