Convert stop words filter from SortedSet/Array to Object

bvaughn commented 9 years ago

The stop words filter is currently implemented using a SortedSet and indexOf method. A faster method would be to use an Object and its keys directly.

Here is a JS perf demonstrating the performance gains: http://jsperf.com/lunr-stop-words-object

Existing stop words tests pass after this change.

olivernn commented 9 years ago

This looks like a good start, though there are a couple of things that need to be handled with care. JavaScript objects are not great at being hashes. In this case anything on Object.prototype would be considered a stop word.

The best way to fix this is by creating the stopWords object with Object.create(null), however, as far as I'm aware this is not possible to shim.

The second way, and this is the way the idfCache is implemented, is to prefix all keys with some character to avoid any clashes with builtins. I've updated the benchmarks with a change like this. Might be worth running them a few times as Chrome seems to be slower now?

I think it might also be worth having this available as a function that creates stopWordFilters, just to make it simple for end users to provide their own stop word lists easily, something like:

lunr.stopWordFilter = lunr.stopWordFilterBuilder(["long", "list", "of", "stop", "words"])

What do you think?

bvaughn commented 9 years ago

Ah, that's a fair point.

I wonder if the reason for the slow-down in the updated example is the string concatenation?

I've added a newer version of the benchmark that removes the string concatenation while (I think) also addressing your Object.prototype concerns. Chrome performance is back to being in the green.

bvaughn commented 9 years ago

PS. If you are satisfied with the updated approach (demonstrated in the JS Perf test) I'll make the corresponding changes to my PR. :)

bvaughn commented 9 years ago

By the way, I've updated my pull request and added an additional test to cover the Object.prototype concerns you mentioned.

olivernn commented 9 years ago

Your changes are in the latest release of lunr. I added a small change to quote all the keys of the object. Thanks again for your work on this one!

olivernn / lunr.js

Convert stop words filter from SortedSet/Array to Object #170