Closed bvaughn closed 9 years ago
This looks like a good start, though there are a couple of things that need to be handled with care. JavaScript objects are not great at being hashes. In this case anything on Object.prototype
would be considered a stop word.
The best way to fix this is by creating the stopWords
object with Object.create(null)
, however, as far as I'm aware this is not possible to shim.
The second way, and this is the way the idfCache
is implemented, is to prefix all keys with some character to avoid any clashes with builtins. I've updated the benchmarks with a change like this. Might be worth running them a few times as Chrome seems to be slower now?
I think it might also be worth having this available as a function that creates stopWordFilters, just to make it simple for end users to provide their own stop word lists easily, something like:
lunr.stopWordFilter = lunr.stopWordFilterBuilder(["long", "list", "of", "stop", "words"])
What do you think?
Ah, that's a fair point.
I wonder if the reason for the slow-down in the updated example is the string concatenation?
I've added a newer version of the benchmark that removes the string concatenation while (I think) also addressing your Object.prototype
concerns. Chrome performance is back to being in the green.
PS. If you are satisfied with the updated approach (demonstrated in the JS Perf test) I'll make the corresponding changes to my PR. :)
By the way, I've updated my pull request and added an additional test to cover the Object.prototype
concerns you mentioned.
Your changes are in the latest release of lunr. I added a small change to quote all the keys of the object. Thanks again for your work on this one!
The stop words filter is currently implemented using a
SortedSet
andindexOf
method. A faster method would be to use anObject
and its keys directly.Here is a JS perf demonstrating the performance gains: http://jsperf.com/lunr-stop-words-object
Existing stop words tests pass after this change.