satchmorun closed this 9 years ago
Thanks a lot for this change. In isolation it is a massive improvement; the gains are more modest when the function is used in searching and adding, but it still moves the performance in the right direction! I've cut a new release, 0.5.8, that includes this change. I also added a performance test for this function so that it's easier to show future performance improvements/regressions.
@olivernn
Very cool, thanks. And absolutely, the gains are modest when searching and adding, but measurable, which is nice.
One thing I wanted to suggest with respect to tokenizing: what do you think about changing the line:
```js
return obj.toString().trim().toLowerCase().split(/[\s\-]+/)
```
to be
```js
return obj.toString().toLowerCase().match(/\w+/g)
```
in order to strip out more "special" characters. For example, in a node REPL:
```js
> obj = '[the](quick brown! fox)'
'[the](quick brown! fox)'
> obj.toString().trim().toLowerCase().split(/[\s\-]+/)
[ '[the](quick', 'brown!', 'fox)' ]
> obj.toString().toLowerCase().match(/\w+/g)
[ 'the', 'quick', 'brown', 'fox' ]
```
I know that I can set up my own pipeline, but it feels like this is going to produce better results for the stopword/stemming steps down the line, so it might be worth making it official.
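One thing worth noting about the `match` approach: unlike `split`, `String.prototype.match` returns `null` when there are no matches (e.g. an empty or punctuation-only string), so a tokenizer built on it would need a guard. A minimal sketch of the proposed version with that guard (hypothetical code, not lunr's actual implementation; the `tokenize` name is assumed):

```js
// Hypothetical sketch of the match-based tokenizer suggested above.
// match(/\w+/g) returns null (not []) when no word characters are
// found, so fall back to an empty array in that case.
function tokenize (obj) {
  if (obj == null) return []
  return obj.toString().toLowerCase().match(/\w+/g) || []
}

tokenize('[the](quick brown! fox)') // [ 'the', 'quick', 'brown', 'fox' ]
tokenize('!!!')                     // []
```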
I know that the CONTRIBUTING doc says to open an issue first, but this was a small change (and just a refactor). Please feel free to close if it's not in bounds.
Made a couple of changes to the tokenizer that make the implementation shorter and faster:

- use `String.prototype.trim`
- call `toLowerCase` once instead of repeatedly in `map`

Removing the calls to `map` and `filter` results in pretty big speed improvements (2X-4X), and probably (haven't tested it) lower memory usage. http://jsperf.com/stringsplitting