Tokenizer performance improvement

satchmorun commented 9 years ago

I know that the CONTRIBUTING doc says to open an issue first, but this was a small change (and just a refactor). Please feel free to close if it's not inbounds.

Made a couple of changes to the tokenizer that make the implementation shorter and faster:

Use String.prototype.trim
Change the regex so that filtering out empty matches is not necessary
Call toLowerCase once instead of repeatedly in map

Removing the calls to map and filter results in pretty big speed improvements (2X-4X), and probably (haven't tested it) lower memory usage.

http://jsperf.com/stringsplitting

olivernn commented 9 years ago

Thanks a lot for this change. In isolation it is a massive improvement, the gains are more modest when the function is used in searches and adding but it still moves the performance in the right direction! I've cut a new release, 0.5.8 that includes this change. I also added a performance test for this function so that its easier to show future performance improvements/regressions.

satchmorun commented 9 years ago

@olivernn

Very cool, thanks. And absolutely, the gains are modest when searching and adding, but measurable, which is nice.

One thing I wanted to suggest with respect to tokenizing: what do you think about changing the line:

return obj.toString().trim().toLowerCase().split(/[\s\-]+/)

to be

return obj.toString().toLowerCase().match(/\w+/g)

in order to strip out more "special" characters.

> obj = '[the](quick brown! fox)'
'[the](quick brown! fox)'
> obj.toString().trim().toLowerCase().split(/[\s\-]+/)
[ '[the](quick', 'brown!', 'fox)' ]
> obj.toString().toLowerCase().match(/\w+/g)
[ 'the',  'quick',  'brown',  'fox' ]

I know that I can set up my own pipeline, but it feels like this is going to produce better results for the stopword/stemming steps down the line, so might be worth making official.

olivernn / lunr.js

Tokenizer performance improvement #139