Option to set the delimiter characters for tokenizing

olivernn / lunr.js

A bit like Solr, but much smaller and not as bright

http://lunrjs.com

MIT License


Option to set the delimiter characters for tokenizing #102

Closed: selfawaresoup closed this issue 9 years ago

selfawaresoup commented 10 years ago

Currently, the default tokenizer only splits tokens on whitespace, using /\s+/. To use other delimiting characters (e.g. "-"), I have to set a completely new tokenizer function that is mostly a copy of the original one.

There should be an option to set the delimiter characters or maybe to pass in a callback that does the splitting.
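A minimal sketch of what such an option could look like, written as a standalone function rather than as lunr's actual code. The name `makeTokenizer` and its `splitter` parameter are hypothetical; the trimming and lowercasing steps are assumptions about what a tokenizer of this kind typically does.

```javascript
// Hypothetical helper, not part of lunr.js: builds a tokenizer whose
// splitting step is configurable.
function makeTokenizer(splitter) {
  // Accept either a delimiter (RegExp or string) or a callback that
  // takes the input string and returns an array of raw tokens.
  var split = typeof splitter === 'function'
    ? splitter
    : function (str) { return str.split(splitter); };

  return function tokenize(input) {
    if (input === null || input === undefined) return [];
    return split(String(input).trim())
      // Drop empty strings produced by leading or repeated delimiters.
      .filter(function (token) { return token.length > 0; })
      .map(function (token) { return token.toLowerCase(); });
  };
}

// Split on whitespace, hyphens, and underscores.
var tokenize = makeTokenizer(/[\s\-_]+/);
var tokens = tokenize('Foo_bar baz-qux');

// The same factory with a callback that does the splitting.
var commaTokenize = makeTokenizer(function (str) { return str.split(','); });
var commaTokens = commaTokenize('Red,Green,,Blue');
```

With either form, only the splitting step changes; the rest of the tokenizer does not need to be copied.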

jgehrcke commented 9 years ago

Agreed. I, for example, also want strings split on underscores and had to copy the original tokenize function to do that. Making the split expression (currently /(?:\s+|\-)/ by default) configurable might be an option.

olivernn commented 9 years ago

In the latest version (0.6.0), I've added a property, lunr.tokenizer.seperator, that can be overridden to change the regex used to split a string into tokens.
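The override pattern can be sketched roughly as follows. To keep the example runnable without lunr installed, it uses a stand-in `tokenizer` function written only for illustration. Its body is an assumption and may differ from lunr's real tokenizer. Only the property name `seperator` and the default split expression come from this thread.

```javascript
// Stand-in for lunr.tokenizer, for illustration only; the real
// implementation in lunr.js may differ.
function tokenizer(str) {
  if (str === null || str === undefined) return [];
  return String(str).trim().toLowerCase()
    // The split expression is read from a property on the function,
    // so callers can replace it without copying the tokenizer.
    .split(tokenizer.seperator)
    .filter(function (token) { return token.length > 0; });
}

// Default taken from the split expression quoted earlier in this thread.
tokenizer.seperator = /(?:\s+|\-)/;
var before = tokenizer('snake_case-name here');

// Override the property so underscores also act as delimiters.
tokenizer.seperator = /[\s\-_]+/;
var after = tokenizer('snake_case-name here');
```

With lunr itself, the equivalent step would presumably be a single assignment to `lunr.tokenizer.seperator` before documents are added to the index.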