olivernn / lunr.js

A bit like Solr, but much smaller and not as bright
http://lunrjs.com
MIT License

Split on hyphens as well as whitespace #98

Closed nolanlawson closed 10 years ago

nolanlawson commented 10 years ago

Example:

Take the New York-San Francisco flight.

"york-san" isn't a word, so it shouldn't be output by the tokenizer.

For precedent, Lucene's standard tokenizer also splits on hyphens, although it leaves product numbers like 31-5-6 intact. That exception is not implemented here because of its complexity.
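To illustrate the intended behavior, here is a minimal sketch of a tokenizer that treats runs of whitespace and hyphens as separators. The function name `splitOnWhitespaceAndHyphens` is hypothetical and the regular expression is an assumption for illustration; this is not necessarily the pattern used in the lunr source.

```javascript
// Hypothetical helper for illustration only; not lunr's actual tokenizer.
// Lowercases the input and splits on any run of whitespace and/or hyphens.
function splitOnWhitespaceAndHyphens(text) {
  return text
    .trim()               // drop leading and trailing whitespace
    .toLowerCase()
    .split(/[\s-]+/);     // one or more whitespace or hyphen characters
}

const tokens = splitOnWhitespaceAndHyphens(
  "Take the New York-San Francisco flight."
);

// "york" and "san" are now separate tokens; "york-san" no longer appears.
console.log(tokens);
```

Note that this sketch does not strip punctuation, so the final token keeps its trailing period, and it makes no attempt at the product-number exception mentioned above.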

Prompted by https://github.com/nolanlawson/pouchdb-quick-search/issues/3.

olivernn commented 10 years ago

Thanks!

nolanlawson commented 10 years ago

No prob!

debug64 commented 10 years ago

I think the solution has a problem when indexing text like "A - B": it produces an empty token, which results in a broken (not searchable) index.
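A rough sketch of how an empty token can arise, assuming a split pattern in which a single hyphen is its own separator (`/\s+|-/`). The pattern and the function names `naiveSplit` and `safeSplit` are assumptions for illustration; the actual lunr code and its fix may differ.

```javascript
// Hypothetical reproduction; the regex is an assumed pattern, not
// necessarily the one used in lunr.
function naiveSplit(text) {
  // Whitespace runs and single hyphens are separate alternatives, so in
  // " - " the space, the hyphen, and the next space each match on their own,
  // leaving empty strings between the matches.
  return text.toLowerCase().split(/\s+|-/);
}

// One possible guard: discard empty tokens before they reach the index.
function safeSplit(text) {
  return naiveSplit(text).filter(function (token) {
    return token.length > 0;
  });
}

const broken = naiveSplit("A - B"); // contains empty strings
const fixed = safeSplit("A - B");   // only "a" and "b" remain

console.log(broken, fixed);
```

Filtering after the split keeps the separator pattern simple; an alternative would be to match runs of separators in one go, for example with a character class such as `/[\s-]+/`.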

nolanlawson commented 10 years ago

Yup, you're right, that's a bug. Will fix.

olivernn commented 10 years ago

Version 0.5.5 includes the fix for this issue.