Consider tweaking tokenization

hoelzro / tw-full-text-search

Full text search plugin for TiddlyWiki powered by lunr.js

https://hoelz.ro/files/fts.html


Consider tweaking tokenization #33

Open hoelzro opened 5 years ago

hoelzro commented 5 years ago

I might want to tweak how the plugin uses lunr to tokenize text, so that hyphenated words and URLs are handled better.

Examples:

https://github.com/hoelzro/tw-full-text-search/issues/5#issuecomment-441724510

https://github.com/hoelzro/tw-full-text-search/blob/9d383acb81c61608b7b5cbc61ced161ce4d54c95/tests/test-simple.js#L269-L281

hoelzro commented 5 years ago

Another interesting data point for this: "e-mail" is treated as two tokens ("e" and "mail"), which kind of screws things up.
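To make the split concrete, here is a minimal sketch that does not call lunr at all. It assumes the plugin relies on lunr's default separator, which as far as I know is `/[\s\-]+/` (whitespace and hyphens); `splitLikeDefault` is a hypothetical helper written only for illustration.

```javascript
// Assumption: this regex mirrors lunr's default lunr.tokenizer.separator.
const defaultSeparator = /[\s\-]+/;

// Hypothetical helper: lowercase the text and split it the way the
// default separator would, dropping any empty pieces.
function splitLikeDefault(text) {
  return text
    .toLowerCase()
    .split(defaultSeparator)
    .filter((token) => token.length > 0);
}

const emailTokens = splitLikeDefault("Send me an e-mail");
console.log(emailTokens); // "e" and "mail" come out as separate tokens
```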

Would it make sense just to use a tokenizer that recognizes certain exceptions (like "e-mail") and certain special prefixes (like "re-")? As an alternative to a list of exceptions, we could have logic that bundles prefixes of a certain length (e.g. 3 or fewer characters) with the word that follows them.
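A rough sketch of what that could look like, as plain JavaScript rather than anything wired into lunr or the plugin. Everything named here (`EXCEPTIONS`, `MAX_PREFIX_LENGTH`, `tokenize`) is hypothetical, and the sketch ignores punctuation and URLs entirely.

```javascript
// Hypothetical list of words that should never be split on hyphens.
const EXCEPTIONS = new Set(["e-mail"]);

// Hypothetical cutoff: a hyphen-separated piece this short is treated
// as a prefix and kept attached to the piece that follows it.
const MAX_PREFIX_LENGTH = 3;

function tokenize(text) {
  const tokens = [];
  for (const word of text.toLowerCase().split(/\s+/)) {
    if (word === "") continue;

    // Known exceptions are emitted whole.
    if (EXCEPTIONS.has(word)) {
      tokens.push(word);
      continue;
    }

    const parts = word.split("-").filter((part) => part.length > 0);
    let i = 0;
    while (i < parts.length) {
      const isShortPrefix =
        parts[i].length <= MAX_PREFIX_LENGTH && i + 1 < parts.length;
      if (isShortPrefix) {
        // Bundle the short prefix with the next piece, e.g. "re-send".
        tokens.push(parts[i] + "-" + parts[i + 1]);
        i += 2;
      } else {
        tokens.push(parts[i]);
        i += 1;
      }
    }
  }
  return tokens;
}

console.log(tokenize("Please re-send the e-mail about full-text search"));
```

With a cutoff of 3, "re-send" stays together while "full-text" is still split, since "full" is four characters long. The prefix rule would also catch "e-mail" on its own, so the exception list only matters for words whose first part is longer than the cutoff.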