We should not tokenize on connector punctuation

projectEndings / staticSearch

A codebase to support a pure JSON search engine requiring no backend for any XHTML5 document collection

https://endings.uvic.ca/staticSearch/docs/index.html

Mozilla Public License 2.0


We should not tokenize on connector punctuation #319

Open martindholmes opened 3 weeks ago

martindholmes commented 3 weeks ago

Connector Punctuation (https://www.unicode.org/charts/script/chart_Punctuation-Connector.html) is a Unicode category: a small collection of punctuation-like symbols used as connectors within words. We should include this character class in the regular expression $alphanumeric, which is how we designate words for the purposes of tokenization, since these characters should not be treated as word boundaries. Perl includes this class in its \w class (https://perldoc.perl.org/perlrecharclass#Word-characters).
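To make the proposal concrete, here is a rough sketch in Python rather than in the project's own code. It assumes that $alphanumeric currently covers letters, combining marks and numbers, which may not match the real definition; the function names and the keep_connectors flag are hypothetical.

```python
import unicodedata


def is_word_char(ch, keep_connectors=True):
    """Return True if ch should count as part of a word when tokenizing."""
    cat = unicodedata.category(ch)
    # Assumption: letters (L*), marks (M*) and numbers (N*) stand in for
    # whatever $alphanumeric matches today.
    if cat[0] in "LMN":
        return True
    # Pc is the Connector Punctuation category (e.g. U+203F UNDERTIE).
    return keep_connectors and cat == "Pc"


def tokenize(text, keep_connectors=True):
    """Split text into runs of word characters."""
    tokens, current = [], []
    for ch in text:
        if is_word_char(ch, keep_connectors):
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


sample = "ab\u203fcd ef"
print(tokenize(sample, keep_connectors=False))  # connector acts as a boundary
print(tokenize(sample, keep_connectors=True))   # connector stays inside the word
```

With the connector class excluded, the made-up string "ab‿cd" splits into two tokens; with it included, it survives as one.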

martindholmes commented 3 weeks ago

We have a real use case here: the nxaʔamxčín language uses U+203F UNDERTIE as a character inside words.

joeytakeda commented 2 weeks ago

The only small hang-up I have about this is the underscore, which is part of \p{Pc}. I can imagine this having unintended side effects: if you had the phrase "project_filename" in your documents, it would previously have been treated as two terms to stem ("project" and "filename"), but now it would be treated as a single one. That's probably better in most cases, but what do you think?

martindholmes commented 2 weeks ago

I think if you're using an underscore in a context like that, you're probably expecting the concatenation to be treated as a single object. I can't think of any context where I would want a search indexer to split an identifier on underscores.
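The underscore case discussed above can be checked with a small Python sketch. This is only an illustration, not the project's tokenizer: Python's `re` module has no \p{Pc}, so the class is built from `unicodedata`, and `[^\W_]` is used as a rough stand-in for the existing $alphanumeric expression (an assumption). The names `old_word` and `new_word` are hypothetical.

```python
import re
import unicodedata

# Collect every code point whose general category is Pc (Connector Punctuation).
connectors = "".join(
    chr(cp) for cp in range(0x110000) if unicodedata.category(chr(cp)) == "Pc"
)

# List the members; the underscore (U+005F LOW LINE) is among them.
for ch in connectors:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# Assumed "before": word characters without the underscore.
old_word = re.compile(r"[^\W_]+")
# Assumed "after": the same class plus every Pc character.
new_word = re.compile(r"(?:[^\W_]|[" + re.escape(connectors) + r"])+")

phrase = "project_filename"
print(old_word.findall(phrase))  # underscore splits the identifier
print(new_word.findall(phrase))  # identifier is kept whole
```

Under these assumptions, "project_filename" yields two terms before the change and one term after it, which is the behaviour change being weighed here.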