Open martindholmes opened 3 weeks ago
We have a real use-case here: the nxaʔamxčín language uses U+203F UNDERTIE as a character inside words.
The only small hang-up I have about this is underscore, which is part of \p{Pc}. I can imagine this may have unintended side-effects: if you had the phrase "project_filename" in your documents, then previously this would have been treated as two terms to stem ("project" and "filename"), but now it would be treated as a single one. That's probably better in most cases, but what do you think?
I think if you're using an underscore in a context like that, you're probably expecting the concatenation to be treated as a single object. I can't think of any context where I would want a search indexer to split an identifier on underscores.
The Unicode category of Connector Punctuation (https://www.unicode.org/charts/script/chart_Punctuation-Connector.html) is a small collection of punctuation-like symbols which are used as connectors within words. We should include this character class in the regular expression $alphanumeric, which is how we designate words for the purposes of tokenization, since these characters should not be taken as word-boundaries. Perl includes this class in its \w class (https://perldoc.perl.org/perlrecharclass#Word-characters).
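To make the effect of the proposal concrete, here is a rough sketch in Python of a tokenizer with and without Connector Punctuation in the word class. It is not the project's code: the actual definition of $alphanumeric is not shown in this thread, so the sketch assumes it covers letters, numbers and combining marks, and the names `is_word_char` and `tokenize` are invented for illustration.

```python
import unicodedata


def is_word_char(ch: str, include_connectors: bool) -> bool:
    """Return True if ch counts as part of a word.

    Assumption: the base word class is letters (L*), numbers (N*) and
    combining marks (M*). The real $alphanumeric may differ.
    """
    category = unicodedata.category(ch)
    if category[0] in ("L", "N", "M"):
        return True
    # Pc is the Unicode general category for Connector Punctuation,
    # which includes "_" (U+005F) and the undertie (U+203F).
    return include_connectors and category == "Pc"


def tokenize(text: str, include_connectors: bool = True) -> list[str]:
    """Split text into tokens made of consecutive word characters."""
    tokens: list[str] = []
    current: list[str] = []
    for ch in text:
        if is_word_char(ch, include_connectors):
            current.append(ch)
        elif current:
            # A non-word character ends the current token.
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


# Hypothetical sample word joined by an undertie (not a real lexical item).
sample = "ab\u203fcd"

print(tokenize(sample, include_connectors=False))
print(tokenize(sample, include_connectors=True))
print(tokenize("project_filename", include_connectors=False))
print(tokenize("project_filename", include_connectors=True))
```

With connectors excluded, both the undertie word and "project_filename" split into two tokens. With connectors included, each stays a single token, which is the underscore side-effect discussed above.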