rspeer / wordfreq

Access a database of word frequencies, in various natural languages.

Tokenize by graphemes, not codepoints #50

Closed rspeer closed 7 years ago

rspeer commented 7 years ago

After examining weird edge cases involving multi-codepoint emoji, I realized why we needed the workaround in our TOKEN_RE that keeps diacritical marks from being split off from words. Both problems have the same cause: we had regular expressions that could stop matching in the middle of a grapheme.

A sufficiently powerful regex engine can match a whole grapheme with the \X escape. If the only way we advance through a string is by matching graphemes, a match can never end in the middle of one, which avoids these edge cases.

The result is an expression that's simpler at its core, and also tokenizes flags and David Bowie correctly. 👨‍🎤
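
To make the failure mode concrete, here is a minimal sketch using only the Python standard library. It is not the TOKEN_RE from this PR. The stdlib `re` module has no \X, so `rough_graphemes` is a hypothetical helper that only approximates grapheme clustering (combining marks, ZWJ sequences, skin-tone modifiers, and flag pairs), not the full Unicode segmentation rules.

```python
import re
import unicodedata

ZWJ = "\u200d"  # zero-width joiner, used to glue emoji into one sequence


def is_regional_indicator(ch):
    # Flags are encoded as pairs of regional indicator codepoints.
    return "\U0001F1E6" <= ch <= "\U0001F1FF"


def is_extender(ch):
    # Codepoints that should stay attached to whatever precedes them:
    # combining marks (Mn, Mc, Me, which includes variation selectors),
    # the ZWJ itself, and emoji skin-tone modifiers.
    return (
        unicodedata.category(ch).startswith("M")
        or ch == ZWJ
        or "\U0001F3FB" <= ch <= "\U0001F3FF"
    )


def rough_graphemes(text):
    """Split text into approximate graphemes (hypothetical helper).

    Assumption: anything following a ZWJ is joined to the previous cluster,
    which is looser than the real rule for emoji sequences.
    """
    clusters = []
    for ch in text:
        if clusters:
            prev = clusters[-1]
            # A regional indicator pairs up with exactly one preceding one.
            ri_pair = (
                is_regional_indicator(ch)
                and len(prev) == 1
                and is_regional_indicator(prev)
            )
            if is_extender(ch) or prev.endswith(ZWJ) or ri_pair:
                clusters[-1] = prev + ch
                continue
        clusters.append(ch)
    return clusters


decomposed = "cafe\u0301"  # "café" with a separate combining acute accent

# A codepoint-based word pattern stops matching in the middle of the last
# grapheme, so the accent is silently dropped from the token.
codepoint_tokens = re.findall(r"\w+", decomposed)

# Advancing one grapheme at a time keeps the accent with its base letter.
grapheme_units = rough_graphemes(decomposed)

flags = rough_graphemes("\U0001F1EB\U0001F1F7\U0001F1E9\U0001F1EA")  # two flags
singer = rough_graphemes("\U0001F468\u200d\U0001F3A4")  # man + ZWJ + microphone
```

With an engine that supports \X, the helper above would not be needed; the point is only that matching whole graphemes leaves no place for a token boundary to fall inside one.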