Tokenize by graphemes, not codepoints

After examining weird edge cases involving multi-codepoint emoji, I realized why we had to have the workaround in our TOKEN_RE about not splitting off diacritical marks from words. And the reason for both is the same: we had regular expressions that could stop matching in the middle of a grapheme.

A sufficiently powerful regex engine can match a grapheme using the \X symbol. If we make sure that the only way we advance through a string is by matching graphemes, we can avoid these edge cases.

The result is an expression that's simpler at its core, and also tokenizes flags and David Bowie correctly. 👨‍🎤

rspeer / wordfreq

Tokenize by graphemes, not codepoints #50