After examining weird edge cases involving multi-codepoint emoji, I realized why we had to have the workaround in our TOKEN_RE about not splitting off diacritical marks from words. And the reason for both is the same: we had regular expressions that could stop matching in the middle of a grapheme.
A sufficiently powerful regex engine can match a grapheme using the \X symbol. If we make sure that the only way we advance through a string is by matching graphemes, we can avoid these edge cases.
The result is an expression that's simpler at its core, and also tokenizes flags and David Bowie correctly. 👨🎤
After examining weird edge cases involving multi-codepoint emoji, I realized why we had to have the workaround in our
TOKEN_RE
about not splitting off diacritical marks from words. And the reason for both is the same: we had regular expressions that could stop matching in the middle of a grapheme.A sufficiently powerful regex engine can match a grapheme using the
\X
symbol. If we make sure that the only way we advance through a string is by matching graphemes, we can avoid these edge cases.The result is an expression that's simpler at its core, and also tokenizes flags and David Bowie correctly. 👨🎤