mookid / diffr

Yet another diff highlighting tool
MIT License
572 stars 22 forks

feat: Improve UTF-8 support by keeping graphemes together when tokenizing #100

Closed bojidar-bg closed 7 months ago

bojidar-bg commented 7 months ago

This uses bstr's grapheme_indices method to break the input text into graphemes, and then classifies each grapheme as word/space/other, much as before.
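To illustrate the classification step described above, here is a minimal sketch using only the standard library. It assumes the input has already been split into grapheme clusters (for instance by bstr's grapheme_indices) and covers only what happens afterwards. The names `Kind`, `classify`, and `tokenize` are hypothetical, and the rule of merging runs of word or space graphemes into one token is an assumption for illustration, not necessarily what diffr does.

```rust
/// Coarse token classes. Illustrative names, not diffr's actual types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Kind {
    Word,
    Space,
    Other,
}

/// Classify one grapheme cluster by its first scalar value, so a base
/// letter followed by combining marks still counts as a word grapheme.
pub fn classify(grapheme: &str) -> Kind {
    match grapheme.chars().next() {
        Some(' ') | Some('\t') => Kind::Space,
        Some(c) if c.is_alphanumeric() || c == '_' => Kind::Word,
        _ => Kind::Other,
    }
}

/// Merge consecutive word graphemes (and consecutive space graphemes) into
/// single tokens; every other grapheme becomes a token of its own.
/// Assumption: `graphemes` was produced by a real grapheme segmenter.
pub fn tokenize(graphemes: &[&str]) -> Vec<(Kind, String)> {
    let mut tokens: Vec<(Kind, String)> = Vec::new();
    for &g in graphemes {
        let kind = classify(g);
        // Extend the previous token only if it has the same mergeable kind.
        let extend = kind != Kind::Other
            && matches!(tokens.last(), Some((last, _)) if *last == kind);
        if extend {
            if let Some((_, text)) = tokens.last_mut() {
                text.push_str(g);
            }
        } else {
            tokens.push((kind, g.to_string()));
        }
    }
    tokens
}
```

Because whole graphemes are passed around, a cluster such as "e" plus U+0301 is never split between two tokens, and each emoji ends up as its own `Other` token.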

Some screenshots (Konsole):

| Test | Before | After |
| --- | --- | --- |
| UTF-8: Cyrillic :sparkles:, Emojis :sparkles:, Combining characters :sparkles: | image | image (ignore spurious tab on 4) |
| Windows-1251 :sweat_smile: (diff \| diffr \| iconv) | image | image |

bojidar-bg commented 7 months ago

Regression should be fixed! Let me know if I missed any other ones :sweat_smile:

As for is_whitespace: I'd rather err on the safe side and limit it to ASCII ' ' and '\t' for now. Personally, I'd prefer to see fancy whitespace like nbsp highlighted individually when it changes, but if someone comes along with a use case where that's distracting, it wouldn't be too hard to treat Unicode whitespace characters as regular spaces. (:
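For comparison, the two options could look roughly like this. It is a sketch only: both function names are hypothetical, and the permissive variant is an assumption about how a Unicode-aware check might be written, not code from this PR.

```rust
/// Conservative check: only ASCII space and tab count as whitespace,
/// so a changed nbsp is still highlighted on its own.
pub fn is_whitespace_ascii(grapheme: &str) -> bool {
    grapheme == " " || grapheme == "\t"
}

/// Hypothetical permissive variant: a non-empty grapheme made up solely of
/// characters with the Unicode White_Space property (nbsp included).
pub fn is_whitespace_unicode(grapheme: &str) -> bool {
    !grapheme.is_empty() && grapheme.chars().all(char::is_whitespace)
}
```

With the first function, U+00A0 is treated like any other non-space grapheme; with the second, it would blend in with regular spaces.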