"Edit | Fill Paragraph" (Ctrl+J) seems to work with UTF-8 byte count, not unicode code point count

lersek commented 6 months ago

(My xnedit was built at commit b1c382a, so it's not exactly recent.)

I've noticed that, using the LC_CTYPE=hu_HU.UTF-8 locale, "Fill Paragraph" creates much shorter lines when I work with Hungarian text than when I work with English text.

When entering purely English / ASCII text, paragraphs actually get filled to the line length that I configure in Preferences | Wrap | Wrap Margin, but for Hungarian text, lines end up much shorter than desired. In the former case, basically all the code points I enter are pure ASCII (single-byte encoding in UTF-8), whereas in Hungarian, we use a bunch of accented characters such as in ÁRVÍZTŰRŐ TÜKÖRFÚRÓGÉP, which translate to multibyte encodings in UTF-8. Hence my suspicion that paragraph filling works with a byte count, rather than a Unicode character count.
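To illustrate the difference between the two counts, here is a minimal sketch. It is not taken from the xnedit sources; the function name `utf8_codepoint_count` is made up for this example, and it assumes the input is valid UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not an xnedit function.
 * Counts Unicode code points in a NUL-terminated UTF-8 string.
 * Assumes valid UTF-8: every byte that is not a continuation byte
 * (bit pattern 10xxxxxx) starts a new code point. */
size_t utf8_codepoint_count(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            n++;
        }
    }
    return n;
}
```

For the sample phrase ÁRVÍZTŰRŐ TÜKÖRFÚRÓGÉP in precomposed form, `strlen()` reports 31 bytes while this function reports 22 code points. A filler that measures lines with `strlen()` would therefore treat the phrase as 9 columns wider than it appears on screen.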

(Well, if we consider "combining characters", then the raw Unicode character count would also be incorrect, I guess, so probably some Unicode normalization like NFC should occur, before counting characters? Either way, Unicode character count would be a better approximation than just byte count.)
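As a rough sketch of the combining-character point: instead of normalizing to NFC, a counter could simply skip combining marks. The code below is an assumption-laden illustration, not xnedit code. The names `utf8_decode` and `utf8_column_count` are hypothetical, only the Combining Diacritical Marks block (U+0300 to U+036F) is treated as zero-width, and overlong encodings are not rejected. A real implementation would more likely rely on something like POSIX `wcwidth()` or proper grapheme cluster segmentation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one code point from the UTF-8 bytes at s.
 * Stores the code point in *cp and returns the number of bytes consumed.
 * Malformed input yields U+FFFD and consumes a single byte. */
size_t utf8_decode(const unsigned char *s, uint32_t *cp)
{
    size_t len, i;
    uint32_t c;

    if (s[0] < 0x80)                { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; c = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; c = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; c = s[0] & 0x07; }
    else                            { *cp = 0xFFFD; return 1; }

    for (i = 1; i < len; i++) {
        /* A NUL terminator is not a continuation byte, so this check
         * also stops us from reading past the end of the string. */
        if ((s[i] & 0xC0) != 0x80) { *cp = 0xFFFD; return 1; }
        c = (c << 6) | (uint32_t)(s[i] & 0x3F);
    }
    *cp = c;
    return len;
}

/* Hypothetical helper: approximate the display columns of a string by
 * counting code points, except those in U+0300..U+036F, which attach
 * to the preceding character and take no column of their own. */
size_t utf8_column_count(const char *str)
{
    const unsigned char *s = (const unsigned char *)str;
    size_t cols = 0;

    while (*s != '\0') {
        uint32_t cp;
        s += utf8_decode(s, &cp);
        if (cp < 0x0300 || cp > 0x036F) {
            cols++;
        }
    }
    return cols;
}
```

With this approach, "A" followed by U+0301 (COMBINING ACUTE ACCENT) and the precomposed "Á" both count as one column, even though they are two code points and one code point respectively.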

Thanks for considering!

unixwork commented 6 months ago

Hi, thanks for the bug report. Your observation is correct: the number of bytes was counted, not the number of characters or code points.

With the latest commit, code points are counted. It needs some more testing, but it should work better now.

lersek commented 6 months ago

I've rebuilt xnedit at a8069293092a (latest master), which contains the fix (72931277da55). It seems to be working well on my end. Thank you for the quick correction!

unixwork / xnedit, issue #136