gphalkes / tilde

The Tilde text editor

Non-consensual normalization of Unicode text into NFC #75

Open szc126 opened 3 years ago

szc126 commented 3 years ago

What OS are you using (including version): Arch Linux 5.11.13-arch1-1

What terminal were you running Tilde on when you ran into the issue: alacritty 0.7.2 (5ac8060b)

Please paste the result of running 'tilde --version':

Tilde version 1.1.2
Copyright (c) 2011-2018 G.P. Halkes
Tilde is licensed under the GNU General Public License version 3
Library versions:
  libt3config 1.0.0
  libt3highlight 0.5.0
  libt3key (through libt3widget) 0.2.10
  libt3widget 1.2.0
  libt3window 0.4.0
  libtranscript 0.3.3
  libunistring 9.10.?

Please describe the problem. If possible, include the action you performed, the expected result and the actual result in the description.

Performed:

Expected:

Actual:

gphalkes commented 3 years ago

This is a bit of a tricky issue. Not on a technical level, because there is an explicit normalization step in Tilde which can easily be removed. However, most users will not be aware of the concept of normalization and other technicalities of the Unicode standard, and want their software to "just do the right thing". What the right thing is, is of course debatable and will vary according to circumstances. For most users, writing normalized text is likely the right thing to do when saving new files. They don't intentionally use separate letters and accents, or separate Hangul jamo, etc. to have them written as separate code points. They might enter their text that way because it is convenient, but probably just want things to work as if they are one character afterwards.
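To make the "separate letters and accents" case concrete, here is a small Python sketch using the standard `unicodedata` module. It only illustrates what NFC does to such input; it is not Tilde's code, and the variable names are made up for the example.

```python
import unicodedata

# "é" entered as a base letter plus a combining acute accent (two code points).
decomposed_e = "e\u0301"
# The same character as a single precomposed code point.
composed_e = "\u00e9"

# Hangul entered as three conjoining jamo (HIEUH + A + NIEUN) ...
jamo = "\u1112\u1161\u11ab"
# ... and the single precomposed syllable U+D55C they compose to.
syllable = "\ud55c"

# NFC turns the multi-code-point input into the precomposed form.
nfc_e = unicodedata.normalize("NFC", decomposed_e)
nfc_hangul = unicodedata.normalize("NFC", jamo)

print(len(decomposed_e), len(nfc_e))    # code point counts before and after NFC
print(len(jamo), len(nfc_hangul))
```

The two spellings usually render identically, which is why a user who typed the decomposed form would typically not notice (or mind) the conversion, while a user who typed it deliberately would.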

Some extra info from the Unicode Consortium FAQ on Normalization (highlighting mine):

Q: Which forms of normalization should I support?

A: The choice of which to use depends on the particular program or system. NFC is the best form for general text, since it is more compatible with strings converted from legacy encodings. NFKC is the preferred form for identifiers, especially where there are security concerns (see UTR #36). NFD and NFKD are most useful for internal processing.
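As a rough illustration of how the four forms named in the FAQ answer differ, the following Python sketch normalizes one short sample string with each of them. The sample string and the `code_points` helper are chosen just for this example.

```python
import unicodedata

def code_points(text: str) -> list:
    """Return the code points of text as 'U+XXXX' strings."""
    return [f"U+{ord(ch):04X}" for ch in text]

# Precomposed "é" (U+00E9) followed by the "fi" ligature (U+FB01).
sample = "\u00e9\ufb01"

# Normalize the sample with each of the four forms.
forms = {
    form: code_points(unicodedata.normalize(form, sample))
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

for form, cps in forms.items():
    print(form, cps)
```

The canonical forms (NFC, NFD) leave the ligature alone and differ only in whether "é" is composed or decomposed; the compatibility forms (NFKC, NFKD) additionally replace the ligature with plain "f" and "i", which changes the text more visibly.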

When saving an existing text file, rewriting to normalized form is less likely to be the correct course of action if the non-NFC form is already used in the existing text.

Instead of blindly converting to NFC, though, Tilde should probably check whether the text is NFC (also on reading) and give the user the appropriate information and a question on saving. If the text was already non-NFC on reading, it shouldn't be converted on writing, and no prompt is needed; a "normalize to NFC" action in one of the menus would let the user force the conversion. If the text added by the user is the only non-NFC part of the text (a new file, or a file which was NFC on load but is no longer NFC on save), then a prompt should ask the user what they want.

I think the above is the most appropriate, but it may be a while before I have time to implement this.
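The save policy proposed in this comment could be sketched roughly as follows. This is Python for brevity, not Tilde's actual C++ code; `SaveAction`, `is_nfc` and `decide_on_save` are hypothetical names, and treating a new file as "NFC on load" is an assumption made for the sketch. `unicodedata.is_normalized` requires Python 3.8 or later.

```python
import unicodedata
from enum import Enum

class SaveAction(Enum):
    WRITE_AS_IS = "write the buffer unchanged"
    PROMPT_USER = "ask whether to normalize to NFC"

def is_nfc(text: str) -> bool:
    """Check whether text is already in NFC, without converting it."""
    return unicodedata.is_normalized("NFC", text)

def decide_on_save(was_nfc_on_load: bool, current_text: str) -> SaveAction:
    """Decide what to do on save, following the policy proposed above.

    was_nfc_on_load is recorded when the file is read; for a new file it
    is assumed to be True, since an empty buffer is trivially NFC.
    """
    if not was_nfc_on_load:
        # The file was already non-NFC: keep it that way, no prompt.
        # (A separate menu action could still force normalization.)
        return SaveAction.WRITE_AS_IS
    if is_nfc(current_text):
        # Nothing to normalize.
        return SaveAction.WRITE_AS_IS
    # Only the user's edits introduced non-NFC text: ask what they want.
    return SaveAction.PROMPT_USER

# Example: a file that was NFC on load, edited to contain "e" + U+0301.
action = decide_on_save(True, "cafe\u0301")
print(action.value)
```

Recording the NFC status at load time is what lets the save path distinguish "the file was always like this" from "the user's edits introduced decomposed text", which is the distinction the proposal hinges on.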