unicode-rs / unicode-width

Displayed width of Unicode characters and strings according to UAX#11 rules.
https://unicode-rs.github.io/unicode-width

Fixes to characters considered zero-width #34

Closed Jules-Bertholet closed 9 months ago

Jules-Bertholet commented 9 months ago

These characters are supposed to be completely invisible and ignored by rendering unless specially supported: https://www.unicode.org/faq/unsup_char.html#3. Characters affected


Edit: Now also fixes #26


Edit 2: I've marked Prepended_Concatenation_Marks as not zero-width. This matches the behavior of glibc.


Edit 3: I've given U+115F HANGUL CHOSEONG FILLER back its width 2, because it's expected to be combined with other jamo to form a width-2 syllable block.
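
The U+115F exception can be pictured as an override that sits alongside the general zero-width rule. What follows is only a minimal sketch, not the crate's actual lookup code: `sketch_width` is a hypothetical name, the zero-width arm lists just three default-ignorable code points chosen for illustration, and every other character is assumed to have width 1 to keep the example short.

```rust
/// Hypothetical helper for illustration only; not part of unicode-width's API.
fn sketch_width(c: char) -> usize {
    match c {
        // Exception from Edit 3: U+115F HANGUL CHOSEONG FILLER keeps width 2,
        // because it is expected to lead a width-2 syllable block.
        '\u{115F}' => 2,
        // Illustrative subset of default-ignorable code points treated as zero-width:
        // ZERO WIDTH SPACE, WORD JOINER, ZERO WIDTH NO-BREAK SPACE.
        '\u{200B}' | '\u{2060}' | '\u{FEFF}' => 0,
        // Assumption for this sketch: everything else is width 1.
        _ => 1,
    }
}
```

With this sketch, `sketch_width('\u{115F}')` yields 2 while `sketch_width('\u{200B}')` yields 0.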

Manishearth commented 9 months ago

This implements a specific standardized algorithm as documented in the readme. This rule around Default_Ignorable doesn't seem to be documented there.

This is not a general purpose terminal width library.

Jules-Bertholet commented 9 months ago

This library already differs from UAX 11 in several important ways:

Manishearth commented 9 months ago

Hmm, yeah. I didn't originally write this, but I would like for the code to follow the spec first and offer these things as settings.

Jules-Bertholet commented 9 months ago

UAX 11 doesn't really give a full, exact algorithm for getting a "width value" for a string. For example, control codes aren't even mentioned, nor are line breaks etc. So I think referring to other parts of the Unicode standard as well makes perfect sense.
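
To illustrate the gap: a per-character lookup can decline to answer for control codes, but a string-level function has to pick some policy for them. The sketch below is hedged accordingly. `ControlPolicy`, `char_width`, and `str_width` are hypothetical names, and all non-control characters are assumed to have width 1, which is not true of real text.

```rust
/// Hypothetical policy switch for how a string-level width treats control codes.
#[derive(Clone, Copy)]
enum ControlPolicy {
    Zero,
    One,
}

/// Hypothetical per-character width: `None` for control codes, since UAX 11
/// does not say what they should measure. Assumes width 1 for everything else.
fn char_width(c: char) -> Option<usize> {
    if c.is_control() {
        None
    } else {
        Some(1)
    }
}

/// Sums per-character widths, substituting the chosen policy for control codes.
fn str_width(s: &str, policy: ControlPolicy) -> usize {
    let fallback = match policy {
        ControlPolicy::Zero => 0,
        ControlPolicy::One => 1,
    };
    s.chars().map(|c| char_width(c).unwrap_or(fallback)).sum()
}
```

For the input `"a\tb"`, the two policies give different totals (2 versus 3), which is exactly the kind of decision UAX 11 leaves open.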

Manishearth commented 9 months ago

Hmm that's fair. Will review later.

I would ideally like someone to take a holistic view of this crate, compare with the specs, and document/add options. Haven't had time to do this myself ever since I inherited it.

Jules-Bertholet commented 9 months ago

> I would ideally like someone to take a holistic view of this crate, compare with the specs, and document

I've added some comments throughout the code, but here is a summary of the current rules (with this PR's changes included):

What's still not handled, or could be handled differently:

Jules-Bertholet commented 9 months ago

The "Measurement" section of https://www.unicode.org/L2/L2023/23107-terminal-suppt.pdf highlights more problem cases.

Jules-Bertholet commented 9 months ago

See also https://www.unicode.org/versions/Unicode15.1.0/ch05.pdf#G40095, "Characters Ignored for Display"

Jules-Bertholet commented 9 months ago

Unicode §5.21 - "Characters Ignored for Display" - "Default Ignorable Code Point" says:

A small number of format characters (General_Category = Cf) are also not given the Default_Ignorable_Code_Point property. This may surprise implementers, who often assume that all format characters are generally ignored in fallback display. The exact list of these exceptional format characters can be found in the Unicode Character Database. There are, however, three important sets of such format characters to note:

  • prepended concatenation marks
  • interlinear annotation characters
  • Egyptian hieroglyph format controls

The prepended concatenation marks always have a visible display. See “Prepended Concatenation Marks” in Section 23.2, Layout Controls for more discussion of the use and display of these signs.

The other two notable sets of format characters that exceptionally are not ignored in fallback display consist of the interlinear annotation characters, U+FFF9 INTERLINEAR ANNOTATION ANCHOR through U+FFFB INTERLINEAR ANNOTATION TERMINATOR, and the Egyptian hieroglyph format controls, U+13430 EGYPTIAN HIEROGLYPH VERTICAL JOINER through U+1343F EGYPTIAN HIEROGLYPH END WALLED ENCLOSURE. These characters should have a visible glyph display for fallback rendering, because if they are not displayed, it is too easy to misread the resulting displayed text. See “Annotation Characters” in Section 23.8, Specials, as well as Section 11.4, Egyptian Hieroglyphs for more discussion of the use and display of these characters.

Software that interprets the interlinear annotation characters should probably do that processing before passing to unicode-width, so assuming fallback rendering makes sense in that case. Additionally, next to no implementations currently support the Egyptian hieroglyph format controls, so assuming a fallback rendering probably makes sense there as well. Therefore, I've marked both as non-zero width.
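
The two ranges named in the quoted passage can be captured in a small predicate. This is a sketch under stated assumptions rather than the code in this PR: `is_visible_format_control` is a hypothetical name, and the prepended concatenation marks are left out because the passage does not list their code points.

```rust
/// Hypothetical predicate for illustration; not part of unicode-width's API.
/// Returns true for the format characters that the quoted passage says should
/// have a visible glyph in fallback rendering.
fn is_visible_format_control(c: char) -> bool {
    matches!(
        c,
        // U+FFF9 INTERLINEAR ANNOTATION ANCHOR
        // through U+FFFB INTERLINEAR ANNOTATION TERMINATOR
        '\u{FFF9}'..='\u{FFFB}'
        // U+13430 EGYPTIAN HIEROGLYPH VERTICAL JOINER
        // through U+1343F EGYPTIAN HIEROGLYPH END WALLED ENCLOSURE
        | '\u{13430}'..='\u{1343F}'
    )
}
```

A width lookup could consult a check like this before applying a blanket "format characters are zero-width" rule, so that these code points end up with a non-zero width.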