unicode-rs / unicode-width

Displayed width of Unicode characters and strings according to UAX#11 rules.
https://unicode-rs.github.io/unicode-width

Make characters with `Line_Break=Ambiguous` ambiguous #61

Closed Jules-Bertholet closed 3 weeks ago

Jules-Bertholet commented 5 months ago

UAX 14:

As originally defined, the line break class AI contained all characters with East_Asian_Width value A (ambiguous width) that would otherwise be AL in this classification. For more information on East_Asian_Width and how to resolve it, see Unicode Standard Annex #11, East Asian Width [UAX11].

The original definition included many Latin, Greek, and Cyrillic characters. These characters are now classified by default as AL because use of the AL line breaking class better corresponds to modern practice. Where strict compatibility with older legacy implementations is desired, some of these characters need to be treated as ID in certain contexts. This can be done by always tailoring them to ID or by continuing to classify them as AI and resolving them to ID where required.

As part of the same revision, the set of ambiguous characters has been extended to completely encompass the enclosed alphanumeric characters used for numbering of bullets.

As updated, the AI line breaking class includes all characters with East Asian Width A that are outside the range U+0000..U+1FFF, plus the following characters:

24EA CIRCLED DIGIT ZERO
2780..2793 DINGBAT CIRCLED SANS-SERIF DIGIT ONE..DINGBAT NEGATIVE CIRCLED SANS-SERIF NUMBER TEN
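
For illustration only, the updated rule quoted above can be sketched in Rust roughly as follows. The names `is_explicit_ai_addition` and `is_line_break_ai` are hypothetical and are not part of the `unicode-width` API. The East Asian Width lookup is assumed to come from elsewhere, so it is passed in as a plain `bool`.

```rust
/// Hypothetical helper (not part of unicode-width): true for the code points
/// that the quoted UAX #14 text lists explicitly, in addition to the
/// East Asian Width rule.
fn is_explicit_ai_addition(c: char) -> bool {
    // U+24EA CIRCLED DIGIT ZERO, plus U+2780..U+2793 (dingbat circled digits).
    matches!(c, '\u{24EA}' | '\u{2780}'..='\u{2793}')
}

/// Hypothetical helper (not part of unicode-width): applies the updated
/// definition of line break class AI as quoted above.
///
/// `east_asian_ambiguous` is an assumption of this sketch: the caller must
/// supply whether `c` has East_Asian_Width=A, e.g. from its own table.
fn is_line_break_ai(c: char, east_asian_ambiguous: bool) -> bool {
    // "Outside the range U+0000..U+1FFF" means strictly above U+1FFF.
    let outside_low_range = (c as u32) > 0x1FFF;
    (east_asian_ambiguous && outside_low_range) || is_explicit_ai_addition(c)
}
```

Under this sketch, a character such as U+00A1 passed with `east_asian_ambiguous = true` is still not AI because it lies inside U+0000..U+1FFF, while U+24EA is AI regardless of the flag the caller passes.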