unicode-rs / unicode-width

Displayed width of Unicode characters and strings according to UAX#11 rules.
https://unicode-rs.github.io/unicode-width

Hangul Jamo Extended-B should be 0-width #26

Closed ninjalj closed 9 months ago

ninjalj commented 2 years ago

https://github.com/unicode-rs/unicode-width/blob/master/scripts/unicode.py#L304 has special-casing for U+1160..U+11FF (decimal 4448..4607, what's with using decimal values anyway?) (the part of the Hangul Jamo block which contains medial/vowels/jungseong and final/trailing_consonants/jongseong Jamo), to treat it as 0-width.

The Hangul Jamo Extended-B block at U+D7B0..U+D7FF contains jungseong and jongseong for Old Korean, and should be treated the same as U+1160..U+11FF.

glibc's wcwidth() treats that block as 0 width since:


commit 6e540caa21616d5ec5511fafb22819204525138e
Author: Mike FABIAN <mfabian@redhat.com>
Date:   Tue Jun 16 08:29:40 2020 +0200

    Set width of JUNGSEONG/JONGSEONG characters from UD7B0 to UD7FB to 0 [BZ #26120]
    Reviewed-by: Carlos O'Donell <carlos@redhat.com>

diff --git a/localedata/charmaps/UTF-8 b/localedata/charmaps/UTF-8
index 14c5d4fa33..8cce47cd97 100644
--- a/localedata/charmaps/UTF-8
+++ b/localedata/charmaps/UTF-8
@@ -48920,6 +48920,8 @@ WIDTH
 <UABE8>        0
 <UABED>        0
 <UAC00>...<UD7A3>      2
+<UD7B0>...<UD7C6>      0
+<UD7CB>...<UD7FB>      0
 <UF900>...<UFA6D>      2
 <UFA70>...<UFAD9>      2
 <UFB1E>        0
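
For illustration only, the ranges discussed in this issue could be checked with something like the Rust sketch below. The names `is_zero_width_jamo` and `sketch_width` are hypothetical and are not part of unicode-width's API; the real crate generates its tables from `scripts/unicode.py`. The Extended-B sub-ranges are taken from the glibc diff above, and the width of 2 for leading Jamo (U+1100..U+115F) is an assumption of this sketch based on their East Asian Wide classification.

```rust
/// Hypothetical helper, not part of unicode-width's API.
/// Returns true for conjoining Hangul Jamo vowels (jungseong) and
/// trailing consonants (jongseong) that this issue argues are 0-width.
pub fn is_zero_width_jamo(c: char) -> bool {
    matches!(
        c,
        // Hangul Jamo block: jungseong and jongseong (already special-cased)
        '\u{1160}'..='\u{11FF}'
        // Hangul Jamo Extended-B: jungseong (range from the glibc diff)
        | '\u{D7B0}'..='\u{D7C6}'
        // Hangul Jamo Extended-B: jongseong (range from the glibc diff)
        | '\u{D7CB}'..='\u{D7FB}'
    )
}

/// Toy width function for demonstration only. It knows about Hangul and
/// nothing else: every other character is counted as width 1.
pub fn sketch_width(s: &str) -> usize {
    s.chars()
        .map(|c| {
            if is_zero_width_jamo(c) {
                0
            } else if matches!(c, '\u{1100}'..='\u{115F}' | '\u{AC00}'..='\u{D7A3}') {
                // Leading consonants (assumed wide) and precomposed syllables
                2
            } else {
                1
            }
        })
        .sum()
}
```

Under these assumptions, a leading consonant followed by an Extended-B vowel and an Extended-B trailing consonant occupies two columns in total, the same as a precomposed syllable.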

Manishearth commented 2 years ago

Ah, that's some pretty old code; this seems like a straightforward bug. I don't have time to investigate this further right now, though.