Character width seems to be "random" on complex Unicode sequences

Cl00e9ment commented 2 years ago

For the sake of clarity, here's the nomenclature that I'm using:

scalar value: A Unicode unit. Each scalar value is assigned a code point and is encoded using 1 to 4 bytes in UTF-8.
grapheme cluster: What the end user calls a character, rendered as a single glyph. A grapheme cluster is made of at least one scalar value.

Describe the bug

Some grapheme clusters are rendered in a single cell while others take multiple cells, sometimes leaving a large blank.

To Reproduce

example 1:
number of grapheme clusters: 1
number of scalar values: 2
number of cells used for rendering: 1

echo -e "0123456789\n>\u0067\u0308<"

(Screenshot from 2022-05-01 21-10-42)

This is what is expected to happen: 1 grapheme cluster = 1 cell.

example 2:
number of grapheme clusters: 1
number of scalar values: 2
number of cells used for rendering: 2

echo -e "0123456789\n>\u2600\ufe0f<"

(Screenshot from 2022-05-01 21-13-49)

This is not what I would expect (1 grapheme cluster = 1 cell), but maybe it's normal behavior. If it is, is there a set of rules that I can use to determine the number of cells that a grapheme cluster will take when rendered?

example 3:
number of grapheme clusters: 1
number of scalar values: 3
number of cells used for rendering: 4

echo -e "0123456789\n>\u1100\u1161\u11A8<"

(Screenshot from 2022-05-01 21-18-15)

This doesn't make any sense.
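For anyone who wants to double-check the scalar value counts quoted in the three examples, here is a minimal Python sketch. The helper name `describe` is made up for illustration. It relies on a Python `str` being a sequence of code points, so for well-formed text `len()` counts scalar values. Counting grapheme clusters would need a segmentation library and is not attempted here.

```python
def describe(text):
    """Return (number of scalar values, number of UTF-8 bytes) for text."""
    # For well-formed text, each element of a Python str is one scalar value.
    scalars = len(text)
    # Each scalar value takes 1 to 4 bytes once encoded as UTF-8.
    utf8_bytes = len(text.encode("utf-8"))
    return scalars, utf8_bytes


# The three sequences from the reproduction steps (without the > and < markers).
examples = {
    "example 1": "\u0067\u0308",
    "example 2": "\u2600\ufe0f",
    "example 3": "\u1100\u1161\u11A8",
}

for label, text in examples.items():
    scalars, utf8_bytes = describe(text)
    print(f"{label}: {scalars} scalar values, {utf8_bytes} UTF-8 bytes")
```

The scalar counts agree with the numbers in the report (2, 2 and 3), so the differing cell counts cannot be explained by the number of scalar values alone.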

Environment details

kitty 0.24.4 created by Kovid Goyal
Linux workstation 5.15.32-1-MANJARO #1 SMP PREEMPT Mon Mar 28 09:16:36 UTC 2022 x86_64
Manjaro Linux 5.15.32-1-MANJARO  (workstation) (/dev/tty)

DISTRIB_ID=ManjaroLinux
DISTRIB_RELEASE=21.2.6
DISTRIB_CODENAME=Qonos
DISTRIB_DESCRIPTION="Manjaro Linux"
Running under: X11
Frozen: False
Paths:
  kitty: /usr/bin/kitty
  base dir: /usr/lib/kitty
  extensions dir: /usr/lib/kitty/kitty
  system shell: /bin/zsh

Config options different from defaults:

Important environment variables seen by the kitty process:
    PATH                                /home/clement/.local/bin:/usr/local/bin:/usr/bin:/var/lib/snapd/snap/bin:/usr/local/sbin:/usr/lib/jvm/default/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl:/home/clement/.cargo/bin
    LANG                                en_US.UTF-8
    EDITOR                              nvim
    SHELL                               /bin/zsh
    DISPLAY                             :0
    USER                                clement
    XDG_MENU_PREFIX                     gnome-
    LC_ADDRESS                          fr_FR.UTF-8
    LC_NAME                             fr_FR.UTF-8
    LC_MONETARY                         fr_FR.UTF-8
    XDG_SESSION_DESKTOP                 gnome-xorg
    XDG_SESSION_TYPE                    x11
    LC_PAPER                            fr_FR.UTF-8
    XDG_CURRENT_DESKTOP                 GNOME
    XDG_SESSION_CLASS                   user
    LC_IDENTIFICATION                   fr_FR.UTF-8
    LC_TELEPHONE                        fr_FR.UTF-8
    LC_MEASUREMENT                      fr_FR.UTF-8
    XDG_RUNTIME_DIR                     /run/user/1000
    LC_TIME                             fr_FR.UTF-8
    XDG_DATA_DIRS                       /home/clement/.local/share/flatpak/exports/share:/var/lib/flatpak/exports/share:/usr/local/share/:/usr/share/:/var/lib/snapd/desktop
    LC_NUMERIC                          fr_FR.UTF-8

kovidgoyal commented 2 years ago

The width in cells of grapheme clusters comes from the Unicode standard. Your example 2 is a variation selector changing the presentation of the preceding codepoint from text to emoji. Emoji are rendered in two cells in terminals. I have no clue about Hangul, so I can't explain your last example to you. You would need to ask the makers of the Unicode standard. See gen-wcwidth.py in kitty for how the functions to determine width are generated from the standard.

Cl00e9ment commented 2 years ago

Thanks for the clarifications.

Cl00e9ment commented 2 years ago

The width in cells of grapheme clusters comes from the Unicode standard.

If I'm not mistaken, Unicode doesn't specify the width of grapheme clusters. It's up to the implementation to define this. UAX #11 differentiates narrow and wide characters in East Asian text, but that's all.

So, are you sure that the third example (see below) is an issue with the Unicode standard rather than a kitty bug?

echo -e "0123456789\n>\u1100\u1161\u11A8<"

I understand that gen-wcwidth.py assigns a cell width to each scalar value, but I don't see where this comes directly from the Unicode standard (apart from East Asian scalar values).

kovidgoyal commented 2 years ago

They are width one unless they are emoji, which are width two, or combining marks, which are width zero (with some slight subtleties; read the source of gen-wcwidth.py). It's a pretty simple rule.

Cl00e9ment commented 2 years ago

Yes, but the rules defined in gen-wcwidth.py don't seem to work everywhere. The Hangul alphabet is an example of an edge case.

Maybe Hangul initial consonants should be given a size of 2, and medial vowels as well as final consonants should be given a size of 0. That's only a suggestion, as I have no idea how Hangul works.

Cl00e9ment commented 2 years ago

OK I think that I understand what's happening.

UAX #11 gives a size of 2 to Hangul initial consonants (HIC) and a size of 1 to both Hangul medial vowels (HMV) and final consonants (HFC).

The problem is that, when one HIC is followed by one HMV and optionally one HFC, they merge together to form a single grapheme cluster. The widths are added together (2 + 1 = 3 or 2 + 1 + 1 = 4) instead of using a size of 2.
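To make the arithmetic concrete, here is a rough Python sketch. It is not kitty's code and not what gen-wcwidth.py generates. `scalar_width` is a simplified per-scalar rule built on the standard `unicodedata` module (it ignores emoji entirely), and `jamo_aware_width` is a hypothetical variant that applies the earlier suggestion of giving medial vowels and final consonants a size of 0. The `TRAILING_JAMO` range is my own assumption, covering U+1160..U+11FF of the Hangul Jamo block.

```python
import unicodedata

# Assumption for illustration: conjoining jamo that follow an initial consonant,
# i.e. medial vowels (U+1160..U+11A7) and final consonants (U+11A8..U+11FF).
TRAILING_JAMO = range(0x1160, 0x1200)


def scalar_width(ch):
    """Width of a single scalar value under a simplified rule (emoji ignored)."""
    if unicodedata.category(ch) in ("Mn", "Me"):
        return 0  # combining marks occupy no cell of their own
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2  # Wide and Fullwidth per UAX #11
    return 1


def summed_width(text):
    """Add up per-scalar widths, which is the behavior described above."""
    return sum(scalar_width(ch) for ch in text)


def jamo_aware_width(text):
    """Hypothetical variant: trailing jamo contribute 0 cells."""
    return sum(0 if ord(ch) in TRAILING_JAMO else scalar_width(ch) for ch in text)


for syllable in ("\u1100\u1161", "\u1100\u1161\u11A8"):
    print(summed_width(syllable), jamo_aware_width(syllable))
# prints "3 2" then "4 2"
```

One obvious limitation of the hypothetical variant: a medial vowel or final consonant that appears on its own, without a preceding initial consonant, would also get a size of 0, so a real rule would probably have to look at whole grapheme clusters rather than individual scalar values.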

kovidgoyal commented 2 years ago

Someone will need to codify that, then, and publish it as a standard so terminal programs can rely on it.

Cl00e9ment commented 2 years ago

I fully agree.

As a side note, this issue also affects emoji combinations that use a zero width joiner. Example with "face in clouds":

echo -e "0123456789\n>\U1F636\U1F32B\uFE0F<"
echo -e "0123456789\n>\U1F636\u200D\U1F32B\uFE0F<"

Screenshot from 2022-05-02 17-02-56
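For reference, a small Python sketch that lists the scalar values in the two sequences above. The helper name `scalar_names` is made up for illustration. It only shows that the two strings differ by the zero width joiner and says nothing about how a terminal should size them.

```python
import unicodedata

ZWJ = "\u200d"

# Same scalar values as the two echo commands, minus the > and < markers.
without_zwj = "\U0001F636\U0001F32B\uFE0F"
with_zwj = "\U0001F636\u200D\U0001F32B\uFE0F"


def scalar_names(text):
    """List each scalar value as 'U+XXXX NAME'."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]


for label, text in (("without ZWJ", without_zwj), ("with ZWJ", with_zwj)):
    print(label)
    for line in scalar_names(text):
        print("  " + line)

# Removing the joiner from the second sequence gives back the first one.
print(with_zwj.replace(ZWJ, "") == without_zwj)
```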

kovidgoyal commented 2 years ago

That is a bug; look at the open issue about it.

Cl00e9ment commented 2 years ago

You're right, the size problem with combining emojis is a duplicate of #1978.

Beyond that, though, I think the rendering issue with Hangul grapheme clusters is the same bug. Even though they aren't built with a zero width joiner like emoji combinations, the underlying problem is the same: multiple grapheme clusters that are each rendered at a specific size merge, when put together, into a single grapheme cluster that takes less space than the sum of its parts.

kovidgoyal commented 2 years ago

The difference is that for zwj + emoji there are well-defined rules accessible to me in the Unicode standard. For Hangul I have no clue. As I said, someone who understands Hangul will either need to codify those rules and publish them, or point out where in the standard they already exist in a form that can be converted to a wcswidth() implementation.

kovidgoyal / kitty

Character width seems to be "random" on complex Unicode sequences #5047