kovidgoyal / kitty

Cross-platform, fast, feature-rich, GPU based terminal
https://sw.kovidgoyal.net/kitty/
GNU General Public License v3.0

incorrect handling of CJK ambiguous width characters #6560

Closed: ctrlcctrlv closed this issue 1 year ago

ctrlcctrlv commented 1 year ago

Describe the bug

UAX №11 defines East Asian Width, or CJK Width.

The spec reads:

Ambiguous width characters are all those characters that can occur as fullwidth characters in any of a number of East Asian legacy character encodings. They have a “resolved” width of either narrow or wide depending on the context of their use. If they are not used in the context of the specific legacy encoding to which they belong, their width resolves to narrow. Otherwise, it resolves to fullwidth or halfwidth. The term context as used here includes extra information such as explicit markup, knowledge of the source code page, font information, or language and script identification. For example:

  • Greek characters resolve to narrow when used with a standard Greek font, because there is no East Asian legacy context.
  • Private-use character codes and the replacement character have ambiguous width, because they may stand in for characters of any width.
  • Ambiguous quotation marks are generally resolved to wide when they enclose and are adjacent to a wide character, and to narrow otherwise.

The East_Asian_Width property does not preserve canonical equivalence, because the base characters of canonical decompositions almost always have a different East_Asian_Width than the precomposed characters. East Asian Width is designed for use with legacy character sets so the property value is not designed to respect canonical equivalence.

Modern Rendering Practice. Modern practice is evolving toward rendering ever more of the ambiguous characters with proportionally spaced, narrow forms that rotate with the direction of writing, making a distinction within the legacy character set. In other words, context information beyond the choice of font or source character set is employed to resolve the width of the character. This annex does not attempt to track such changes in practice; therefore, the set of characters with mappings to legacy character sets that have been assigned ambiguous width constitute a superset of the set of such characters that may be rendered as wide characters in a given context. In particular, an application might find it useful to treat characters from alphabetic scripts as narrow by default. Conversely, many of the symbols in the Unicode Standard have no mappings to legacy character sets, yet they may be rendered as “wide” characters if they appear in an East Asian context. An implementation might therefore elect to treat them as ambiguous even though they are classified as neutral here.

5 Recommendations

When mapping Unicode to East Asian legacy character encodings

  • Wide Unicode characters always map to fullwidth characters.
  • Narrow (and neutral) Unicode characters always map to halfwidth characters.
  • Halfwidth Unicode characters always map to halfwidth characters.
  • Ambiguous Unicode characters always map to fullwidth characters.

Emphasis mine.
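The recommendation quoted above can be sketched as a fixed width policy. This is a minimal illustration, not kitty's implementation: `cell_width`, `text_width`, and the `ambiguous_wide` flag are hypothetical names, and the sketch ignores combining marks, control characters, and emoji sequences. It relies only on Python's standard `unicodedata.east_asian_width`.

```python
import unicodedata


def cell_width(ch: str, ambiguous_wide: bool) -> int:
    """Cells occupied by one character under a fixed policy (hypothetical helper).

    Simplification: zero-width and control characters are not handled.
    """
    eaw = unicodedata.east_asian_width(ch)
    if eaw in ("W", "F"):   # Wide and Fullwidth always take two cells
        return 2
    if eaw == "A":          # Ambiguous: the policy decides
        return 2 if ambiguous_wide else 1
    return 1                # Na, H, and N take one cell


def text_width(text: str, ambiguous_wide: bool) -> int:
    """Sum of per-character widths; no context is consulted."""
    return sum(cell_width(ch, ambiguous_wide) for ch in text)


if __name__ == "__main__":
    sample = "コピペ★"
    print(text_width(sample, ambiguous_wide=False))  # ambiguous treated as narrow
    print(text_width(sample, ambiguous_wide=True))   # ambiguous treated as fullwidth
```

With the string from the reproduction steps, the two policies disagree by exactly one cell, which is the mismatch this report is about.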

To Reproduce

Steps to reproduce the behavior:

  1. Type コピペ★
  2. See error image
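To check which character in the test string is the ambiguous one, a short diagnostic along these lines may help. `describe` is a hypothetical helper name; the only library calls are `unicodedata.name` and `unicodedata.east_asian_width` from the Python standard library.

```python
import unicodedata


def describe(text: str) -> list[tuple[str, str, str]]:
    """Return (code point, character name, East_Asian_Width) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "?"), unicodedata.east_asian_width(ch))
        for ch in text
    ]


if __name__ == "__main__":
    for code_point, name, eaw in describe("コピペ★"):
        print(code_point, name, eaw)
```

The three katakana are classified as W (wide), while U+2605 BLACK STAR is classified as A (ambiguous), so only the star's width depends on how the terminal resolves ambiguity.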

Environment details

kitty 0.28.1 (877d8d7008) created by Kovid Goyal
Linux debu.tanuki.agency 6.4.10-zen2-1-zen #1 ZEN SMP PREEMPT_DYNAMIC Sun, 13 Aug 2023 01:33:59 +0000 x86_64
Arch Linux 6.4.10-zen2-1-zen (/dev/tty)

DISTRIB_ID="Arch"
DISTRIB_RELEASE="rolling"
DISTRIB_DESCRIPTION="Arch Linux"
Running under: Wayland
Frozen: False
Paths:
  kitty: /usr/bin/kitty
  base dir: /usr/lib/kitty
  extensions dir: /usr/lib/kitty/kitty
  system shell: /bin/bash
Loaded config files:
  /home/fred/.config/kitty/kitty.conf

Config options different from defaults:
bold_font             IBM Plex Sans Mono Bold
bold_italic_font      IBM Plex Sans Mono Bold Italic
cursor_blink_interval 0.0
font_family           IBM Plex Sans Mono
font_size             16.0
force_ltr             True
italic_font           IBM Plex Sans Mono Italic

Important environment variables seen by the kitty process:
    PATH                                /opt/google-cloud-cli/bin:/usr/local/sbin:/usr/local/bin:/usr/bin:/opt/android-sdk/platform-tools:/opt/cuda/bin:/opt/cuda/nsight_compute:/opt/cuda/nsight_systems/bin:/home/fred/.dotnet/tools:/var/lib/flatpak/exports/bin:/usr/lib/jvm/default/bin:/usr/lib32/jvm/default/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl:/usr/lib/rustup/bin:/var/lib/snapd/snap/bin
    LANG                                en_US.UTF-8
    SHELL                               /bin/bash
    GLFW_IM_MODULE                      ibus
    DISPLAY                             :1
    WAYLAND_DISPLAY                     wayland-0
    USER                                fred
    XCURSOR_SIZE                        24
    XDG_CACHE_HOME                      /home/fred/.cache
    XDG_CONFIG_DIRS                     /home/fred/.config/kdedefaults:/etc/xdg
    XDG_CONFIG_HOME                     /home/fred/.config
    XDG_CURRENT_DESKTOP                 KDE
    XDG_DATA_DIRS                       /home/fred/.local/share/flatpak/exports/share:/var/lib/flatpak/exports/share:/usr/local/share:/usr/share:/var/lib/snapd/desktop
    XDG_DATA_HOME                       /home/fred/.local/share
    XDG_RUNTIME_DIR                     /run/user/1000
    XDG_SEAT                            seat0
    XDG_SEAT_PATH                       /org/freedesktop/DisplayManager/Seat0
    XDG_SESSION_CLASS                   user
    XDG_SESSION_DESKTOP                 KDE
    XDG_SESSION_ID                      936
    XDG_SESSION_PATH                    /org/freedesktop/DisplayManager/Session1
    XDG_SESSION_TYPE                    wayland
    XDG_STATE_HOME                      /home/fred/.local/state
    XDG_VTNR                            2

Additional context

ctrlcctrlv commented 1 year ago

I suggest the following default rules:

It is itself EAW fullwidth.

kovidgoyal commented 1 year ago

As far as I know, no terminal programs follow these rules. Changing this in kitty will break things for anyone actually using these characters. Ideally, developers of several major TUI programs should agree to this before it is implemented in kitty. Currently, as far as I know, no kitty users have reported actual issues with ambiguous width characters. Making this change will cause issues, because the program running in the terminal will no longer agree with kitty on what the width should be.

As such, I am not particularly keen to implement this. If you can point to some other terminal emulators or, better, major terminal programs that have implemented or plan to implement it, I will reconsider.

ctrlcctrlv commented 1 year ago

mlterm follows these rules: image

ctrlcctrlv commented 1 year ago

https://github.com/fumiyas/wcwidth-cjk too

kovidgoyal commented 1 year ago

There is no way wcwidth can implement the algorithm you describe since it returns widths of characters in isolation. One would need wcswidth for that.
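The distinction can be sketched in Python. This is an illustration under stated assumptions, not the API of wcwidth-cjk or of any C library: `char_width` and `string_width` are hypothetical names, and the adjacency rule (an ambiguous character becomes wide when a neighbour is wide) is borrowed from the quotation-mark example in the UAX №11 text quoted earlier, not from any agreed terminal behaviour.

```python
import unicodedata


def _is_wide(ch: str) -> bool:
    """True for East_Asian_Width W or F."""
    return unicodedata.east_asian_width(ch) in ("W", "F")


def char_width(ch: str) -> int:
    """wcwidth-style: sees one character, so ambiguous gets a fixed answer (narrow here)."""
    return 2 if _is_wide(ch) else 1


def string_width(text: str) -> int:
    """wcswidth-style: sees the whole string, so neighbours can be consulted."""
    total = 0
    for i, ch in enumerate(text):
        if _is_wide(ch):
            total += 2
        elif unicodedata.east_asian_width(ch) == "A":
            # Assumed rule: wide if the previous or next character is wide.
            neighbours = text[max(i - 1, 0):i] + text[i + 1:i + 2]
            total += 2 if any(_is_wide(n) for n in neighbours) else 1
        else:
            total += 1
    return total


if __name__ == "__main__":
    for sample in ("コピペ★", "a★b"):
        per_char = sum(char_width(ch) for ch in sample)
        print(sample, per_char, string_width(sample))
```

The per-character sum gives the same width to ★ in both samples, while the string-level function can give it two cells next to katakana and one cell between Latin letters.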

ctrlcctrlv commented 1 year ago

I did not name the wcwidth-cjk repo.