cursor location seems incorrect following zero-width-joined emoji

dankamongmen commented 3 years ago

Describe the bug Emoji joined via a ZWJ display properly, but the cursor is moved too far. An example:

[schwarzgerat](0) $ printf '\e[6n\U0001f9d1\u200d\U0001f33e\e[6n'
🧑‍🌾  [schwarzgerat](0) $ 1R5R

as you can see, the cursor report indicates that we have moved four -- which is accurate, as we have indeed moved four cursor positions forward. we only ought have moved two.

FWIW, it's nice that kitty actually implements this; it's ahead of most terminals in this regard.

To Reproduce Steps to reproduce the behavior:

Get a cursor report
Print an emoji created via ZWJ (there might be some that work; the example above does not)
Get a cursor report
Look and see that spaces have been printed where no spaces were desired

Screenshots 2021-07-07-102248_781x659_scrot

Environment details

kitty 0.21.2 (e07ba2c53d) created by Kovid Goyal
Linux schwarzgerat 5.12.14nlb #1 SMP Sun Jul 4 17:05:42 EDT 2021 x86_64
Debian GNU/Linux 11 \n \l
Running under:X11
Loaded config files:
  /etc/xdg/kitty/kitty.conf
  /home/dank/.config/kitty/kitty.conf

Config options different from defaults:
background_opacity    0.7
enable_audio_bell     False
font_family           Hack
initial_window_height (72, 'cells')
initial_window_width  (132, 'cells')
scrollback_lines      20000
update_check_interval 0.0

This debug output has been copied to the clipboard

Additional context Happens the same way.

kovidgoyal commented 3 years ago

Yeah it's on my TODO list to add the ~1400 such sequences from https://unicode.org/emoji/charts/emoji-zwj-sequences.html to gen-wcwidth.py, generate some kind of fast lookup data structure for them and then use them in kitty's wcswidth() and text processing functions, however its a fairly large task with very minimal payoff since that still wont actually make them use able until some terminal programs add support for them in their wcswidth() implementations. In fact the current cursor movement in kitty actually matches most shells and editors expectations, changing that is likely to actually break more things in the short term.

If you wish to add support for them PRs are welcome.

More general and complete grapheme clustering: https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries

dankamongmen commented 3 years ago

yep, my assessment is similar. i'm looking at trying to implement unicode faithfully in this regard, at which point i'll start seeing which terminals match it. wcswidth() definitely breaks down; i think a big piece of the puzzle here would be a self-contained low-level, fast, minimal-memory EGC-to-width map. if kitty already has a fairly advanced one, how difficult would it be to extract and package up as an .so (by difficult i don't so much mean technical aspects, but more would you be interested in moving that external to kitty, and who would be maintaining it, etc)?

kovidgoyal commented 3 years ago

Yes, that's another item for my endless TODO list. It's never been worth the trouble for me to actually do the work in the face of higher priority items, but if you wish to collaborate on it, I am happy to be a co-maintainer.

I would be willing to donate kitty's gen-wcwidth.py to such a project, and to change kitty to depend on the .so produced by such a project. The caveat of course is that gen-wcwidth.py is used for a lot more than just wcswidth(), so the project would either have to become a bit of a grab-bag of functions useful for terminal text processing, or else it would become standalone and we would need to keep kitty and the projects wcswidth() functions in sync by hand.

latipun7 commented 2 years ago

Sorry, I don't know the technical details, but I have an issue. When I type (copy paste) the emoji with ZWJ sequence, it does not print as a single emoji. technologist emoji Does my issue related to this issue? Thanks.

Step to reproduce:

go to https://emojipedia.org/technologist/ and copy (to the clipboard) the emoji
edit text file in terminal using file editor (nvim and go to insert mode)
paste to the terminal from clipboard (Ctrl+Shift+V)
the emoji shown as the image above

Debug Details

```log kitty 0.24.2 created by Kovid Goyal Linux ROG 5.16.10-arch1-1 #1 SMP PREEMPT Wed, 16 Feb 2022 19:35:18 +0000 x86_64 Arch Linux 5.16.10-arch1-1 (/dev/tty) Running under: X11 Frozen: False Paths: kitty: /usr/bin/kitty base dir: /usr/lib/kitty extensions dir: /usr/lib/kitty/kitty system shell: /usr/bin/zsh Loaded config files: /home/latipun/.config/kitty/kitty.conf Config options different from defaults: background_opacity 0.75 cursor_shape 3 font_size 13.0 scrollback_pager_history_size 19922944 window_padding_width FloatEdges(left=7.0, top=7.0, right=7.0, bottom=7.0) Changed shortcuts: ctrl+shift+delete → combine : clear_terminal active : send_text normal \x0c Colors: active_border_color #c9cbff active_tab_background #575268 active_tab_foreground #f5c2e7 background #1e1e2e bell_border_color #fae3b0 color0 #6e6c7e color1 #f28fad color10 #abe9b3 color11 #fae3b0 color12 #96cdfb color13 #f5c2e7 color14 #89dceb color15 #d9e0ee color2 #abe9b3 color3 #fae3b0 color4 #96cdfb color5 #f5c2e7 color6 #89dceb color7 #d9e0ee color8 #988ba2 color9 #f28fad cursor #f5e0dc cursor_text_color #1e1e2e foreground #d9e0ee inactive_border_color #575268 inactive_tab_background #1e1e2e inactive_tab_foreground #d9e0ee mark1_background #96cdfb mark1_foreground #1e1e2e mark2_background #f5c2e7 mark2_foreground #1e1e2e mark3_background #b5e8e0 mark3_foreground #1e1e2e selection_background #575268 selection_foreground #d9e0ee tab_bar_background #161320 url_color #f5e0dc Environment variable names seen by the kitty process: ALTUSERXSESSION BROWSER CARGO_HOME DBUS_SESSION_BUS_ADDRESS DESKTOP_SESSION DISPLAY EDITOR ERRFILE FNM_DIR FORCE_COLOR GDMSESSION GH_TOKEN GOPATH GPG_TTY GTK2_RC_FILES GTK_MODULES GTK_RC_FILES HOME HOMEBREW_GITHUB_API_TOKEN LANG LC_COLLATE LC_CTYPE LC_MEASUREMENT LC_NUMERIC LC_TELEPHONE LESS LOGNAME MAIL MANPAGER MOTD_SHOWN NNN_ARCHIVE NNN_COLORS NNN_FCOLORS NNN_FIFO NNN_OPTS NPM_TOKEN PATH PWD QT_QPA_PLATFORMTHEME RUSTUP_HOME SHELL SHLVL USER USERXSESSION USERXSESSIONRC VISUAL VSCE_PAT XAUTHORITY XDG_CACHE_HOME XDG_CONFIG_HOME XDG_DATA_HOME XDG_GREETER_DATA_DIR XDG_RUNTIME_DIR XDG_SEAT XDG_SEAT_PATH XDG_SESSION_CLASS XDG_SESSION_DESKTOP XDG_SESSION_ID XDG_SESSION_PATH XDG_SESSION_TYPE XDG_STATE_HOME XDG_VTNR ```

ghost commented 2 years ago

This seems related to my issue, so I'm posting here. I just switched to gnome-terminal which has inconsistent rendering of emojis. At first I was pleased to see that kitty renders them properly, but then I noticed that the cursor positioning is problematic. As it stands, this is actually less usable than gnome-terminal.

https://user-images.githubusercontent.com/482367/158066201-131e8cb2-d788-44f9-bfa8-64d39ce5a190.mp4

kovidgoyal commented 2 years ago

That will be because whatever terminal program you are running is using a different width calculation from what kitty uses, which is based on the unicde standard. And unless your emoji is using zwj it is completely unrelated so post elsewhere.

trygveaa commented 2 years ago

Sorry, I don't know the technical details, but I have an issue. When I type (copy paste) the emoji with ZWJ sequence, it does not print as a single emoji. Does my issue related to this issue? Thanks.

@latipun7: No, this is caused by zsh not supporting ZWJ. If you run cat and paste it you'll see that it works fine (except for the cursor problem this issue is about).

sharnoff commented 2 years ago

@kovidgoyal @dankamongmen Just found this bug after working on a unicode library designed to handle this sort of thing. It winds up being fairly easy, and AFAICT my solution should be decently efficient (binary encoding of a codepoint-based Trie). The data for just emoji sequences (both zwj and otherwise) ends up at ~48kb, with a decent amount of extra data that kitty probably wouldn't need.

If there's interest, I'd be happy to provide suggestions for how it could be implemented in C

kovidgoyal commented 2 years ago

On Mon, Jun 13, 2022 at 08:56:55AM -0700, Max Sharnoff wrote:

@kovidgoyal @dankamongmen Just found this bug after working on a unicode library designed to handle this sort of thing. It winds up being fairly easy, and AFAICT my solution should be decently efficient (binary encoding of a codepoint-based Trie). The data for just emoji sequences (both zwj and otherwise) ends up at ~48kb, with a decent amount of extra data that kitty probably wouldn't need.

If there's interest, I'd be happy to provide suggestions for how it could be implemented in C

Sure, I am always happy to discuss design ideas. By this sort of thing do you mean wcswidth() in general or looking up emoji combining sequences in particular. For zwj+emoji support in kitty one needs basically:

1) Adding zwj+emoji support to wcswidth_step() which basically tells you how the width of a string changes when you add a codepoint to it.

2) Changing kitty's cell data structure to support infinite length codepoint strings. This will likely be a auxilliary hash mapping shorts to heap allocated codepoint arrays. The shorts will be stored per cell. This will reduce cell memory usage by 4 bytes at the cost of making looking up the text in a cell more expensive (which is fortunately not a frequent operation).

sharnoff commented 2 years ago

do you mean wcswidth() in general or looking up emoji combining sequences in particular

I was looking just at emoji sequences -- given that they're all suppposed to have the same width (in practice, font combinations can mess this up IIRC - e.g., defaulting to text presentation when the Emoji spec says otherwise, which is often a width of 1 column).

For adding support:

I was separating into grapheme clusters first, but the operation is mostly the same. If I were implementing this for kitty, I'd use the same sort of trie of codepoints, where each node stores (a) whether the sequence up to that point is one of the zwj sequences, or (b) how wide it would be otherwise. Changing wcswidth_step() gets a little tricky, because maybe it's implicitly expected not to decrease the width. Another tricky spot is handling cases like "this is most of a long zwj sequence, but it's missing the end so it's actually mutiple separate emoji now" -- there's a bit of extra work if those need to be separated into cells in post (and possibly re-parsed for new zwj sequence starters - I'm not sure whether any cases of this are possible).

I don't already know anything about kitty's cell datastructure, but I'd just add: if you're only using the listed zwj sequences, the longest sequence in the set is currently only 10 codepoints -- it doesn't need to be unbounded. But there are a small enough number of them (currently 1349) that packing them into existing empy space in the datatypes may be possible.

kovidgoyal commented 2 years ago

On Mon, Jun 13, 2022 at 02:27:19PM -0700, Max Sharnoff wrote:

do you mean wcswidth() in general or looking up emoji combining sequences in particular

I was looking just at emoji sequences -- given that they're all suppposed to have the same width (in practice, font combinations can mess this up IIRC - e.g., defaulting to text presentation when the Emoji spec says otherwise, which is often a width of 1 column).

In a terminal context width calculations must be font independent.

For adding support:

I was separating into grapheme clusters first, but the operation is mostly the same. If I were implementing this for kitty, I'd use the same sort of trie of codepoints, where each node stores (a) whether the sequence up to that point is one of the zwj sequences, or (b) how wide it would be otherwise. Changing wcswidth_step() gets a little tricky, because maybe it's implicitly expected not to decrease the width. Another tricky spot is handling cases like "this is most of a long zwj sequence, but it's missing the end so it's actually mutiple separate emoji now" -- there's a bit of extra work if those need to be separated into cells in post (and possibly re-parsed for new zwj sequence starters - I'm not sure whether any cases of this are possible).

wcswidth_step() must be able to reduce width to support VS16 which converts emoji to text presentation. So that's not an issue.

I don't already know anything about kitty's cell datastructure, but I'd just add: if you're only using the listed zwj sequences, the longest sequence in the set is currently only 10 codepoints -- it doesn't need to be unbounded. But there are a small enough number of them (currently 1349) that packing them into existing empy space in the datatypes may be possible.

There is no empty space in the struct at the moment, however given the small number of such sequences one could encode them into the existing codepoint to mark mapping (kitty maps all combining unicode marks to numbers since there are only about 2000 odd of them they fit in a short. One could easily add another 2000 marks to this mapping. However, the question is how future proof this is and whether its not better to just allow infinite length sequences now.

dankamongmen commented 2 years ago

Sure, I am always happy to discuss design ideas. By this sort of thing do you mean wcswidth() in general or looking up emoji combining sequences in particular. For zwj+emoji support in kitty one needs basically: 1) Adding zwj+emoji support to wcswidth_step() which basically tells you how the width of a string changes when you add a codepoint to it. 2) Changing kitty's cell data structure to support infinite length codepoint strings. This will likely be a auxilliary hash mapping shorts to heap allocated codepoint arrays. The shorts will be stored per cell. This will reduce cell memory usage by 4 bytes at the cost of making looking up the text in a cell more expensive (which is fortunately not a frequent operation).

fwiw @sharnoff notcurses would need pretty much the exact same thing as @kovidgoyal mentions for his wcswidth_step(), except it would be in utf8_egc_len(). notcurses already handles arbitrarily long EGCs via spillover to the egcpool structure, of which there is one per ncplane.

kovidgoyal / kitty

cursor location seems incorrect following zero-width-joined emoji #3810