Correct Grapheme Width - Githubissues

pascalkuthe commented 1 year ago

There have been multiple issues piling up on the issue tracker where helix behaves weirdly in the presence of certain Unicode graphemes (usually emoji-like characters).

Furthermore some emojis are rendered too wide with additional black space. Open the following line in helix for example:

https://github.com/helix-editor/helix/blob/715c4b24d94c9e2fa70d5d59ce658b89fbde0392/helix-core/src/graphemes.rs#L112

I have been looking into these issues and the underlying cause/fix is the same but the solution is not trivial. To avoid collecting a bunch of unrelated issues I decided to create an umbrella issue here and record the results of my research.

Since the problems manifest differently (or not at all) across various editors, I assumed that this was simply a case of there being no common standard (I was not 100% wrong, see below) and that we could not do much about this. However, looking into this further, it seems that other TUI applications (like nvim) do handle these characters correctly. While some characters may overlap in some terminals that resize characters (kitty), there are no weird visual glitches like with helix.

It seems that terminal emulators, for compatibility reasons, all mostly agree on how many terminal columns a grapheme should take (even if kitty renders some larger, that doesn't affect the actual grid layout).

The problem is that the width supplied by unicode_width does not align with this character grid. Adding one or two small edge cases like suggested in https://github.com/helix-editor/helix/issues/4932#issuecomment-1380884056 doesn't work because there are actually a LOT of edge cases (that all behave differently). The comment by wez linked there is quite old.

Nowadays wezterm uses termwiz instead, which uses https://github.com/ridiculousfish/widecharwidth/ to generate a much more accurate column width function (and then performs some special casing and emoji detection on top of that).

Even then depending on which version of Unicode is targeted the correct output may be different, see https://wezfurlong.org/wezterm/config/lua/config/unicode_version.html.

There are a couple ways forward:

- We should replace unicode_width in helix-core with something more accurate based on https://github.com/ridiculousfish/widecharwidth/, similar to what termwiz does.
  - This also needs to be done in helix-tui.
  - Open question: does termwiz do any further magic here (I don't think so), or is just using the correct width enough?
- We should allow configuring the Unicode version like wezterm does. Ideally we could even try to support these OSC escape sequences to set the correct Unicode version.
- We might just switch to termwiz and get all of this for free. However, termwiz is quite heavy (large codebase, depends on multiple hashing algorithms, the pest parser generator and 3 different Unicode segmentation crates). Do we want to do that?

kchibisov commented 1 year ago

> The problem is that the width supplied by unicode_width does not align with this character grid. Adding one or two small edge cases like suggested in #4932 (comment) doesn't work because there are actually a LOT of edge cases (that all behave differently). The comment by wez linked there is quite old.

The width of characters is usually defined by the Unicode standard, so the comment wrt emojis is not really good (if you follow the link chain). If you ever tried using a terminal which does ZWJ combinations (kitty) and put them in e.g. bash, it'll simply blow up.

Changing the width function will simply shift the issue, you'll probably make things look the same in wezterm, but break 3 other terminals using conservative width functions, like wcwidth from glibc or unicode-width crate.
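To make the mismatch concrete, here is a minimal sketch of how a conservative per-code-point sum and a cluster-aware width can disagree on a ZWJ sequence. The width table is a toy made up for illustration (it is neither glibc's `wcwidth` nor the unicode-width crate), and taking the maximum code point width for a cluster is a simplifying assumption, not what any particular terminal is confirmed to do.

```rust
/// Toy per-code-point width, loosely in the spirit of wcwidth.
/// ASSUMPTION: this table is illustrative only and far from complete.
pub fn toy_codepoint_width(c: char) -> usize {
    match c {
        '\u{200D}' => 0,                // ZERO WIDTH JOINER
        '\u{FE0E}' | '\u{FE0F}' => 0,   // variation selectors VS15 / VS16
        '\u{1F300}'..='\u{1FAFF}' => 2, // rough range covering many emoji
        _ => 1,
    }
}

/// Conservative behaviour: every code point advances the cursor on its own.
pub fn summed_width(s: &str) -> usize {
    s.chars().map(toy_codepoint_width).sum()
}

/// Cluster-aware behaviour (simplified): the whole grapheme cluster occupies
/// as many cells as its widest code point.
pub fn clustered_width(cluster: &str) -> usize {
    cluster.chars().map(toy_codepoint_width).max().unwrap_or(0)
}
```

For the sequence U+1F468 ZWJ U+1F469 ZWJ U+1F467 the first function reports 6 cells and the second reports 2, so an application and a terminal that pick different strategies end up with cursors four columns apart.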

I think the only real way to solve anything here is to use OSC sequences which help define the width for edge cases like ZWJ, and at the very least to do some research wrt who supports what. But I think I have only heard about it and never seen it; probably the contour author told me about it at some point.

A good idea would be to check what contour, kitty, and wezterm do wrt handling of conservative applications like bash, in particular whether they unconditionally alter the width (I think at least some of them alter the width at runtime).

To sum up, changing the width function will simply move the issue to some other terminals from the ones you see in the reports.

Also, you linked issues from Windows and kitty. While kitty is known to be "advanced in that area" (it does emoji combining, breaking the total width), I should warn any reader not familiar with Windows about the state of things on this platform.

When it comes to Windows, you have a shim (ConPty) between you (helix) and the terminal. This shim maintains its own grid, does reflow on it (at least it was doing so in the past), and wasn't even passing CJK through the way it should in some old revisions. The cursor movements are also weird (I think I get a report from a Windows user on a monthly basis that they can't move one char up in a plain, fully ASCII environment, and that updating the Windows version solves the issue).

So unless Microsoft adds a passthrough mode to their shim and provides it for every other terminal on Windows, I'd take every issue from the Windows platform with a grain of salt. You can't really solve them; you simply wait for Microsoft to fix their software.

Also, be aware that ConPty is also bundled by some terminals, because Microsoft doesn't really care about updating the system ConPty version, so you can't be sure what is even used in such issues.

EpocSquadron commented 1 year ago

The author of the still-in-private-beta ghostty terminal wrote about this fairly recently. An emerging standard (mode 2027) from the contour author allows us to query for support for proper grapheme width calculation and fall back to wcwidth if it is not supported, which should achieve much better results.

rockorager commented 1 year ago

> The author of the still-in-private-beta ghostty terminal wrote about this fairly recently. An emerging standard (mode 2027) from contour author allows us to query for support for proper grapheme width calculation and fall back to wcwidth if not, which should achieve much better results.

Note that foot also supports this (PR).

Foot, contour, ghostty, and wezterm are the only four terminals which employ grapheme clustering in this way (at least that I have run across). I think you can be fairly confident that if you get a response that 2027 is set / set-able, then you can use correct Unicode width calculations.
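A minimal sketch of that check, assuming the terminal answers the standard DECRQM query (`CSI ? 2027 $ p`) with a DECRPM report (`CSI ? 2027 ; Ps $ y`). The type and function names are made up for illustration; actually writing the query and reading the reply from the tty is left out.

```rust
/// DECRQM query asking for the state of private mode 2027.
pub const QUERY_MODE_2027: &str = "\x1b[?2027$p";
/// DECSET sequence that turns mode 2027 on once we know it is supported.
pub const ENABLE_MODE_2027: &str = "\x1b[?2027h";

/// The Ps value of a DECRPM report.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ModeStatus {
    NotRecognized,    // Ps = 0
    Set,              // Ps = 1
    Reset,            // Ps = 2
    PermanentlySet,   // Ps = 3
    PermanentlyReset, // Ps = 4
}

impl ModeStatus {
    /// "Set / set-able": already on, or off but allowed to be turned on.
    pub fn grapheme_clustering_available(self) -> bool {
        matches!(
            self,
            ModeStatus::Set | ModeStatus::Reset | ModeStatus::PermanentlySet
        )
    }
}

/// Parse a reply of the form `ESC [ ? <mode> ; <Ps> $ y`.
/// Returns None if the reply is malformed or reports a different mode.
pub fn parse_decrpm(reply: &str, mode: u16) -> Option<ModeStatus> {
    let body = reply.strip_prefix("\x1b[?")?.strip_suffix("$y")?;
    let (reported_mode, ps) = body.split_once(';')?;
    if reported_mode.parse::<u16>().ok()? != mode {
        return None;
    }
    match ps.parse::<u8>().ok()? {
        0 => Some(ModeStatus::NotRecognized),
        1 => Some(ModeStatus::Set),
        2 => Some(ModeStatus::Reset),
        3 => Some(ModeStatus::PermanentlySet),
        4 => Some(ModeStatus::PermanentlyReset),
        _ => None,
    }
}
```

An application would write `QUERY_MODE_2027`, wait briefly for a reply, and treat a missing reply, a `None`, or `NotRecognized` as "fall back to wcwidth-style widths".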

mitchellh commented 11 months ago

> allows us to query for support for proper grapheme width calculation and fall back to wcwidth if not

Thanks for linking my blog post 😄 Happy to answer any questions about this if you have them. This quoted conclusion you came to is one of my general recommendations for terminal applications: assume libc wcwidth unless the terminal responds to mode 2027 and then use Unicode standard character width.

Note (as I say in the blog post) this is still not a safe assumption. If mode 2027 is not present, terminals do ALL sorts of stuff. The only safe way to do anything without mode 2027 is to query the cursor position after any character but that's pretty terrible.
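For reference, the cursor-position fallback could look roughly like the sketch below. It assumes the usual DSR request (`CSI 6 n`) and the `CSI row ; col R` report; the helper names are hypothetical, and the round trip to the terminal after every character is exactly the cost that makes this approach unattractive.

```rust
/// DSR request asking the terminal to report the cursor position.
pub const QUERY_CURSOR_POSITION: &str = "\x1b[6n";

/// Parse a cursor position report `ESC [ <row> ; <col> R` into (row, col).
/// Both values are 1-based, as reported by the terminal.
pub fn parse_cpr(reply: &str) -> Option<(u16, u16)> {
    let body = reply.strip_prefix("\x1b[")?.strip_suffix('R')?;
    let (row, col) = body.split_once(';')?;
    Some((row.parse().ok()?, col.parse().ok()?))
}

/// Width the terminal actually used for whatever was printed between two
/// reports. Returns None if the cursor changed rows (for example because the
/// line wrapped) or moved backwards, since the width cannot be inferred then.
pub fn measured_width(before: (u16, u16), after: (u16, u16)) -> Option<u16> {
    if before.0 == after.0 {
        after.1.checked_sub(before.1)
    } else {
        None
    }
}
```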

So the only reason I recommend assuming libc wcwidth is because it gives you a sound explanation of why your program behaves the way it does in the face of people reporting issues. And because in most terminals wcwidth is also how they work. But you can't bet on it.

Also note that you have to handle VS15/VS16. I'm not familiar with the Rust ecosystem, but looking at the fish library you linked, it does not seem to handle VS15/16 for you (that's not abnormal). In this case, you need to modify any character width to 1 for VS15 and 2 for VS16. To be totally correct, you should only do this if VS15/16 is valid for the grapheme, which can be checked in the UCD.
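A rough sketch of that adjustment in Rust, assuming the caller has already segmented the text into grapheme clusters and computed a base width with whatever width function is in use. The function name is hypothetical, and the UCD validity check is deliberately left out, so this version applies the override to any grapheme that carries a selector.

```rust
/// VARIATION SELECTOR-15: request text presentation.
pub const VS15: char = '\u{FE0E}';
/// VARIATION SELECTOR-16: request emoji presentation.
pub const VS16: char = '\u{FE0F}';

/// Override the base width of one grapheme cluster if it contains a
/// variation selector: 1 cell for VS15, 2 cells for VS16.
/// ASSUMPTION: no check is made that the selector is valid for the base
/// character; a complete implementation would consult the UCD first.
pub fn apply_variation_selector(grapheme: &str, base_width: usize) -> usize {
    for ch in grapheme.chars() {
        match ch {
            VS15 => return 1,
            VS16 => return 2,
            _ => {}
        }
    }
    // No selector present: keep whatever the width function returned.
    base_width
}
```

For example, U+2764 followed by VS16 becomes 2 cells even if the base width function returned 1, while the same character followed by VS15 becomes 1 cell.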

helix-editor / helix

Correct Grapheme Width #6012
