Konsole: overly wide Unicode characters mess up intended layouts

electroly commented 3 years ago

I'm trying to find a solution to this rendering issue in Konsole.


This is a TEditor example showing how the line formatting gets shifted around due to those ❺ symbols. Also, the title of this window is clipped (there's no close parenthesis). These symbols are rendered slightly wider than a single cell, leading to the rest of the printed line being shifted out of alignment. Other terminals show this symbol in a single terminal cell, but Konsole is painting the line with variable character widths with chaotic effects.

Here's my evolution of the ASCII table from tvdemo. On the right you can see some extra-wide characters that mess up the whole line. ⑪-⑯ and ❽-❿ are clipped. When you click on one, it selects the "wrong" character due to the rendering discrepancy.

Open to suggestions on this. Konsole is the only terminal I've tested so far with this issue.

magiblot commented 3 years ago

This is a bug in Konsole's bi-directional text rendering support and there's nothing Turbo Vision can do about it. The problem goes away if you disable this feature from Settings > Edit Current Profile > Advanced > Uncheck Bi-Directional text rendering.

Change that setting and share the results, because there may still be something wrong with these apparently double-width characters.

electroly commented 3 years ago

It was checked, but unchecking it didn't seem to fix it for me. I unchecked it, hit OK, then closed and reopened Konsole just in case.

magiblot commented 3 years ago

Disabling bi-directional text rendering at least improved the cursor movement issue for me. But in your case it looks like Konsole is additionally rendering certain characters as double-width when they are not. I don't know why.

Can you please run the following program and share the output?

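```
#include <locale.h>
#include <wchar.h>
#include <stdio.h>

// Pairs each character's UTF-8 text with its wide-character form.
struct UChar { const char *mbc; wchar_t wc; };

const UChar chars[] =
{
    {"a", L'a'},
    {"❶", L'❶'},
    {"⑪", L'⑪'},
    {"🤡", L'🤡'},
};

int main()
{
    // Use the user's locale so that wcwidth() knows about non-ASCII characters.
    setlocale(LC_ALL, "");
    for (const auto &ch : chars)
        printf("wcwidth(%s) = %d\n", ch.mbc, wcwidth(ch.wc));
}
```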

I get the following:

wcwidth(a) = 1
wcwidth(❶) = 1
wcwidth(⑪) = 1
wcwidth(🤡) = 2

magiblot commented 3 years ago

Actually, please share a screenshot of the result so that we can see whether wcwidth returns a different result for you or Konsole is simply not respecting it.

Here's mine: Screenshot_20210327_215500

electroly commented 3 years ago

Here's what I get.

They look like 1.5-width to me. It's no longer aligned with the character grid after the ❶ character.

electroly commented 3 years ago

In my earlier screenshot of the character picker, I notice the scrollbar is in the correct spot and it cuts off the characters inside the list, rather than itself being shifted to the right. That makes me think that if we had a list of the affected characters, this could be worked around by always explicitly moving to the expected screen position after writing one of those characters. That would hopefully cut off the right side of the over-wide character and allow the rest of the line to be in the right spot.
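
The workaround described above could be sketched roughly as follows. This is only an illustration: `Cell`, its `overWide` flag and `renderRow` are hypothetical names, the list of affected characters would have to come from somewhere, and every cell is assumed to be one column wide. The only terminal feature used is the standard CHA sequence (`CSI <col> G`), which moves the cursor to an absolute column.

```cpp
#include <string>
#include <vector>

// One screen cell's worth of text, already encoded as UTF-8.
struct Cell
{
    std::string utf8; // the character to print
    bool overWide;    // hypothetical flag: the terminal is known to draw it too wide
};

// Builds the bytes for one row of cells starting at 1-based column `startCol`.
// After every flagged cell, a CHA sequence (CSI <col> G) moves the cursor to
// the column where the next cell is expected to begin.
std::string renderRow(const std::vector<Cell> &cells, int startCol)
{
    std::string out;
    int col = startCol;
    for (const Cell &cell : cells)
    {
        out += cell.utf8;
        ++col; // assumption: every cell is exactly one column wide
        if (cell.overWide)
            out += "\x1b[" + std::to_string(col) + "G";
    }
    return out;
}
```

For a row made of `a`, a flagged `❶` and `b` starting at column 1, this emits `a❶`, then `ESC [ 3 G`, then `b`. Whether Konsole would actually clip the over-wide glyph after the cursor is repositioned is not something this sketch can guarantee.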

magiblot commented 3 years ago

Well, it's clear that there's something wrong with how Konsole chooses to render these characters. I suggest you try another terminal emulator (alacritty, gnome-terminal, kitty, or even xterm). If the problem persists in any of these, then the problem may lie in the font rendering libraries. Otherwise, it may be an issue unique to Konsole (which version are you using, BTW?).

> In my earlier screenshot of the character picker, I notice the scrollbar is in the correct spot and it cuts off the characters inside the list, rather than itself being shifted to the right.

This suggests to me that Konsole is aware these characters are actually just one cell wide, but for some reason they are rendered wider than they should.

> That makes me think that if we had a list of the affected characters, this could be worked around by always explicitly moving to the expected screen position after writing one of those characters. That would hopefully cut off the right side of the over-wide character and allow the rest of the line to be in the right spot.

I have never experienced this issue before, so it can be assumed that not all Konsole users suffer from it. Then, how would Turbo Vision detect whether this issue is happening or not? I don't think enabling this workaround unconditionally would be very comfortable.

Cheers.

magiblot commented 3 years ago

Also, please try using different fonts (mine is Hack).

electroly commented 3 years ago

This is Konsole version 17.12.3. Changing the font does fix it. The fonts preinstalled on Ubuntu, "DejaVu Sans Mono", "Courier 10 Pitch", "Nimbus Mono L", and "Noto Mono", all produce the over-wide characters. My favorite third-party font (Iosevka) looks correct. I wonder if this is some kind of font fallback issue; maybe ❶ doesn't exist in any of the built-in mono fonts.

A workable solution for me is to simply omit these characters from the symbol picker, but they are handy-looking glyphs that I'd like to salvage if I can. Another workable solution for me is to ignore the problem and just let it be broken on Konsole. These characters look good in every other system and terminal combination I've tried.

magiblot commented 3 years ago

> Another workable solution for me is to ignore the problem and just let it be broken on Konsole. These characters look good in every other system and terminal combination I've tried.

That's what I would do. At the most, this issue can be documented somewhere so that in the unlikely case a user runs across it, they can fix it themselves.

electroly commented 3 years ago

Works for me. Thanks!

unxed commented 2 months ago

wcwidth() lies in a huge number of cases. The only reliable way to determine the actual character width is as it is done for Windows, by outputting the character and measuring the cursor offset. By the way, dividing a line into grapheme clusters is possible using the same method. An example in Python is here: https://github.com/elfmz/far2l/issues/2378#issuecomment-2336818193
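
For readers unfamiliar with the technique, here is a minimal sketch of its non-interactive part, assuming the terminal answers the standard cursor position query (DSR 6, `ESC [ 6 n`) with a reply of the form `ESC [ row ; col R`. The function names are made up for illustration, and the actual terminal I/O (writing the query, reading the reply) is left out.

```cpp
#include <cstdio>
#include <string>

// DSR 6 ("report cursor position"); the terminal replies with ESC [ row ; col R.
const char cursorPositionQuery[] = "\x1b[6n";

// Extracts the 1-based column from a reply of the form ESC [ <row> ; <col> R.
// Returns -1 if the reply does not have that shape.
int parseCursorColumn(const std::string &reply)
{
    int row = 0, col = 0;
    char last = 0;
    if (std::sscanf(reply.c_str(), "\x1b[%d;%d%c", &row, &col, &last) == 3 && last == 'R')
        return col;
    return -1;
}

// Width of the text printed between two cursor reports taken on the same row.
int measuredWidth(int colBefore, int colAfter)
{
    return colAfter - colBefore;
}
```

An application would write `cursorPositionQuery`, print the character, write the query again and subtract the two columns. The sketch ignores line wrapping at the right edge of the screen.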

magiblot commented 2 months ago

Hi @unxed!

I'm afraid that solution is only feasible on Windows. In order to do that in a Unix terminal:

You would have to print the characters you want to measure in the same screen area where your application is being drawn. For example: if you tried to print an emoji, then measure the cursor movement, then move the cursor back to its initial position, and then overwrite the emoji with the characters that used to be in that part of the screen, it is very likely that the terminal emulator would display the emoji on screen for some time.
Even if the characters being measured didn't appear on screen, or if that wasn't an issue, the performance of the whole process would be very poor.
Even if the above weren't an issue, the input stream used for reading the terminal state is the same that's used for reading user input. Therefore, you would have to either ignore user input while measuring text width, or write code that is able to keep the input events that are received while measuring text width. And then you would also have to consider the risk of waiting forever for an answer from the terminal...
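
To illustrate the third point, the sketch below shows the kind of code needed just to keep user input that arrives while a cursor position report is pending. It assumes the report has the form `ESC [ row ; col R`; `SplitInput` and `splitCursorReport` are hypothetical names, and timeouts and partially received sequences are not handled.

```cpp
#include <cstddef>
#include <string>

struct SplitInput
{
    bool found = false;    // whether a cursor position report was present
    int column = 0;        // 1-based column taken from the report
    std::string userInput; // all other bytes, to be handled as normal input
};

// Reads a decimal number starting at s[i]; returns false if there is none.
static bool readNumber(const std::string &s, std::size_t &i, int &value)
{
    std::size_t begin = i;
    value = 0;
    while (i < s.size() && s[i] >= '0' && s[i] <= '9')
        value = value * 10 + (s[i++] - '0');
    return i > begin;
}

// Looks for the first ESC [ <row> ; <col> R in `buffer` and separates it
// from the surrounding bytes, which are preserved in their original order.
SplitInput splitCursorReport(const std::string &buffer)
{
    SplitInput result;
    std::size_t start = buffer.find("\x1b[");
    while (start != std::string::npos)
    {
        std::size_t i = start + 2;
        int row = 0, col = 0;
        if (readNumber(buffer, i, row) && i < buffer.size() && buffer[i] == ';'
            && readNumber(buffer, ++i, col) && i < buffer.size() && buffer[i] == 'R')
        {
            result.found = true;
            result.column = col;
            result.userInput = buffer.substr(0, start) + buffer.substr(i + 1);
            return result;
        }
        start = buffer.find("\x1b[", start + 1);
    }
    result.userInput = buffer; // no report yet: everything is user input
    return result;
}
```

Even this is not fully reliable: in some terminals a function key pressed with modifiers can produce a sequence of the same shape as the report, so the two cannot always be told apart.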

So, in my opinion, you would end up with a poor experience for both the user and the programmer.

unxed commented 2 months ago

Why not output chars outside the visible area, above it?

magiblot commented 2 months ago

I haven't actually tried that. But I suspect that drawing outside the visible area only makes sense if there is a scrollback area. Turbo Vision uses the alternate screen buffer, which results in scrollback being disabled in most terminal emulators. At this point, I expect that the terminal won't allow the cursor to be moved out of bounds. So, it looks to me that such a strategy is likely not to work in many terminal emulators, and therefore it would not be portable. Besides the fact that it would tackle just the first of the problems I mentioned.

unxed commented 2 months ago

Could those problems be solved using the atomic updates proposal described here: https://gitlab.com/gnachman/iterm2/-/wikis/synchronized-updates-spec ?

magiblot commented 2 months ago

Synchronized updates allow you to avoid having the characters you are trying to measure shown on screen. But that just solves the first issue I mentioned. In addition, it introduces more steps to the process of measuring a character's width, so the performance may be even worse depending on how you use it.
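
As a rough sketch of what such a measurement could look like, the function below builds the bytes for one measurement wrapped in a synchronized update. It assumes the begin/end markers are `DCS = 1 s ST` and `DCS = 2 s ST`, which is my reading of the linked spec (some terminals use private mode 2026 instead), and it uses the common DECSC/DECRC and DSR 6 sequences. `buildMeasurement` and its parameters are hypothetical.

```cpp
#include <string>

// Begin/end markers, assumed from the linked spec: DCS = 1 s ST and DCS = 2 s ST.
const std::string beginSync = "\x1bP=1s\x1b\\";
const std::string endSync   = "\x1bP=2s\x1b\\";

// Builds the bytes for measuring `text` without showing it on screen:
// begin update, save cursor, print text, ask for the cursor position,
// restore cursor, repaint the cells that were overwritten, end update.
std::string buildMeasurement(const std::string &text, const std::string &previousContent)
{
    return beginSync
         + "\x1b" "7"   // DECSC: save cursor position
         + text
         + "\x1b[6n"    // DSR 6: request cursor position report
         + "\x1b" "8"   // DECRC: restore cursor position
         + previousContent
         + endSync;
}
```

The reply still has to be read from the input stream afterwards, so this only hides the measured characters. It does nothing about latency or about interleaved user input.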

As absurd as it may sound, I think the only way around the issues I mentioned is to tackle this issue the other way around, and have the client application tell the terminal whether the characters it is printing should be displayed as single- or double-width. This way the application's expectations would match the actual display, so there could be no screen garbling because of width mismatches. And since this would just require a one-way communication from the application to the terminal, it would avoid the performance penalty of having to wait for replies from the terminal, and it would also avoid any conflicts with user input.

unxed commented 2 months ago

That sounds pretty reasonable! Could you offer a draft standard? It could be implemented, for example, in far2l built-in terminal or maybe in kitty if the author is willing to do so.

unxed commented 2 months ago

By the way, since we are talking about Unicode support. Could you please tell me if grapheme clusters can have varying displayed widths depending on neighboring grapheme clusters? Or is the width of a grapheme cluster a fixed value? I haven't been able to figure this out yet, maybe you know? Thank you!

magiblot commented 2 months ago

Hi @unxed. Regarding your question on grapheme clusters, I don't know much about these details of the Unicode specification, so I can't help you. All I know is that I cannot expect the average terminal emulator to be fully compliant with Unicode. Not just because some terminals never intended to be compliant in the first place, but also because the specification evolves over time and implementations may become outdated (e.g. the internet is full of different implementations of wcwidth each of which adheres to a different version of Unicode).

A clear example of this was the behaviour of Kate's embedded terminal widget by the time I wrote the following comment: https://github.com/magiblot/tvision/issues/26#issuecomment-719964250

Does Turbo Vision need to delegate Unicode processing to an external library? Actually, it doesn't. Turbo Vision is not a text editing component. What it needs to know is how text is displayed on the terminal, and this is platform-dependent, while the Unicode standard is not. So it doesn't help me at all to know that "👨‍👩‍👧‍👦" is a grapheme cluster if the terminal will display it differently.

That's why I think that attempting to solve the issue of character widths by focusing on Unicode standard compliance is not the best idea. For this to work, both the application and the terminal emulator would have to either implement this complex Unicode logic themselves or rely on third-party dependencies that implement it. Even if they did so, their Unicode support would inevitably be at risk of becoming outdated, unless both these programs and/or the systems where they run kept receiving updates.

Considering that one of the main points of text-based applications is portability (e.g. being able to run in a remote host), it seems to me that tackling this issue in this way would be senseless.

Having the client application ask the terminal the width of text is a possible solution, but it will only work performantly in specific scenarios with very low latency. It takes at least 1 write operation and 1 read operation to measure the width of one character; the time it will take you to complete the whole process is proportional to the latency of the connection between the client and the terminal and to the amount of characters you need to measure, and therefore this is clearly not viable in many cases.
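
The cost argument above can be made concrete with a small calculation. The sketch assumes one full round trip per measured character and nothing else; the function name is made up for illustration.

```cpp
// Lower bound on the time needed to measure `count` characters one by one,
// assuming each measurement costs one round trip between the application
// and the terminal.
double minMeasurementSeconds(int count, double roundTripMs)
{
    return count * roundTripMs / 1000.0;
}
```

For an 80x25 screen (2000 cells) over a connection with a 50 ms round trip, `minMeasurementSeconds(2000, 50.0)` gives 100 seconds. Caching the width of each distinct character would reduce the count, but the first appearance of every new character would still cost a round trip.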

A serious proposal for the solution I mentioned in my previous comment would require taking a lot of things into account, since this is not just about single characters.

For example, a client application may want to ensure that displaying "👨‍👩‍👧‍👦" (consisting of 7 Unicode codepoints: four emoji joined by three zero-width joiners) will occupy just two screen columns. The terminal may not know how to render this grapheme cluster properly (as in the previous example of Kate's embedded terminal), so inevitably these characters won't be displayed the way the application expected, but they should still occupy exactly two screen columns so that the application's layout is not messed up.

For a standard proposal to be effective in solving this issue, it should provide clear hints for terminal emulator developers on how to handle this situation and many other ones. But I have never developed a terminal emulator and I am not familiar with font rendering, so I have no idea what it makes sense to ask the terminal emulator to do and what it doesn't.

unxed commented 2 months ago

Taking into account everything we've discussed, the only solution that comes to mind is to pass a set of rules (describing how to split a string into grapheme clusters and determine the width of these clusters) from the terminal to the application (or vice versa) at app start. This seems workable because the logic of the Unicode standard doesn't appear to be actively changing between versions; new characters are just being added.

Currently, such rules are usually statically compiled into the application. If they are made dynamically loadable, this could solve the issue, although it would result in a slight delay when launching the application. As for terminal support, we could experiment with this in the built-in far2l terminal, and if we find a sustainable solution, we could propose it to other developers.

What do you think of this approach?

unxed commented 2 months ago

Here's another idea. Perhaps we could develop a protocol that allows the terminal and the application to agree on the highest Unicode standard version they both support and then operate using that version. If this protocol isn't supported, we could fall back to the current approach.

magiblot commented 2 months ago

The point of my suggestion was that things should be made as simple as possible for both the client application and the terminal emulator.

Turbo Vision currently uses the system-provided wcwidth function on Unix systems (except on the Linux console, which works differently). Thus, Turbo Vision cannot know what version of Unicode is being taken into consideration (if any, because the implementation of wcwidth may be arbitrary in some systems), and the protocol you suggested in https://github.com/magiblot/tvision/issues/51#issuecomment-2360970698 for negotiating Unicode versions would not help. It could work if it was reasonable to expect the average text-based application to be fully aware of the Unicode version it's using when deciding the width of its characters, and then the terminal emulator should be up-to-date and support many different Unicode versions. I think this would be very difficult.

Similarly, I think that having the client application and the terminal emulator talk to each other about rules describing how to split a string into grapheme clusters and determine the width of these clusters does not sound much simpler. What would those rules be like? How much code would it take in the client application to support that?

When writing about my suggestion, I was thinking of something like this:

During application startup, the client emits an escape sequence which informs the terminal that "by default, none of the characters which I may print should be rendered as double-width".
The client then prints text consisting of characters that are not double-width, according to the client's understanding. (If the terminal understands that any of these characters is double-width, it shall render it in a single column anyway, either by making it smaller or by having adjacent columns overlap it).
When the client wants to print a double-width character, it emits an escape sequence informing the terminal that "the following text makes up a grapheme cluster of width 2", then it prints the character and a terminating escape sequence. (If the terminal understands that the character is just one column wide, it shall render it in two columns anyway, either by placing it right between two columns or by stretching it).
The same happens when the client wants to print a grapheme cluster consisting of multiple characters, and it wants to make sure that it will be rendered properly. (If the terminal does not know how to render such characters as a grapheme cluster, e.g. it would have rendered "👨👩👧👦" instead of "👨‍👩‍👧‍👦", then it may apply a workaround such as dropping excess characters).

This wouldn't ensure that all grapheme clusters are rendered properly, but it would prevent the client application's layout from messing up.
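
From the client's side, the steps above might look like the sketch below. No concrete escape sequences are defined anywhere in this thread, so the markers are left as caller-supplied placeholders; `WidthMarkers`, `Cluster` and `encodeClusters` are hypothetical names used only for illustration.

```cpp
#include <string>
#include <vector>

// Placeholder markers: the thread does not define concrete escape sequences.
struct WidthMarkers
{
    std::string defaultNarrow; // sent once at startup: "nothing is double-width by default"
    std::string beginWide;     // "the following text is one cluster of width 2"
    std::string endWide;       // terminates the cluster
};

struct Cluster
{
    std::string utf8; // one grapheme cluster, as the client understands it
    int width;        // 1 or 2 columns, as the client expects it to be displayed
};

// Encodes a run of clusters: single-width clusters are printed as they are,
// double-width clusters are wrapped in the begin/end markers.
std::string encodeClusters(const WidthMarkers &m, const std::vector<Cluster> &clusters)
{
    std::string out;
    for (const Cluster &c : clusters)
    {
        if (c.width == 2)
            out += m.beginWide + c.utf8 + m.endWide;
        else
            out += c.utf8;
    }
    return out;
}
```

The point of the design is that the client never reads anything back: it only writes, so there is no waiting on the terminal and no interference with user input.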

In addition, the terminal may have to reply to some of the escape sequences so that the client knows this feature is supported.

But, as I said, maybe implementing this in a terminal emulator is very complex and inconvenient. I don't know.

unxed commented 2 months ago

> The point of my suggestion was that things should be made as simple as possible for both the client application and the terminal emulator.

@elfmz can you please look into this? Can we support this experimental approach in far2l's VT?

o-sdn-o commented 1 month ago

> But, as I said, maybe implementing this in a terminal emulator is very complex and inconvenient. I don't know.

I tried to implement an approach in which the application informs the terminal about the size of the grapheme cluster on a per-cluster basis. And the terminal simply has to fit the grapheme cluster into the required matrix of cells.

You can play with it using the vtm built-in terminal (vtm -r term) on Windows (X11 support is not implemented yet).

Example 1. Output 3x1 character.

pwsh:
```
"👩‍👩‍👧‍👧`u{D0033}"
```

wsl/bash:
```
printf "👩‍👩‍👧‍👧\UD0033\n"
```

Example 2. Output 6x2 character.

pwsh:
```
"👩‍👩‍👧‍👧`u{D00C9}`n👩‍👩‍👧‍👧`u{D00F6}"
```

wsl/bash:
```
printf "👩‍👩‍👧‍👧\UD00C9\n👩‍👩‍👧‍👧\UD00F6\n"
```

Output:

The explicitly specified codepoint (joining modifier) is taken from the Unicode codepoint range 0xD0000-0xD02A2 (not yet allocated), and its value is encoded by the "wh_xy" literal value enumeration:

w: Character matrix width
h: Character matrix height
x: Horizontal fragment selector inside the matrix
y: Vertical fragment selector inside the matrix
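
The exact mapping from "wh_xy" to a codepoint is not spelled out in this comment. The sketch below uses a formula inferred from the three example codepoints above (D0033, D00C9, D00F6) and from the size of the range, so treat it as an assumption rather than the draft's definition: widths up to 8, heights up to 4, and a selector value of 0 apparently meaning "the whole extent". `geometryModifier` is a made-up name.

```cpp
// Assumed limits, inferred from the size of the range 0xD0000-0xD02A2.
constexpr int maxW = 8, maxH = 4;
// Number of distinct (w, x) combinations, including the unused w = 0 slot: 45.
constexpr int rowStride = maxW * (maxW + 1) / 2 + maxW + 1;

// Returns the modifier codepoint for "wh_xy", or 0 if the arguments are out
// of the assumed range. x and y may be 0 (no fragment selected) up to w and h.
char32_t geometryModifier(int w, int h, int x, int y)
{
    if (w < 1 || w > maxW || h < 1 || h > maxH || x < 0 || x > w || y < 0 || y > h)
        return 0;
    int horizontal = w * (w + 1) / 2 + x; // triangular index of (w, x)
    int vertical = h * (h + 1) / 2 + y;   // triangular index of (h, y)
    return static_cast<char32_t>(0xD0000 + horizontal + rowStride * vertical);
}
```

With this formula, `geometryModifier(3, 1, 0, 0)` gives 0xD0033, and `geometryModifier(6, 2, 0, 1)` and `geometryModifier(6, 2, 0, 2)` give 0xD00C9 and 0xD00F6, matching the two examples. The largest value, `geometryModifier(8, 4, 8, 4)`, is 0xD02A2, the end of the stated range.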

If you dive deeper, you can get the following things with rotation, mirroring and halves:

o-sdn-o commented 1 month ago

I've updated the draft: Unicode Character Geometry Modifiers

unxed commented 31 minutes ago

Btw, iTerm2 has an escape sequence to specify the Unicode version for characters, with detection: https://iterm2.com/documentation-escape-codes.html#:~:text=Unicode%20Version

Unicode Version

iTerm2 by default uses Unicode 9's width tables. The user can opt to use Unicode 8's tables with a preference (for backward compatibility with older locale databases). Since not all apps will be updated at the same time, you can tell iTerm2 to use a particular set of width tables with:

OSC 1337 ; UnicodeVersion=[n] ST

Where [n] is 8 or 9

You can push the current value on a stack and pop it off to return to the previous value by setting n to push or pop. Optionally, you may affix a label after push by setting n to something like push mylabel. This attaches a label to that stack entry. When you pop the same label, entries will be popped until that one is found. Set n to pop mylabel to effect this. This is useful if a program crashes or an ssh session ends unexpectedly.
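
A minimal sketch of how an application might build these sequences, based only on the description quoted above. It assumes OSC is `ESC ]` and that ST is written as `ESC \`; the helper names are hypothetical.

```cpp
#include <string>

// Builds OSC 1337 ; UnicodeVersion=<value> ST, with ST written as ESC backslash.
// `value` is "8", "9", "push", "pop", or "push <label>" / "pop <label>".
std::string unicodeVersionSequence(const std::string &value)
{
    return "\x1b]1337;UnicodeVersion=" + value + "\x1b\\";
}

// Saves the terminal's current setting under a label (hypothetical helper).
std::string pushUnicodeVersion(const std::string &label)
{
    return unicodeVersionSequence("push " + label);
}

// Restores the setting that was saved under the same label (hypothetical helper).
std::string popUnicodeVersion(const std::string &label)
{
    return unicodeVersionSequence("pop " + label);
}
```

An application could write `pushUnicodeVersion("myapp")` followed by `unicodeVersionSequence("9")` at startup and `popUnicodeVersion("myapp")` on exit, so that the user's previous setting is restored.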

magiblot / tvision

Konsole: overly wide Unicode characters mess up intended layouts #51