Unicode and zero width characters

gerald-brandt commented 3 years ago

With zero-width characters being drawn as a question mark, I'm wondering how to display something like the images attached. In this case, the zero-width character should place a dot over the last symbol (image 1), but it is instead displayed as a single-column-wide question mark (image 2).

There is also extra spacing in image 2 that shouldn't be there.

Is there a way to get the string displayed properly?

This is the string "\xe0\xa4\x95\xe0\xa4\xbe\xe0\xa4\x9a\xe0\xa4\x82\x0a"
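For reference, the bytes above decode to the code points U+0915, U+093E, U+091A, U+0902 and U+000A. U+0902 (DEVANAGARI SIGN ANUSVARA) is the nonspacing mark that should render as the dot. The following is a minimal sketch of that decoding; `decodeUtf8` is a hypothetical helper, not part of Turbo Vision, and it assumes well-formed input.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Decode a UTF-8 byte string into code points.
// Minimal sketch: assumes well-formed input and performs no validation.
std::vector<char32_t> decodeUtf8(const std::string &s)
{
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < s.size())
    {
        unsigned char b = static_cast<unsigned char>(s[i]);
        // The lead byte tells how many bytes the sequence has.
        std::size_t len = b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
        assert(i + len <= s.size()); // sketch only: input must not be truncated
        // Keep the payload bits of the lead byte.
        char32_t cp = len == 1 ? b : b & (0xFF >> (len + 1));
        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k < len; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

Calling `decodeUtf8` on the string from this comment yields five code points, the fourth of which is the combining mark.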

magiblot commented 3 years ago

Hi Gerald!

Thanks for the question.

My top priority is to prevent the terminal display from becoming garbled. When I added Unicode support, since I was not sure how to deal with these combining characters, I decided to turn them into a question mark, as explained in README.md#Text display rules.

So this is not supported at the moment, but I may be able to figure something out.

gerald-brandt commented 3 years ago

Combining characters overlay the previous character, hence the zero width.

Where in the code would I find this?

What about the space between the glyphs that shouldn't be there? Is this because of the column style layout?

magiblot commented 3 years ago

Hi Gerald,

> What about the space between the glyphs that shouldn't be there? Is this because of the column style layout?

I don't know what may be causing this. There is nothing special I do to add space between glyphs. This may be caused by the terminal application itself.

> Where in the code would I find this?

Okay. So I created a data struct representing the contents of a displayed cell, called TScreenCell (include/tvision/scrncell.h) and a data struct storing the text of a cell in UTF-8, called TCellChar (same header).

The screen is represented by a grid of TScreenCells. If a character is two columns wide, then this corresponds to two consecutive cells in the grid.
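As a rough illustration of this grid model, here is a sketch in which a two-column character occupies a lead cell plus a placeholder cell. `Cell`, `wideTrail` and `putChar` are hypothetical names chosen for this example; they are not the actual definitions in scrncell.h.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in, not the real TScreenCell/TCellChar.
struct Cell
{
    char text[4];   // UTF-8 bytes of one code point, zero-padded
    bool wideTrail; // true if this cell is the second column of a wide character
};

// Write one character of the given width (1 or 2 columns) at column x of a row.
// Returns the column after the character, or x unchanged if it does not fit.
std::size_t putChar(std::vector<Cell> &row, std::size_t x,
                    const char *utf8, std::size_t len, std::size_t width)
{
    assert(len >= 1 && len <= 4 && (width == 1 || width == 2));
    if (x + width > row.size())
        return x;
    Cell lead {};
    std::memcpy(lead.text, utf8, len);
    row[x] = lead;
    if (width == 2)
    {
        // The second column carries no text of its own.
        Cell trail {};
        trail.wideTrail = true;
        row[x + 1] = trail;
    }
    return x + width;
}
```

With this layout, a zero-width character has no cell of its own to go into, which is why it needs special handling.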

The current limitation is that TCellChar holds just one character, whereas in reality a cell could contain a sequence of code points of arbitrary length.

The functions in the TText namespace (include/tvision/ttext.h) deal with text processing and TScreenCell initialization. The functions that rely directly on text width are TText::eat and TText::next. In fact, TText::eat is where zero-width characters are replaced with �.

So, in order to support combining characters, the following has to change:

- The data structures in scrncell.h. This will involve undoing some optimizations such as triviality. Also much of BufferedDisplay (source/platform/buffdisp.cpp), which manipulates cells directly.
- The logic in ttext.h functions.

Cheers.

bormant commented 3 years ago

These may be related: http://www.unicode.org/reports/tr29/ (section "3 Grapheme Cluster Boundaries") and https://en.wikipedia.org/wiki/Combining_character.

gerald-brandt commented 3 years ago

It definitely is grapheme based. The combining character is part of the grapheme. It sounds like the current implementation is codepoint based, which is almost always the wrong way to do it, but so, so easy. It would be nice if it was as simple as changing a TCellChar into a string.

If utf8proc http://juliastrings.github.io/utf8proc/doc/utf8proc_8h.html could be brought in to do all the unicode handling behind the scenes, it would probably simplify things.

magiblot commented 3 years ago

Thank you everyone for your suggestions.

I tried replacing TCellChar with std::string, and it was a disaster. Turbo Vision likes to keep intermediary screen buffers, and has to move them around several times before data is printed to screen. So in a single screen flush, the TCellChar constructor can be invoked millions of times. For this reason the current implementation relies strongly on TCellChar, TScreenCell and related structs being small and trivial, so that they occupy contiguous memory locations and can be copied with memcpy.
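The difference between the two representations can be sketched as follows. `TrivialCell` and `StringCell` are hypothetical types for illustration, not the real structs: the first can be copied in bulk with memcpy, while the second needs a copy-constructor call (and possibly a heap allocation) per cell.

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <type_traits>
#include <vector>

// Hypothetical cell types for illustration only.
struct TrivialCell
{
    char text[4];
    unsigned short attr;
};

struct StringCell
{
    std::string text;
    unsigned short attr;
};

static_assert(std::is_trivially_copyable<TrivialCell>::value,
              "can be copied with memcpy");
static_assert(!std::is_trivially_copyable<StringCell>::value,
              "needs a copy constructor call per element");

// Copy a whole row of trivial cells in a single call.
void copyRow(std::vector<TrivialCell> &dst, const std::vector<TrivialCell> &src)
{
    assert(dst.size() >= src.size());
    if (!src.empty())
        std::memcpy(dst.data(), src.data(), src.size() * sizeof(TrivialCell));
}
```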

You could argue that I'm coupling the system with an implementation detail, or doing premature optimization. But the truth is that representing each cell with an individual string is not a good solution to this problem. I'm pretty sure not even GUI applications store text this way.

Does Turbo Vision need to delegate Unicode processing to an external library? Actually, it doesn't. Turbo Vision is not a text editing component. What it needs to know is how text is displayed on the terminal, and this is platform-dependent, while the Unicode standard is not. So it doesn't help me at all to know that "👨‍👩‍👧‍👦" is a grapheme cluster if the terminal will display it differently:

[Image: Screenshot_20201031_172043]

Even if it's true that an arbitrary number of codepoints can fit in a single cell, I realized that:

- In real-world use cases of natural language, you rarely ever need more than two or three combining characters together.
- Common cases where lots of combining characters are used are:
  - Emojis, which as the picture above shows, are usually not grouped together by terminal applications.
  - Zalgo text, which I don't care about.

So what I did was:

- Resize TCellChar from 4 to 12 bytes, making it capable of holding several codepoints encoded in UTF-8.
- Change the logic in TText::eat so that zero-width characters are combined with the previous cell. If the TCellChar in the cell is full and no more text fits in it, nothing fatal happens: the character is simply discarded and won't be printed on screen.
- Always discard the ZERO WIDTH JOINER character, which causes emojis to get combined on a few terminals (e.g. Kitty). This ensures text is displayed in a predictable way.
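The three steps above can be sketched roughly as follows. This is not the actual TText::eat code: `CellText`, `toyWidth`, `encodeUtf8` and `layout` are hypothetical names, the width table is a toy stand-in for a real lookup such as wcwidth, and dropping a zero-width character that has no previous cell is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical 12-byte cell text, not the real TCellChar.
struct CellText
{
    char bytes[12]; // UTF-8, zero-padded

    std::size_t length() const
    {
        std::size_t n = 0;
        while (n < sizeof(bytes) && bytes[n] != '\0')
            ++n;
        return n;
    }

    // Append UTF-8 bytes if they fit; otherwise leave the cell untouched.
    bool append(const std::string &utf8)
    {
        std::size_t n = length();
        if (n + utf8.size() > sizeof(bytes))
            return false;
        std::memcpy(bytes + n, utf8.data(), utf8.size());
        return true;
    }
};

// Toy width table standing in for wcwidth(); covers only what the demo needs.
int toyWidth(char32_t cp)
{
    if (cp == 0x200D) return 0;                 // ZERO WIDTH JOINER
    if (cp >= 0x0300 && cp <= 0x036F) return 0; // Combining Diacritical Marks
    if (cp == 0x0902) return 0;                 // DEVANAGARI SIGN ANUSVARA
    return 1;
}

std::string encodeUtf8(char32_t cp)
{
    std::string s;
    if (cp < 0x80)
        s += static_cast<char>(cp);
    else if (cp < 0x800)
    {
        s += static_cast<char>(0xC0 | (cp >> 6));
        s += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp < 0x10000)
    {
        s += static_cast<char>(0xE0 | (cp >> 12));
        s += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        s += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else
    {
        s += static_cast<char>(0xF0 | (cp >> 18));
        s += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        s += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        s += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return s;
}

// Turn a sequence of code points into cells, one cell per column.
std::vector<CellText> layout(const std::u32string &text)
{
    std::vector<CellText> cells;
    for (char32_t cp : text)
    {
        if (cp == 0x200D)
            continue; // always discard ZERO WIDTH JOINER
        if (toyWidth(cp) == 0)
        {
            // Combine with the previous cell; silently drop if it is full
            // (or, in this sketch, if there is no previous cell).
            if (!cells.empty())
                cells.back().append(encodeUtf8(cp));
            continue;
        }
        CellText c {};
        bool ok = c.append(encodeUtf8(cp));
        assert(ok);
        (void) ok;
        cells.push_back(c);
    }
    return cells;
}
```

Because zero-width characters never add a cell, the number of cells still equals the sum of the widths of the individual characters.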

This preserves the already present assumptions, the most important of which is that the width of a string is the sum of the width of its characters. The performance impact of this feature is also minimal, because TCellChar is still trivial and is 4-byte-aligned.

No changes are required in the source code of Turbo Vision applications, except those using TText::eat or TText::next directly (the only one I am aware of is Turbo, which I maintain myself).

[Image: Screenshot_20201031_175358]

[Image: Screenshot_20201031_180341]

Terminals which do not respect the result of wcwidth will suffer from screen garbling. This is the case with Hangul Jamo:

"ᅥ ᅦ ᅧ ᅨ ᅩ ᅪ ᅫ ᅬ ᅭ ᅮ ᅯ ᅰ ᅱ ᅲ ᅳ ᅴ ᅵ ᅶ ᅷ ᅸ ᅹ ᅺ ᅻ ᅼ ᅽ ᅾ ᅿ ᆀ ᆁ ᆂ ᆃ ᆄ

wcwidth for each of these characters is 0, so I'd expect them to combine with the space before them, but many terminals (Konsole, GNOME Terminal...) display them as standalone characters. Xterm and Alacritty satisfy my expectations.

Should Turbo Vision use an external Unicode library to determine that these characters have a width of 1? Tilde is another application with good Unicode support. It treats these characters as one column wide instead of zero. Guess what, it suffers from screen garbling on Xterm and Alacritty. So you can see how difficult it is to get this right.

I suggest you upgrade to the latest commit and try again. The Turbo text editor has also been updated.

At this point, the most improvable thing is string iteration with TText::next and TText::prev, which is still codepoint-based. So when navigating text with arrow keys, you will see the cursor stop at every combining character. But this doesn't worry me as much.
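To illustrate the difference, here is a sketch contrasting a codepoint-based step with one that also skips trailing zero-width code points. It operates on already-decoded code points rather than on UTF-8 bytes, and `isZeroWidth`, `nextCodepoint` and `nextCell` are hypothetical helpers with a toy width test, not the TText API.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Toy zero-width test standing in for a real width lookup such as wcwidth().
bool isZeroWidth(char32_t cp)
{
    return (cp >= 0x0300 && cp <= 0x036F) || cp == 0x0902;
}

// Codepoint-based step: the cursor stops at every code point.
std::size_t nextCodepoint(const std::u32string &s, std::size_t i)
{
    assert(i <= s.size());
    return i < s.size() ? i + 1 : i;
}

// Cell-based step: also skip the zero-width code points that follow,
// so the cursor never stops in the middle of a combined cell.
std::size_t nextCell(const std::u32string &s, std::size_t i)
{
    i = nextCodepoint(s, i);
    while (i < s.size() && isZeroWidth(s[i]))
        ++i;
    return i;
}
```

For "e" followed by U+0301 and "x", the codepoint-based step stops between "e" and the accent, while the cell-based step moves straight to "x".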

Cheers.

unxed commented 3 years ago

> It definitely is grapheme based. The combining character is part of the grapheme. It sounds like the current implementation is codepoint based, which is almost always the wrong way to do it, but so, so easy. It would be nice if it was as simple as changing a TCellChar into a string.

Users don't care about graphemes and code points. Users do care about their experience. They just want to have all letters/signs required by their language working :)

Perhaps the limit on the number of code points per screen cell will matter in the future, if real-world problems arise that can only be solved with many code points per cell. But history shows that looking too far into the future is not always the best option. Microsoft decided to look into the future by choosing UTF-16 as the standard for the Windows API, and now they live with the most awkward Unicode representation of all.

gerald-brandt commented 3 years ago

This looks good in my quick tests. Thanks for the work!

magiblot / tvision

Unicode and zero width characters #26