gerald-brandt opened 3 years ago
Hi Gerald!
Thanks for the question.
My top priority is to prevent the terminal display from becoming garbled. When I added Unicode support, since I was not sure how to deal with these combining characters, I decided to turn them into a question mark, as explained in README.md#Text display rules.
So this is not supported at the moment, but I may be able to figure something out.
Combining characters overlay the previous character, hence the zero width.
Where in the code would I find this?
What about the space between the glyphs that should be there? Is this because of the column style layout?
Hi Gerald,
> What about the space between the glyphs that should be there? Is this because of the column style layout?
I don't know what may be causing this. There is nothing special I do to add space between glyphs. This may be caused by the terminal application itself.
> Where in the code would I find this?
Okay. So I created a data struct representing the contents of a displayed cell, called `TScreenCell` (`include/tvision/scrncell.h`), and a data struct storing the text of a cell in UTF-8, called `TCellChar` (same header).
The screen is represented by a grid of `TScreenCell`s. If a character is two columns wide, then this corresponds to two consecutive cells in the grid.
The current limitation is that `TCellChar` allows for just one character, where in reality it could be a sequence of UTF-8 codepoints of arbitrary length.
The functions in the `TText` namespace (`include/tvision/ttext.h`) deal with text processing and `TScreenCell` initialization. The functions that rely directly on text width are `TText::eat` and `TText::next`. In fact, `TText::eat` is where zero-width characters are replaced with `�`.
So, in order to support combining characters, the following has to change:

- `scrncell.h`. This will involve undoing some optimizations such as triviality. Also much of `BufferedDisplay` (`source/platform/buffdisp.cpp`), which manipulates cells directly.
- The `ttext.h` functions.

Cheers.
http://www.unicode.org/reports/tr29/, section "3 Grapheme Cluster Boundaries", and https://en.wikipedia.org/wiki/Combining_character may be related to this.
It definitely is grapheme based. The combining character is part of the grapheme. It sounds like the current implementation is codepoint based, which is almost always the wrong way to do it, but so, so easy. It would be nice if it was as simple as changing a TCellChar into a string.
If utf8proc http://juliastrings.github.io/utf8proc/doc/utf8proc_8h.html could be brought in to do all the unicode handling behind the scenes, it would probably simplify things.
Thank you everyone for your suggestions.
I tried replacing `TCellChar` with `std::string`, and it was a disaster. Turbo Vision likes to keep intermediary screen buffers, and has to move them around several times before data is printed to screen. So in a single screen flush, the `TCellChar` constructor can be invoked millions of times. For this reason the current implementation relies strongly on `TCellChar`, `TScreenCell` and related structs being small and trivial, so that they occupy contiguous memory locations and can be copied with `memcpy`.
You could argue that I'm coupling the system with an implementation detail, or doing premature optimization. But the truth is that representing each cell with an individual string is not a good solution to this problem. I'm pretty sure not even GUI applications store text this way.
Does Turbo Vision need to delegate Unicode processing to an external library? Actually, it doesn't. Turbo Vision is not a text editing component. What it needs to know is how text is displayed on the terminal, and this is platform-dependent, while the Unicode standard is not. So it doesn't help me at all to know that "👨👩👧👦" is a grapheme cluster if the terminal will display it differently:
Even if it's true that an arbitrary number of codepoints can fit in a single cell, I realized that:
So what I did was:

- Enlarge `TCellChar` from 4 to 12 bytes, making it capable of holding several codepoints encoded in UTF-8.
- Modify `TText::eat` so that zero-width characters are combined with the previous cell. If the `TCellChar` in the cell is full and no more text fits in it, nothing fatal happens: the character is simply discarded and won't be printed on screen.

This preserves the already present assumptions, the most important of which is that the width of a string is the sum of the widths of its characters. The performance impact of this feature is also minimal, because `TCellChar` is still trivial and is 4-byte-aligned.
No changes are required in the source code of Turbo Vision applications, except those using `TText::eat` or `TText::next` directly (the only one I am aware of is Turbo, which I maintain myself).
Terminals which do not respect the result of `wcwidth` will suffer from screen garbling. This is the case of Hangul Jamo:

"ᅥ ᅦ ᅧ ᅨ ᅩ ᅪ ᅫ ᅬ ᅭ ᅮ ᅯ ᅰ ᅱ ᅲ ᅳ ᅴ ᅵ ᅶ ᅷ ᅸ ᅹ ᅺ ᅻ ᅼ ᅽ ᅾ ᅿ ᆀ ᆁ ᆂ ᆃ ᆄ"

`wcwidth` for each of these characters is 0, so I'd expect them to combine with the space before them, but many terminals (Konsole, GNOME Terminal...) display them as standalone characters. Xterm and Alacritty satisfy my expectations.
Should Turbo Vision use an external Unicode library to determine that these characters have a width of 1? Tilde is another application with good Unicode support. It treats these characters as one column wide instead of zero. Guess what, it suffers from screen garbling on Xterm and Alacritty. So you can see how difficult it is to get this right.
I suggest you upgrade to the latest commit and try again. The Turbo text editor has also been updated.
At this point, the most improvable thing is string iteration with `TText::next` and `TText::prev`, which is still codepoint-based. So when navigating text with arrow keys, you will see the cursor stop at every combining character. But this doesn't worry me as much.
Cheers.
> It definitely is grapheme based. The combining character is part of the grapheme. It sounds like the current implementation is codepoint based, which is almost always the wrong way to do it, but so, so easy. It would be nice if it was as simple as changing a TCellChar into a string.
Users don't care about graphemes and code points. Users do care about their experience. They just want to have all letters/signs required by their language working :)
Perhaps limiting the number of code points per screen cell may play a role in the future, if real-world problems arise that can only be solved by allowing many code points per cell. But history shows that looking too far into the future is not always the best option. Microsoft decided to look into the future by choosing UTF-16 as the standard for their Winapi, and now they live with the most awkward Unicode representation of all.
This looks good in my quick tests. Thanks for the work!
With zero-width characters being drawn as a question mark, I'm wondering how to display something like the images attached. In this case, the zero-width character should place a dot over the last symbol (image 1), but instead a single-column-wide question mark is displayed (image 2).
There is also extra spacing in image 2 that shouldn't be there.
Is there a way to get the string displayed properly?
This is the string "\xe0\xa4\x95\xe0\xa4\xbe\xe0\xa4\x9a\xe0\xa4\x82\x0a"