Closed Cl00e9ment closed 2 years ago
The width in cells of grapheme clusters comes from the Unicode standard. Your example 2 is a variation selector changing the presentation of the preceding codepoint from text to emoji. Emoji are rendered in two cells in terminals. I have no clue about Hangul, so I can't explain your last example to you. You would need to ask the makers of the Unicode standard. See gen-wcwidth.py in kitty for how the functions to determine width are generated from the standard.
Thanks for the clarifications.
The width in cells of grapheme clusters comes from the Unicode standard.
If I'm not mistaken, Unicode doesn't specify the width of grapheme clusters. It's up to the implementation to define this. UAX #11 differentiates narrow and wide characters in East Asian text, but that's all.
So, are you sure that the third example (see below) isn't a Kitty bug but an issue with the Unicode standard?
echo -e "0123456789\n>\u1100\u1161\u11A8<"
I understand that gen-wcwidth.py assigns a cell width to each scalar value, but I don't see where this comes directly from the Unicode standard (apart from East Asian scalar values).
They are width one unless they are emoji (width two) or combining marks (width zero), with some slight subtleties; read the source of gen-wcwidth.py. It's a pretty simple rule.
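The rule described above can be sketched with Python's standard `unicodedata` module. This is only an illustration under simplifying assumptions, not the table that gen-wcwidth.py generates: the names `naive_cell_width` and `naive_string_width` are hypothetical, wide characters are detected through the East Asian Width property alone, and emoji presentation via variation selectors is not modeled.

```python
import unicodedata


def naive_cell_width(ch: str) -> int:
    """Width of one scalar value under a simplified rule (hypothetical helper)."""
    # Nonspacing and enclosing combining marks occupy no cell of their own.
    if unicodedata.category(ch) in ("Mn", "Me"):
        return 0
    # East Asian Wide and Fullwidth characters occupy two cells.
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    # Everything else is assumed to occupy one cell.
    return 1


def naive_string_width(text: str) -> int:
    """Sum the per-scalar widths, which is what a per-codepoint table implies."""
    return sum(naive_cell_width(ch) for ch in text)


if __name__ == "__main__":
    # Example 1 from the report: "g" followed by a combining diaeresis.
    print(naive_string_width("g\u0308"))
```

Because the widths are simply summed per scalar value, any cluster whose parts each carry a non-zero width ends up wider than the glyph that is actually drawn.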
Yes, but the rules defined in gen-wcwidth.py don't seem to work everywhere. The Hangul alphabet is an example of an edge case.
Maybe Hangul initial consonants should be given a size of 2, and medial vowels as well as final consonants should be given a size of 0. That's only a suggestion, as I've no idea of how Hangul works.
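A minimal sketch of that suggestion, assuming the conjoining jamo are identified purely by code point range (U+1100..U+115F for initial consonants, U+1160..U+11FF for medial vowels and final consonants). The function names are hypothetical and nothing here is taken from kitty. Assigning width 0 to U+1160..U+11FF is also the convention used in Markus Kuhn's widely copied wcwidth.c.

```python
import unicodedata


def jamo_aware_width(ch: str) -> int:
    """Per-scalar width with the suggested Hangul override (hypothetical helper)."""
    cp = ord(ch)
    # Initial consonants carry the full width of the syllable block.
    if 0x1100 <= cp <= 0x115F:
        return 2
    # Medial vowels and final consonants add nothing to the width.
    # The Hangul Jamo Extended-A/B blocks are ignored in this sketch.
    if 0x1160 <= cp <= 0x11FF:
        return 0
    if unicodedata.category(ch) in ("Mn", "Me"):
        return 0
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1


def jamo_aware_string_width(text: str) -> int:
    return sum(jamo_aware_width(ch) for ch in text)


if __name__ == "__main__":
    # The third example from the report: initial + medial + final.
    print(jamo_aware_string_width("\u1100\u1161\u11A8"))
```

One consequence of this choice is that a medial vowel or final consonant appearing on its own, without a preceding initial consonant, would be given no cell at all.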
OK I think that I understand what's happening.
UAX #11 gives a size of 2 to Hangul initial consonants (HIC) and a size of 1 to both Hangul medial vowels (HMV) and final consonants (HFC).
The problem is that, when one HIC is followed by one HMV and optionally one HFC, they merge together to form a single grapheme cluster. The widths are added together (2 + 1 = 3 or 2 + 1 + 1 = 4) instead of using a size of 2.
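The arithmetic above can be checked with Python's `unicodedata` module. The sketch assumes that any scalar value that is not East Asian Wide or Fullwidth gets width 1, which is a simplification and not necessarily what kitty does; `summed_width` is a hypothetical helper. It also uses NFC normalization to show that the same sequence composes to a single precomposed syllable that is Wide on its own.

```python
import unicodedata


def summed_width(text: str) -> int:
    """Add up per-scalar widths: 2 for Wide/Fullwidth, otherwise 1 (assumption)."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )


cluster = "\u1100\u1161\u11A8"  # initial consonant + medial vowel + final consonant

if __name__ == "__main__":
    # East Asian Width property of each scalar value in the cluster.
    for ch in cluster:
        print(f"U+{ord(ch):04X}", unicodedata.east_asian_width(ch))

    # Width obtained by summing the parts.
    print(summed_width(cluster))

    # NFC composes the three jamo into one precomposed Hangul syllable.
    composed = unicodedata.normalize("NFC", cluster)
    print(f"U+{ord(composed):04X}", summed_width(composed))
```

The composed and decomposed forms are canonically equivalent, so a terminal that wants both to occupy the same number of cells cannot rely on summing per-scalar widths.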
Someone will need to codify that, then, and publish it as a standard so terminal programs can rely on it.
I fully agree.
As a side note, this issue also affects emoji combinations built with a zero width joiner. Example with "face in clouds":
echo -e "0123456789\n>\U1F636\U1F32B\uFE0F<"
echo -e "0123456789\n>\U1F636\u200D\U1F32B\uFE0F<"
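To make the difference between the two commands concrete, here is a rough sketch that groups scalar values into units, gluing anything connected by a zero width joiner (and any trailing VS16) to the preceding scalar. It is a deliberately simplified stand-in for real grapheme cluster segmentation, and `split_zwj_units` is a hypothetical name.

```python
ZWJ = "\u200d"   # zero width joiner
VS16 = "\ufe0f"  # variation selector-16 (emoji presentation)


def split_zwj_units(text: str) -> list[str]:
    """Group scalars so that ZWJ-joined emoji form a single unit (simplified)."""
    units: list[str] = []
    join_next = False
    for ch in text:
        # A joiner or variation selector attaches to the previous unit,
        # and so does the scalar that directly follows a joiner.
        if units and (join_next or ch in (ZWJ, VS16)):
            units[-1] += ch
        else:
            units.append(ch)
        join_next = ch == ZWJ
    return units


if __name__ == "__main__":
    without_zwj = "\U0001F636\U0001F32B\uFE0F"
    with_zwj = "\U0001F636\u200D\U0001F32B\uFE0F"
    print(len(split_zwj_units(without_zwj)))  # two separate emoji
    print(len(split_zwj_units(with_zwj)))     # one joined emoji
```

A width function that works on such units, rather than on individual scalar values, can give the joined sequence the width of a single emoji instead of the sum of its parts.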
That is a bug; look at the open issue about it.
You're right, the size problem with combining emojis is a duplicate of #1978.
But beyond that, I think that the rendering issue with Hangul grapheme clusters is the same bug. Even if they aren't built using a zero width joiner like emoji combinations, the underlying problem is the same: multiple grapheme clusters that are each rendered with a specific size, but that, when put together, merge into a single grapheme cluster that takes less space than the sum of the original ones.
The difference is that for ZWJ + emoji there are well-defined rules accessible to me in the Unicode standard. For Hangul I have no clue. As I said, someone who understands Hangul will either need to codify those rules and publish them, or point out where in the standard they already exist in a form that can be converted to a wcswidth() implementation.
For the sake of understanding, here's the nomenclature that I'm using:
Describe the bug
Some grapheme clusters are rendered in a single cell, while others take multiple cells, sometimes leaving a large blank.
To Reproduce
example 1:
nb of grapheme clusters: 1
nb of scalar values: 2
nb of cells used for rendering: 1
echo -e "0123456789\n>\u0067\u0308<"
This is what is expected to happen: 1 grapheme cluster = 1 cell.
example 2:
nb of grapheme clusters: 1
nb of scalar values: 2
nb of cells used for rendering: 2
echo -e "0123456789\n>\u2600\ufe0f<"
This is not what I would expect (1 grapheme cluster = 1 cell), but maybe it's normal behavior. If this is normal, is there a set of rules that I can use to determine the nb of cells that a grapheme cluster will take for rendering?
example 3:
nb of grapheme clusters: 1
nb of scalar values: 3
nb of cells used for rendering: 4
echo -e "0123456789\n>\u1100\u1161\u11A8<"
This doesn't make any sense.
Environment details