ridiculousfish / widecharwidth

public domain wcwidth implementation

Shield emoji width regression #11

Closed vadi2 closed 3 years ago

vadi2 commented 3 years ago

The 🛡 emoji before, generated on 2020-03-21:

image

Width of 🛡 is reported as 2.

After, generated on 2021-04-17: image

Width of 🛡 is reported as 1, and makes the text overlap.

The only change is an update to the widechar_width file.

Sorry we didn't mention it earlier - things were hectic at the time.

faho commented 3 years ago

This... is correct. EastAsianWidth lists it as "neutral" width (shield is U+1F6E1):

1F6E0..1F6EA;N   # So    [11] HAMMER AND WRENCH..NORTHEAST-POINTING AIRPLANE

Your terminal is rendering wrong.
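To check a lookup like this mechanically, one entry of EastAsianWidth.txt can be parsed into a code point range and a width class. The sketch below is only an illustration: `WidthRange` and `parseWidthLine` are hypothetical names, not part of widecharwidth, and it handles just the `first..last;class # comment` shape quoted above.

```cpp
#include <cstdint>
#include <exception>
#include <string>

// Hypothetical record for one EastAsianWidth.txt entry.
struct WidthRange {
    std::uint32_t first;
    std::uint32_t last;
    std::string eaw;  // width class, e.g. "N", "Na", "W", "A"
};

// Remove leading and trailing spaces and tabs.
static std::string trimBlanks(const std::string& s) {
    std::size_t b = s.find_first_not_of(" \t");
    if (b == std::string::npos) return std::string();
    std::size_t e = s.find_last_not_of(" \t");
    return s.substr(b, e - b + 1);
}

// Parse a line such as "1F6E0..1F6EA;N   # So ...".
// Returns false for comment-only, blank, or malformed lines.
bool parseWidthLine(const std::string& line, WidthRange& out) {
    std::string data = line.substr(0, line.find('#'));  // drop the comment
    std::size_t semi = data.find(';');
    if (semi == std::string::npos) return false;
    std::string range = trimBlanks(data.substr(0, semi));
    std::string cls = trimBlanks(data.substr(semi + 1));
    if (range.empty() || cls.empty()) return false;
    std::size_t dots = range.find("..");
    try {
        out.first = std::stoul(range.substr(0, dots), nullptr, 16);
        out.last = (dots == std::string::npos)
                       ? out.first  // single code point entry
                       : std::stoul(range.substr(dots + 2), nullptr, 16);
    } catch (const std::exception&) {
        return false;  // not a hexadecimal code point
    }
    out.eaw = cls;
    return true;
}
```

Parsing the quoted line this way gives the range U+1F6E0 to U+1F6EA with class `N`, and U+1F6E1 falls inside it.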

vadi2 commented 3 years ago

I understand width 1 and 2, but "neutral" is not a number.

faho commented 3 years ago

From http://www.unicode.org/reports/tr11/:

Strictly speaking, it makes no sense to talk of narrow and wide for neutral characters, but because for all practical purposes they behave like Na, they are treated as narrow characters (the same as Na) under the recommendations below.

This means that "neutral" characters have width 1.
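Put as code, the TR11 recommendation quoted above amounts to a small mapping from width class to terminal columns. This is a sketch with hypothetical names (`EastAsianWidth`, `columnsFor`). Treating ambiguous characters as wide only in an East Asian context is my reading of TR11, not something this thread settles.

```cpp
// The six East Asian Width classes defined by TR11.
enum class EastAsianWidth { Neutral, Narrow, Halfwidth, Ambiguous, Wide, Fullwidth };

// Columns a character of the given class occupies.
// eastAsianContext only affects the Ambiguous class.
int columnsFor(EastAsianWidth w, bool eastAsianContext = false) {
    switch (w) {
        case EastAsianWidth::Wide:
        case EastAsianWidth::Fullwidth:
            return 2;
        case EastAsianWidth::Ambiguous:
            return eastAsianContext ? 2 : 1;
        case EastAsianWidth::Neutral:  // "treated as narrow characters (the same as Na)"
        case EastAsianWidth::Narrow:
        case EastAsianWidth::Halfwidth:
            return 1;
    }
    return 1;  // unreachable for valid enum values
}
```

With this mapping, `columnsFor(EastAsianWidth::Neutral)` is 1, which is the width now reported for the shield.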

faho commented 3 years ago

And before you ask, no, you probably don't want to treat neutral characters as wide. Here's some other examples of neutral characters:

© (U+A9) » (U+BB) À (U+C0)

I suggest you file a bug with whatever is handling the rendering and, until that's fixed, special-case U+1F6E1 in your code.

vadi2 commented 3 years ago

We're the ones building the rendering engine. Still looking for a clear answer on this since I see no relationship between the shield emoji and the copyright symbol with east asian characters... but we're getting there.

ridiculousfish commented 3 years ago

See 7e9dfdaf05059b3fff237a8619b6b4fb187570e7. My terminals do indeed render 🛡 as width 1, along with 🌶.

Perhaps this is the rationale:

Narrow (and neutral) Unicode characters always map to halfwidth characters

faho commented 2 years ago

@vadi2 I have now given an explanation for this in our README. In short:

An upgrade for you would be quite simple. It should be enough to add

if (unicode == 0x1F6E1) return 2;

to your getGraphemeWidth function here (and any other direct uses of widechar_wcwidth):

https://github.com/Mudlet/Mudlet/blob/475fdf127ff56d33dfb318230c2ebe2cfb76a2e3/src/TTextEdit.cpp#L651-L658

This allows you to override the width that widecharwidth decides - which is correct according to Unicode, but not your renderer.
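As a rough sketch of that override pattern, assuming the renderer routes every width lookup through one wrapper: `libraryWidth` below is a hypothetical stand-in for the library call (the real `widechar_wcwidth` lives in the generated header and its signature is not reproduced here), and `graphemeWidth` is a simplified placeholder for the Mudlet function.

```cpp
// Hypothetical stand-in for the library lookup. It only knows the two
// cases needed for this illustration; the real table is far larger.
int libraryWidth(char32_t c) {
    if (c == 0x1F600) return 2;  // GRINNING FACE, wide
    return 1;                    // everything else is narrow in this stub
}

// Renderer-side wrapper: apply local quirks first, then defer to the library.
int graphemeWidth(char32_t c) {
    if (c == 0x1F6E1) return 2;  // SHIELD: this renderer draws it two cells wide
    return libraryWidth(c);
}
```

Keeping the quirk in the wrapper leaves the generated table untouched, so it can be regenerated for newer Unicode versions without reapplying the change.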

Upgrading would allow you to gain support for Unicode 14 instead of 12.

Sorry we didn't get back to you earlier, things were hectic.

vadi2 commented 2 years ago

Thanks for the writeup!

jankatins commented 2 years ago

I guess there are more emojis than the shield which should be treated like that shield emoji on a terminal? E.g. this blog post says that terminals should treat all emoji representations as width 2 (in a terminal): https://darrenburns.net/posts/emoji-in-the-terminal/ with the example of \U0001F6E5 🛥 motorboat emoji.

Background: wezterm uses ridiculousfish/widecharwidth and in https://github.com/wez/wezterm/issues/1607 there is a discussion about whether the motorboat should always display as two chars or one char.

faho commented 2 years ago

I guess there are more emojis than the shield which should be treated like that shield emoji on a terminal?

@jankatins Okay, first let me be super clear: The shield should have width 1 in a terminal, because that's the width unicode says it has.

The context here is very specific. There are developers controlling both the renderer (essentially the "terminal") and the client (the app running in the terminal). They have a different need, so they add a quirk that makes the shield display at a non-standard width.

In a terminal context that's a horrible idea, because in a terminal context you don't control both applications, and both need to come up with the same width on their own (or you get awkward cursor glitches!). The only fighting chance you have of that is to go by the standard.

the example of \U0001F6E5 🛥 motorboat emoji

As best as I can tell, motorboat is also neutral, meaning it should also have width 1. If you do anything else, you are likely to break cursor movement if it appears.

Specifically, from the linked article:

This is incorrect in the case of Emoji Presentation Sequences - Unicode recommends they should be always treated as "East Asian Wide"

See emoji-data.txt. U+1F6E5 is not listed as having "Emoji_Presentation". Instead, the range from U+1F6E0 to U+1F6E5 is listed only as "Emoji" and in emoji-sequences.txt they are always listed along with the U+FE0F variation selector as "Basic_Emoji". This leads us to believe that the default presentation for them is text presentation, meaning that they should have width 1. The "emoji presentation sequence" here is U+1F6E5 U+FE0F - both together!

Compare e.g. U+1F600 (😀), which is listed as having "Emoji_Presentation" and is by itself listed as a "Basic_Emoji". (Yes, the unicode data file format is a mess and changes too often, and there's no great explanation for any of it. That link to unicode.org seems to be a link for a consumer-facing emoji presentation, and doesn't appear to have any impact on the actual presentation. It's inaccurate: the sequence should be U+1F6E5 U+FE0F.)

It is of course possible that this reading is wrong, but in that case the solution is emphatically not to "treat them like the shield emoji" and quirk them out. The solution is to fix the interpretation of the standard and find a general answer.
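The reading above can be sketched in code, under loud assumptions: `baseWidth` and `isTextDefaultEmoji` are hypothetical stubs covering only the code points named in this thread, and giving the `base + U+FE0F` sequence width 2 follows the linked article's reading of the Unicode recommendation, which not every terminal implements.

```cpp
#include <cstddef>
#include <string>

constexpr char32_t kTextSelector = 0xFE0E;   // VS15, text presentation
constexpr char32_t kEmojiSelector = 0xFE0F;  // VS16, emoji presentation

// Hypothetical stub for the per-code-point table lookup.
int baseWidth(char32_t c) {
    if (c == 0x1F600) return 2;  // has Emoji_Presentation, wide by itself
    return 1;                    // narrow or neutral in this stub
}

// Hypothetical stub: the range this thread describes as "Emoji" only,
// without "Emoji_Presentation".
bool isTextDefaultEmoji(char32_t c) {
    return c >= 0x1F6E0 && c <= 0x1F6E5;
}

// Width of a string where a text-default emoji followed by U+FE0F is
// counted as one two-column emoji presentation sequence.
int displayWidth(const std::u32string& s) {
    int total = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        char32_t c = s[i];
        if (c == kTextSelector || c == kEmojiSelector) continue;  // selectors take no columns
        bool emojiSequence = isTextDefaultEmoji(c) && i + 1 < s.size() &&
                             s[i + 1] == kEmojiSelector;
        total += emojiSequence ? 2 : baseWidth(c);
    }
    return total;
}
```

Under these assumptions a bare U+1F6E5 counts as one column, while U+1F6E5 U+FE0F counts as two. This is a general rule keyed on the variation selector rather than a per-emoji quirk.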

jankatins commented 2 years ago

Thank you for the explanation!

SlySven commented 2 years ago

It doesn't help when the framework one is using (Qt) does not respect/handle the U+FE0E (Text presentation) and U+FE0F (Emoji presentation) Unicode variation selectors... QTBUG-97401