Multi-codepoint emojis - Githubissues

willmcgugan commented 4 years ago

Hi,

Can wcwidth help me with multi-codepoint emojis?

For instance, here I want to get the cell width for a "woman_mechanic_dark_skin_tone" emoji, which renders in the terminal as 2 cells, but wcswidth reports a width of 6 because it is adding up all the modifiers.

>>> s="👩\U0001F3FF\u200d🔧"
>>> print(repr(s))
'👩🏿\u200d🔧'
>>> from wcwidth import wcswidth
>>> wcswidth(s)
6
>>> print(s+"\n--")
👩🏿‍🔧
--

I've found support for this kind of emoji to be inconsistent across terminals, so maybe this is a lost cause, but is there some kind of standard for these emoji modifiers?
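To see where the 6 comes from, it can help to list the codepoints in the string. The following is a minimal sketch using only the standard library `unicodedata` module; the helper name `describe` is made up for illustration and is not part of wcwidth. It prints each codepoint with its East Asian Width class, the property that wide-character tables are typically derived from.

```python
import unicodedata

# The same four codepoints as in the example above, written with explicit escapes:
# woman, dark skin tone modifier, zero width joiner, wrench.
s = "\U0001F469\U0001F3FF\u200D\U0001F527"


def describe(text):
    """Return (codepoint, East Asian Width class, Unicode name) for each codepoint."""
    return [
        (
            f"U+{ord(ch):04X}",
            unicodedata.east_asian_width(ch),
            unicodedata.name(ch, "<unnamed>"),
        )
        for ch in text
    ]


for row in describe(s):
    print(*row)
```

Three of the four codepoints are classed as wide ("W"), so a per-codepoint sum gives 2 + 2 + 0 + 2 = 6 even though the terminal draws a single 2-cell glyph.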

jquast commented 4 years ago

I think wcwidth/wcswidth should help somehow, yes. These didn’t exist in the first release of the wcwidth.c that this code is based upon, and when updating for new specs I failed to parse them from the data files or otherwise take them into account. This is a bug/missing feature, thanks!

willmcgugan commented 4 years ago

That's great, thanks.

Hope this doesn't complicate things too much. I've been learning about how these emoji are encoded, and all I can say is yuck.

You might know this already... there is a skin tone modifier which changes the skin tone of the preceding emoji and would have zero width. But it can also appear by itself and is rendered as a colored box if not preceded by an emoji taking up 2 cell widths (at least on iterm). That can be followed by a "zero width joiner" character which attaches another codepoint. In my first example that would be a wrench symbol, which makes the emoji a mechanic. All this was gleaned from https://emojipedia.org/
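The rules described above can be sketched roughly as follows. This is only a heuristic written for illustration, not the algorithm wcwidth uses: the function names are hypothetical, the per-codepoint width is approximated from `unicodedata`, and it assumes that a ZWJ only joins onto a preceding 2-cell glyph.

```python
import unicodedata

ZWJ = "\u200d"


def is_skin_tone(ch):
    """True for the five emoji skin tone modifiers U+1F3FB..U+1F3FF."""
    return 0x1F3FB <= ord(ch) <= 0x1F3FF


def naive_cell_width(ch):
    """Rough per-codepoint width: 0 for marks/format/control, 2 for wide, else 1."""
    if unicodedata.category(ch) in ("Mn", "Me", "Cf", "Cc"):
        return 0
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def approx_width(text):
    """Approximate cell width, folding skin tones and ZWJ sequences into one glyph."""
    total = 0
    prev_wide = False  # the current glyph started with a 2-cell codepoint
    joined = False     # the previous codepoint was a ZWJ attached to a wide glyph
    for ch in text:
        if ch == ZWJ:
            joined = prev_wide
            continue
        if joined:
            # The codepoint after a ZWJ merges into the preceding glyph.
            joined = False
            continue
        if is_skin_tone(ch) and prev_wide:
            # A modifier recolors the preceding emoji and adds no cells.
            continue
        width = naive_cell_width(ch)
        total += width
        prev_wide = width == 2
    return total


print(approx_width("\U0001F469\U0001F3FF\u200D\U0001F527"))  # woman mechanic, dark skin tone
print(approx_width("\U0001F3FF"))  # a lone modifier is counted as its own 2-cell box
```

Whether a given terminal actually renders the sequence as one glyph is a separate question, so a result like this is at best what a conforming terminal would show.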

jquast commented 4 years ago

I began to draft some code for this purpose a bit ago, pushed branch https://github.com/jquast/wcwidth/tree/emoji-zwj

I think the hardest parts are done (parsing Unicode data files for emoji ZWJ), but it is still WIP.

tonycpsu commented 3 years ago

@jquast any update on this WIP? I was going to see if I could move the ball forward, but when I try your branch, I get:

ModuleNotFoundError: No module named 'wcwidth.table_emoji_zjw'

Looks like the file containing the table wasn't checked in.

jquast commented 3 years ago

Try running tox. The tables are made by code generation; I think it is documented. I do hope to resume this issue in the next month or so, thanks for your interest.

jquast commented 3 years ago

bin/update-tables.py

DragonRulerX commented 3 years ago

I just pulled wcwidth for the first time today when using tabulate in python. I decided to dive in to that code and found that tabulate relies heavily on this library. So, I figured I may as well post this here as well just in case it helps with visibility of the issue https://github.com/astanin/python-tabulate/issues/108

jquast commented 3 years ago

I think that wcswidth returning -1 for any non-printable or indeterminate characters has caused folks to rely on cheats like sum(max(0, wcwidth(u)) for u in unicode_string). The problem with that is that we wouldn’t be able to determine multi-codepoint emoji lengths.

The -1 return value is probably not a good idea for Python; it’s simply an API compatible with all other wcswidth implementations.
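As a rough illustration of the two behaviours, here is a self-contained sketch. `codepoint_width` is a hypothetical stand-in for a per-codepoint wcwidth() built on `unicodedata`, not the library's real tables; `strict_width` mimics the POSIX-style "-1 on any non-printable" convention, and `clamped_sum` is the workaround described above.

```python
import unicodedata


def codepoint_width(ch):
    """Hypothetical stand-in for wcwidth(): -1 for control characters."""
    category = unicodedata.category(ch)
    if category == "Cc":
        return -1
    if category in ("Mn", "Me", "Cf"):
        return 0
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def strict_width(text):
    """POSIX-style: give up with -1 as soon as one codepoint is non-printable."""
    total = 0
    for ch in text:
        width = codepoint_width(ch)
        if width < 0:
            return -1
        total += width
    return total


def clamped_sum(text):
    """The per-codepoint workaround: treat -1 as 0 and add everything up."""
    return sum(max(0, codepoint_width(ch)) for ch in text)


emoji = "\U0001F469\U0001F3FF\u200D\U0001F527"  # woman mechanic, dark skin tone
print(strict_width("ab\n"), clamped_sum("ab\n"))
print(clamped_sum(emoji))
```

Because `clamped_sum` only ever looks at one codepoint at a time, no fix inside the per-codepoint function can make it return 2 for the emoji; the sequence has to be measured as a whole string.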

This WIP branch proposes a new API function, wcwidth.width, that just does its best to return the width of a full string, with no -1 return value. If a control character like \n or \t is in there, we just ignore it; downstream libraries will have to do their own checks and measures for that.

As a new function, we remain API compatible, but downstream libraries will want to use the new function for this feature, which I’ll probably also try to submit to the top 10 or so.

DragonRulerX commented 3 years ago

I'm a little confused. Are you saying there is a fix for the issue I linked above or that this is still a WIP? I'm hoping to either patch in the fix myself if there is one or to pull down the new library update when it's available.

tonycpsu commented 2 years ago

Any updates here? A lot of downstream projects are looking for a fix.

jquast commented 1 year ago

Fixed by #91 in today's release.

I also wrote a tool to test terminals for Emoji ZWJ for anyone interested, https://pypi.org/project/ucs-detect/

jquast / wcwidth

Multi-codepoint emojis #39