jquast / wcwidth

Python library that measures the width of unicode strings rendered to a terminal

Specification: mention width for variation selector VS16 (U+FE0F) and VS15 (U+FE0E) #109

Closed erf closed 7 months ago

erf commented 8 months ago

I haven't looked into the code yet, but I would think you check for variation selector VS16 (U+FE0F) and give that a width of 2, and a width of 1 for VS15 (U+FE0E), and that this should be specified in the specs.

https://en.wikipedia.org/wiki/Variation_Selectors_%28Unicode_block%29

jquast commented 7 months ago

Ah sorry, it is in the specification, but there seems to be an issue with the readthedocs integration: it failed to rebuild the docs when the specification was last updated to include it. I'll close this when that is fixed.

erf commented 7 months ago

BTW, do you really need a table to check for U+FE0F? Isn't it enough to just check the codepoints of the Unicode string in question?

jquast commented 7 months ago

I do not really need the table, but it is helpful to be more accurate.

Without a lookup table, we would incorrectly measure the letter 'a' followed by U+FE0F as a width of 2.
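To make the difference concrete, here is a minimal sketch comparing a table-free check with a table-backed one. It is not the wcwidth implementation: `cell_width`, `width_without_table`, `width_with_table`, and the two-entry `VS16_BASES` set are hypothetical names for illustration, and a real table would be generated from emoji-variation-sequences.txt.

```python
import unicodedata

VS16 = "\uFE0F"

# Assumption: a tiny hand-picked subset of base characters that are listed
# with VS16 in emoji-variation-sequences.txt. A real table is much larger.
VS16_BASES = {"\u2764", "\u263A"}


def cell_width(ch):
    """Rough width of a single character, ignoring its neighbours."""
    if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0  # combining marks and format characters (VS16 is Mn)
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1


def width_without_table(text):
    """Table-free: VS16 widens ANY narrow character before it."""
    total, prev = 0, 0
    for ch in text:
        if ch == VS16 and prev == 1:
            total += 1  # promote the previous narrow cell to wide
            prev = 2
            continue
        prev = cell_width(ch)
        total += prev
    return total


def width_with_table(text):
    """Table-backed: VS16 widens only bases that form a valid sequence."""
    total, prev_ch = 0, ""
    for ch in text:
        if ch == VS16 and prev_ch in VS16_BASES and cell_width(prev_ch) == 1:
            total += 1
        else:
            total += cell_width(ch)
        prev_ch = ch
    return total
```

With these definitions, `width_without_table("a\uFE0F")` reports 2, while `width_with_table("a\uFE0F")` reports 1 and still measures `"\u2764\uFE0F"` as 2.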

This is similar to other combining characters. I wrote about it briefly here, under the section "Zero Width":

While it may be possible to combine some combining characters with any other Unicode characters [...] this is not the case for most combining characters, which can only combine with specific characters

Because emoji-variation-sequences.txt is published, it is possible to generate this table and easily make this distinction in code.

If it were also possible to easily determine which combining characters would result in the "dotted donut" described in the article, I would also do that. But, because unicode.org does not publish data files about this, wcwidth does not make any attempt to make this distinction.

I wrote about this in more detail in the recent update for Hangul Jamo https://github.com/jquast/wcwidth/blob/36a625179ed2675287fe6b61c2ad319406449e60/tests/test_core.py#L232-L238

In this case, both characters are Wide when displayed by themselves (2+2=4). But, they may also combine when in sequence (2+0=2). This library takes the approach that the JUNGSEONG (vowel) is only expected to be displayed in sequence for combination.
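As a rough illustration of that trade-off (not the library's actual code), the sketch below counts a Jungseong as zero width on the assumption that it always follows a leading consonant. The `JUNGSEONG` range (U+1160 through U+11A7) and the `jamo_width` name are assumptions made for this example.

```python
import unicodedata

# Assumption: Hangul Jungseong (vowel) jamo occupy U+1160..U+11A7.
JUNGSEONG = range(0x1160, 0x11A8)


def jamo_width(text):
    """Measure text, treating every Jungseong as combining (width 0)."""
    total = 0
    for ch in text:
        if ord(ch) in JUNGSEONG:
            continue  # expected to combine with the preceding Choseong
        wide = unicodedata.east_asian_width(ch) in ("W", "F")
        total += 2 if wide else 1
    return total
```

Here `jamo_width("\u1100\u1161")` gives 2 (the 2+0 case). The cost of this choice is that a vowel displayed on its own is also measured as 0.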

erf commented 7 months ago

Thanks for the explanation. It seems safer to use the document, even though you're unlikely to find a combining character where it's not supposed to be.

BTW, I like the idea of this specification for your Python wcwidth implementation. However, have you thought about contributing to a more official spec of render width, a collaboration with authors of similar libraries (zigglyph, utf8proc, etc.) or terminal app developers, which implementers can look to for guidance? The Unicode documents are there, but they are a bit confusing when trying to figure out render width. I've been struggling with it myself while working on my editor, vid. Sorry for posting so much here, but I wasn't sure whether I should open a new issue for this.

Also, if you're interested, Mitchell (working on Ghostty now; I'm in the beta) did an interesting write-up on grapheme clusters in terminals, where he mentions mode2027 for Unicode support in terminals.

jquast commented 7 months ago

No problem, I'm happy to discuss all of these things, especially now that I'm currently unemployed. Because this wcwidth project receives the most downloads, I have decided to give it the most attention for the time being. If there is any common place for terminal developers to have a discussion, I would definitely like to join.

And after spending far too much time on unicode.org documents to make this very basic Python library, I was driven to publish this open specification of wcwidth() and wcswidth() behavior. I am on a bit of a mission to get this specification correct, and to get terminals to comply with it. I will be opening bugs and PRs with open source terminal emulators over the coming weeks. Outside of this specification and the ucs-detect tool that I have authored, I'm not sure I can be useful to these other projects.

I did check out your vid editor, I was wondering what made you so interested in the specification and I presumed that was it. Great work by the way!! I have also written some terminal editors in the past, some of them vi-like, it is a great experience! I have had some ideas of integrating a vi-like editor directly into a keyboard.

I have read Mitchell's articles on Ghostty, and I appreciate the developer blog posts. I didn't know he had shared the program with anyone else until you mentioned that you are in the beta. You might like to use the ucs-detect tool on Ghostty; as a beta user, I would encourage you to share this tool and its results with Mitchell.

I did review the contour spec a few weeks ago... I couldn't use contour on my primary macOS machine without upgrading the OS, and I'm not really inclined to do that right now, so I have given it very little attention, but I did briefly use it on Windows 10 for the ucs-detect tool; results here: https://ucs-detect.readthedocs.io/sw_results/Contour.html#contour

On the specification, I can only say that I appreciate the ability to query for the support or current state of mode2027. This is the most difficult problem with terminals: you cannot reliably query them for support of many features. How can we know if a terminal supports 24-bit color, sixel, or the proprietary sequences that have been added to Kovid Goyal's Kitty or iTerm2 and sometimes replicated in other terminals? We cannot even reliably determine what terminal is in use, or its version, much less its features. Many of these proprietary enhancements do not come with a "do you support this?" sequence, a terrible oversight.

But it is easy enough to test a terminal for support of ZWJ, VS-16, and even the version level of Unicode for wide characters, or whether certain languages with zero-width combining sequences are supported, through "Query Cursor Position", as ucs-detect does (see "how it works").

And so, it should be possible to determine "Does it support VS-16?" and act accordingly by adding an extra padding space character. Though I will be opening bug reports and PRs with popular terminals to support it anyway, so I hope we won't need to do that.
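A minimal sketch of that idea follows. It is not how ucs-detect is written: `parse_cpr`, `measured_width`, and `pad_for_vs16` are hypothetical names, and the `write` and `read_reply` callables are assumptions standing in for real terminal I/O. It asks the terminal for the cursor position (the standard `ESC [ 6 n` query, answered with `ESC [ row ; col R`), prints a sequence, asks again, and treats the column difference as the width the terminal actually used.

```python
import re

CPR_QUERY = "\x1b[6n"                          # request cursor position
CPR_REPLY = re.compile(r"\x1b\[(\d+);(\d+)R")  # reply: ESC [ row ; col R


def parse_cpr(reply):
    """Extract (row, col) from a cursor position report."""
    match = CPR_REPLY.search(reply)
    if match is None:
        raise ValueError("not a cursor position report: %r" % reply)
    return int(match.group(1)), int(match.group(2))


def measured_width(text, write, read_reply):
    """Columns the cursor advanced while printing text.

    Assumes the text does not wrap onto a new line.
    """
    write(CPR_QUERY)
    _, start_col = parse_cpr(read_reply())
    write(text)
    write(CPR_QUERY)
    _, end_col = parse_cpr(read_reply())
    return end_col - start_col


def pad_for_vs16(sequence, terminal_supports_vs16):
    """Add one padding space when the terminal renders VS-16 as narrow."""
    return sequence if terminal_supports_vs16 else sequence + " "
```

On a real terminal, `write` and `read_reply` would have to talk to the tty with echo and line buffering disabled (for example via the `termios` and `tty` modules), which is left out here. A terminal could then be treated as supporting VS-16 when `measured_width("\u2764\uFE0F", write, read_reply)` returns 2.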

On contour's specification, I found it terribly ambiguous and not very comprehensive. As my ucs-detect tool shows, almost all terminals support some level of "grapheme cluster processing", but not for every kind. There is a lot of undefined behavior in the specification, and it defers far too much to unicode.org specifications, which are not written with terminals or fixed-width fonts with grid representation in mind. Those specifications are also very ambiguous! A lot of "may" and "could" kind of language.

I also don't see any purpose in disabling "grapheme cluster processing". I can't think of any single reason to do so!

erf commented 7 months ago

Thanks for your answer! The guys over at Ghostty have already tested your Unicode tool and all tests were OK, except that the zigglyph library they use doesn't yet support Unicode version 15.1, only 15, but this will probably be fixed soon. Also, a recent change in Ghostty made your tool crash, so I opened an issue on that. I opened a lot of issues on Ghostty regarding emoji support while working on my editor, which Mitchell has fixed effectively! Also, Ghostty supports mode 2027 by default (but it is optional). The Ghostty terminal is shaping up great and is my daily driver. You should definitely join the beta on the Discord app. I'm sure Mitchell will let you in sooner rather than later. I too have had more time on my hands lately, so I've been able to tinker with various things, which has been quite fun. Your work on this wcwidth spec is great; I think it can be a great resource for terminal developers, and I hope it gets more eyes across languages!

jquast commented 7 months ago

Just to add, the libvte developer Egmont Koblinger refuses to implement support for VS-16 :( It appears that he thinks I have made up a standard artificially, when I have only clarified a specification from unicode.org technical reports and existing implementations in 9+ other terminals.

You are pushing for a behavior that you and some other players would like to have, some have already implemented (thereby breaking a previously consistent ecosystem), and is not by any means a standard that the vast majority of involved parties would agree on.

I must not have communicated very clearly; he appears to be a very difficult person: https://gitlab.gnome.org/GNOME/vte/-/issues/2580#note_1978442

I will continue to try to gain VS-16 support in other terminals, anyway.

erf commented 7 months ago

I wouldn't spend too much energy convincing people who don't want to keep up with modern standards. Luckily there are modern alternatives like Ghostty (not released yet, but probably will be soonish) and Wezterm.