Open stevengj opened 9 years ago
In case it is helpful, the draft table of character widths that we are currently planning to use can be found in this CharWidths.txt gist (each line of which is codepoints; width), where non-printing characters are assigned a width of 0. This is generated automatically from the Unicode 7 tables combined with font metrics from GNU Unifont, as described in JuliaLang/utf8proc#27.
Interesting, what terminal are you testing the "character cells consumed when printed" on OSX? I too am using OSX, and on iTerm2 it displays as "a-b", consuming 3 characters, so wcwidth would be correct, here... I would need to see evidence of it not forwarding the cell when printed on at least some terminal emulators, and file bugs for the others. Just to be very clear, the purpose of wcwidth is "printable width on a terminal", and not firefox or anything else (for which such character is hidden).
Also, I don't necessarily trust the OS-provided 'wcwidth', they are typically based on very old (5-10 years old) unicode specifications. I have a program I've tested on osx and linux, both are wildly different, and in each case my version was correct: https://github.com/jquast/wcwidth/blob/master/bin/wcwidth-libc-comparator.py
The combining and wide character tables are programmatically updated by "python setup.py update", which is similar to your https://github.com/JuliaLang/utf8proc/pull/27/files#diff-3832b9cfe2fc10d35ac5c63d9b7b8133R20
There are no Unicode specification reference tables for 0-width characters that I know of, so it's just hardcoded here: https://github.com/jquast/wcwidth/blob/master/wcwidth/wcwidth.py#L161-171
Using the 'Cf' category listings on iTerm2, it appears the following all consume 1 character cell, some with symbols, some simply by blanks
00AD
0600
0601
0602
0603
0604
0605
061C
06DD
070F
200E
200F
202A
202B
202C
202D
202E
2060
2061
2062
2063
2064
2066
2067
2068
2069
206A
206B
206C
206D
206E
206F
FFF9
FFFA
FFFB
And the following consume 0 cells:
180E
200B
200C
200D
FEFF
which may indeed need to be supported by wcwidth once I test a few more terminals.
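A listing like the one above can be regenerated from Python's standard `unicodedata` module. This is only a minimal sketch: the helper name `cf_codepoints` is hypothetical, and the result depends on the Unicode version bundled with the interpreter, so it may contain code points that are not in the lists above.

```python
import unicodedata


def cf_codepoints(start=0x0000, stop=0xFFFF):
    """Return the code points in [start, stop] whose general category is Cf."""
    return [
        cp
        for cp in range(start, stop + 1)
        if unicodedata.category(chr(cp)) == "Cf"
    ]


if __name__ == "__main__":
    # The Unicode version matters: category assignments change between releases.
    print("Unicode version:", unicodedata.unidata_version)
    for cp in cf_codepoints():
        print(f"{cp:04X}")
```

Printing each of these between two visible characters (for example "a" and "b") in a terminal is then a manual way to see how many cells each one consumes.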
(We don't trust the system-provided wcwidth either, for the same reason as you, which is why we compute the widths independently. However, the OSX 10.10.2 wcwidth agrees with our results when it returns a nonnegative value, so it mostly seems to have errors of omission: it returns -1 for many valid printable characters from recent Unicode standards. Moreover, U+00AD has been part of Unicode since 1993, so I would think that most wcwidth implementations would handle it properly.)
There is an interesting article on the soft hyphen, which apparently has had a controversial history, and is rendered in different ways depending on the font and the rendering system. I'm not sure what the right answer is here, but the Unicode standard seems to somewhat favor the viewpoint that it should be invisible, although it leaves it up to the implementation. However, the article mentions that the Unicode FAQ does say "In a terminal emulation environment, particularly in ISO-8859-1 contexts, one could display the soft hyphen as a hyphen in all circumstances", and maybe that is what is done in practice.
cc: @jiahao and @StefanKarpinski.
Note that the Arabic characters U+0601 etc. are defined by the Unicode standard as exceptions to the usual rule that Cf characters are invisible:
In contrast, e.g. U+200E is a left-to-right mark, and in my understanding is defined as an invisible formatting character that controls the direction of the text. Some terminals may give it a nonzero width (although the MacOS Terminal with the default font gives it zero width on my machine), but that seems like a bug in the terminal (or the font); it seems like it is better to return what the Unicode standard says rather than propagating a particular buggy implementation.
I remember that article about the soft hyphen. Under "Modern Unicode semantics" it references UAX 14 for Unicode 7.0.0, §5.4, which says:
Unlike U+2010 hyphen, which always has a visible rendition, the character U+00AD soft hyphen (shy) is an invisible format character that merely indicates a preferred intraword line break position. If the line is broken at that point, then whatever mechanism is appropriate for intraword line breaks should be invoked, just as if the line break had been triggered by another hyphenation mechanism, such as a dictionary lookup.
The description in the following paragraphs suggests that the rendering of a soft hyphen is accomplished not by printing the soft hyphen itself, but rather by inserting an additional, printable hyphen glyph:
The inserted hyphen glyph can take a wide variety of shapes, as appropriate for the situation. Examples include shapes like U+2010 hyphen, U+058A armenian hyphen, U+180A mongolian nirugu, or U+1806 mongolian todo soft hyphen.
Based on this description it would seem that the character U+00AD by itself is nonprintable and should have a width of 0 or -1.
Interestingly, the Unicode FAQ entry that the SHY article quoted seems to no longer exist. From the passage in the Unicode 7.0.0 standard that @jiahao quoted, it seems like the Unicode consortium decided to put its foot down and declare that the soft hyphen is definitely invisible, ISO 8859-1 be damned.
I really appreciate all of the research, @stevengj and @jiahao.
My decision is to use the common denominator across the most popular terminal emulators for wcwidth. I might make a note of it in the readme that it deviates from the standard, as the primary purpose of this project is how text is displayed by the most common (utf-8 capable) terminal emulators.
I've made a checklist:
Then, test the following and report:
I'm not sure how to gauge the "popular terminal emulators", this is just from memory.
Sidenote: more importantly, how to factor their weight in wcwidth for any given differences? Perhaps there could be some way to configure how the printable width of such discrepancies is reported, if the consumer of wcwidth knows their target audience's emulator. Unfortunately, all such terminals borrow the common value "xterm" or "xterm-256color" for the TERM environment variable. One could use the response to the "answerback sequence" (^E), which at least PuTTY replies to, but I'm afraid that's far out of scope for wcwidth; it would require interaction with a terminal driver.
Finally, we can make a PR and release any update.
@jquast thanks for your detailed consideration. As you had stated above, iTerm seems to have different needs from us at this point.
However, I don't think it is possible to provide consistency across terminal environments without considering also the interactions with the choice of users' fonts. Many fonts simply have wrong advance widths for some code points.
Here is a simple rendering test for the fixed-width fonts on my system. Consider
U+003C9 U+00302= \omega\hat = ω̂
should render with the hat combining character on the omega.
U+00302 U+003C9 = \hat\omega = ̂ω
should render with a hat to the left of omega.
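To make the two orderings concrete, the sketch below inspects both strings with Python's `unicodedata`. The helper `naive_cells` is a hypothetical, deliberately simplistic estimate (zero cells for nonspacing and enclosing marks, one cell otherwise) and is not the algorithm used by wcwidth or utf8proc. It shows that a per-character width table assigns the same total to both orderings, so the table alone cannot capture the rendering difference described above.

```python
import unicodedata

OMEGA = "\u03c9"       # GREEK SMALL LETTER OMEGA
CIRCUMFLEX = "\u0302"  # COMBINING CIRCUMFLEX ACCENT


def naive_cells(text):
    """Estimate cells: 0 for nonspacing/enclosing marks (Mn, Me), 1 otherwise."""
    return sum(
        0 if unicodedata.category(ch) in ("Mn", "Me") else 1
        for ch in text
    )


if __name__ == "__main__":
    for s in (OMEGA + CIRCUMFLEX, CIRCUMFLEX + OMEGA):
        codepoints = " ".join(f"U+{ord(ch):04X}" for ch in s)
        print(codepoints, "->", naive_cells(s), "cell(s)")
```

Both orderings sum to one cell under this estimate; whether the hat lands on the omega, to its left, or in a cell of its own is decided by the font and the renderer.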
You are correct, but terminal emulators don't typically care, they're the ones who handle the width of "printable cells" -- What is your system, is it a terminal emulator?
The screenshots I pasted were taken from an IPython notebook rendering test HTML using those fonts. I can see the same spacing issues if I manually change the font in OSX Terminal and generate these characters in the Julia console REPL.
Version wcwidth 0.1.5 which includes better combining character width determination by PR #11 is available on pypi.
A terminal sequence may be emitted to elicit a response from the terminal emulator containing its cursor position.
This can be used to display all questionable characters across different popular font profiles and terminal emulators, and to programmatically determine whether they treat such characters as 0 width, producing a report of the most common discrepancies and resolving them on the side of "most correct".
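The cursor-position technique can be sketched as follows, assuming a POSIX system (the `termios` and `tty` modules) and a terminal that answers the standard "device status report" query `ESC [ 6 n` with `ESC [ row ; col R`. The function names here are hypothetical and this is not the code used by wcwidth or ucs-detect; it only illustrates the idea.

```python
import re
import sys
import termios
import tty

CPR_QUERY = "\x1b[6n"  # DSR: ask the terminal to report the cursor position
CPR_REPLY = re.compile(r"\x1b\[(\d+);(\d+)R")


def parse_cpr(reply):
    """Parse an 'ESC [ row ; col R' reply into (row, col), or None."""
    match = CPR_REPLY.search(reply)
    return (int(match.group(1)), int(match.group(2))) if match else None


def cursor_column():
    """Query the terminal and return the current cursor column (1-based)."""
    fd = sys.stdin.fileno()
    saved = termios.tcgetattr(fd)
    try:
        tty.setcbreak(fd)  # read the reply unbuffered and without echo
        sys.stdout.write(CPR_QUERY)
        sys.stdout.flush()
        reply = ""
        while not reply.endswith("R"):
            reply += sys.stdin.read(1)
    finally:
        termios.tcsetattr(fd, termios.TCSADRAIN, saved)
    return parse_cpr(reply)[1]


def measured_width(text):
    """Return how many cells the cursor advanced after printing text."""
    sys.stdout.write("\r")
    before = cursor_column()
    sys.stdout.write(text)
    sys.stdout.flush()
    after = cursor_column()
    sys.stdout.write("\r\x1b[K")  # return to column 1 and erase the probe
    sys.stdout.flush()
    return after - before


if __name__ == "__main__" and sys.stdin.isatty():
    print("U+00AD advances the cursor by", measured_width("\u00ad"), "cell(s)")
```

Running this in each terminal emulator and font profile of interest gives the measured width directly, which can then be compared against the value a wcwidth implementation returns.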
This is closed by https://github.com/jquast/wcwidth/pull/91
About U+00AD in particular, it is part of the Cf category, and the entire 'Cf' category is now classified as zero-width, along with 'Mc', 'Zl', 'Zp', and part of the 'Sk' category. I have written this specification that describes precisely how the width of characters is determined: https://github.com/jquast/wcwidth/blob/master/docs/specs.rst#width-of-0 I hope it is helpful.
This issue also talked about the need to best match the behavior of popular terminals. I have also published an automatic testing tool for wide, zero, combining, and emoji ZWJ sequences. Though this only works with Python's wcwidth, the technique would be very easy to copy to, or use to aid, other languages or wcwidth implementations: https://pypi.org/project/ucs-detect/
And finally, "BIDI" text was mentioned, I suggest to see related resource https://gist.github.com/XVilka/a0e49e1c65370ba11c17 about the state of BIDI, it has had some traction in the last few years, in any case the 'ucs-tool' appears to verify left-to-right text with wcwidth is ok. The LTR marker is 0-width.
For reference, in glibc wcwidth(0xad) appears to be 1, judging by this discussion, which concluded that it should be 1: https://sourceware.org/bugzilla/show_bug.cgi?id=22073
That discussion took place in 2017, after the main discussion in this issue, but before the last comment here: https://github.com/jquast/wcwidth/issues/8#issuecomment-1785907194
Also, in musl-libc, 0xad has a wcwidth of 1 as well.
It's a bit ambiguous isn't it? From https://codepoints.net/U+00AD,
is a code point reserved in some coded character sets for the purpose of breaking words across lines by inserting visible hyphens if they fall on the line end but remain invisible within the line.
I will add a test to ucs-detect and whichever measured width (0 or 1) that is used among the most popular and compliant terminals will be used in this library.
For what it's worth, the musl-libc maintainer, @richfelker said on IRC that he thinks it should be 1 because historically it was 1 (in most/all implementations?), and, quoting, "(dalias) unless there's widespread agreement between terminals and wcwidth implementations, all you get by changing it is screen corruption".
Additionally, it was not discussed on the musl mailing lists, possibly because that was acceptable (or no one noticed or cared?).
Additionally, he noted that if anything, it should have probably been -1 and not 0, because if applied, then it affects formatting, not unlike carriage-return or newline or form-feed etc.
And finally, he mentions that "it's widely unused anyway", which is probably true, hence probably not too important overall, though agreement between wcwidth implementations would still be nice.
Thanks for relaying @richfelker’s thoughts, I’m in full agreement with all of them, especially for -1, as this kind of character is meant to be managed by the terminal emulator, and its width is indeterminate (like \n, \t, etc). But if the most popular terminal emulators measure it as width of 1 then I’d like to match.
if the most popular terminal emulators measure it as width of 1 then I’d like to match
Right.
I would guess that terminals measure its width according to the wcwidth implementation which they use? And I would also guess that typically that would be whatever libc provides? (Not including Windows Terminal, which brings its own implementation, because on Windows there's no system wcwidth.)
And so ultimately, I would think the goal should be agreement between wcwidth implementations, rather than between this implementation and the behavior of popular terminal emulators?
Ultimately, the utf8proc library decided to report a width of 1 for U+00AD as well, in order to agree with other wcwidth implementations, and with typical terminal programs which display a soft hyphen as a visible - glyph.
I would guess that terminals measure its width according to the wcwidth implementation which they use? And I would also guess that typically that would be whatever libc provides?
Well, that was not a good argument, and I would agree that if this was the only or main wcwidth implementation, then it should try to match the common terminal emulators behavior.
But because this is one of several wcwidth implementations, its goal should be to agree with other wcwidth implementations rather than with the terminals.
That being said, it would still be nice to know how terminals handle it.
In which case, the test should be dual:
I would guess that most terminals don't handle it dually like the Unicode semantics suggests (and would imply a -1 wcwidth value), hence they probably treat it as always 1 or always 0, though that's a guess.
In which case, the test should be dual...
So, I tested it in the following terminals on Alpine linux 3.19.1, and all the tested terminal emulators treat it either as hard 0 or hard 1. I.e. no terminal handles it dually as 0 at the middle of the line and hyphen+wordbreak in a word which spills over the end of the line.
Specifically, I tested using this script, and observed the result on-screen (not automated). The SHY byte is always in this word, xxx<SHY>yyy:
EDITED: THIS SCRIPT IS BROKEN AND THE RESULTS ARE INVALID. See fixed script at the next post.
All the terminals were invoked with UTF-8 locale, e.g.:
LC_ALL=en_US.UTF-8 xterm
Results:
xterm 388, VTE (tested {gnome,xfce4,lx}-terminal), konsole 23.08.4, and st 0.9: always display it as U+FFFD REPLACEMENT CHARACTER, as if wcwidth(0xad) == 1:
urxvt: similar to xterm etc. above, but always displays it as a hyphen, as if wcwidth(0xad) == 1.
alacritty 0.12.3 and kitty 0.31.0: seem to ignore it at the input, as if wcwidth(0xad) == 0:
So while 1 is common, I don't think it's black and white.
So I would think the goal should be to match other wcwidth implementations, where the value appears to be 1 at least in glibc, musl, and utf8proc.
Actually, the test script above is wrong. It printed the byte 0xad (which is an invalid UTF-8 sequence) rather than the UTF-8 sequence for U+00AD, which is 0xc2 0xad.
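The difference between the lone byte and the proper encoding can be checked in a few lines of Python. This is only an illustration of the encoding issue, not the test script itself; the word xxx<SHY>yyy is taken from the description above.

```python
SHY = "\u00ad"  # SOFT HYPHEN

# The UTF-8 encoding of U+00AD is two bytes: 0xc2 0xad.
utf8 = SHY.encode("utf-8")
print(utf8.hex(" "))

# A lone 0xad is a continuation byte without a lead byte: invalid UTF-8.
lone = b"xxx\xadyyy"
try:
    lone.decode("utf-8")
except UnicodeDecodeError as exc:
    print("invalid UTF-8:", exc.reason)

# A lenient decoder substitutes U+FFFD REPLACEMENT CHARACTER for the bad byte.
decoded = lone.decode("utf-8", errors="replace")
print([f"U+{ord(ch):04X}" for ch in decoded])
```

This is consistent with several terminals in the first round of results showing U+FFFD: they were reacting to an invalid byte, not to a soft hyphen.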
This is the revised script:
And these are the results at the various terminals (kitty doesn't have "kitty" at the title, and xfce4-terminal and gnome-terminal have the same result as lxterminal - as all are VTE-based):
Like before, this is on Alpine Linux 3.19.1 with the terminals installed from the distro packages repository, and all terminals were invoked after exporting LC_ALL=en_US.UTF-8.
Results:
Here's a summary of the U+00AD SOFT HYPHEN behavior:
Therefore I think it should be added/restored as an overriding exception (return 1 for 0x00ad), to reflect terminal behavior and align with other wcwidth implementations.
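An overriding exception of this kind could look roughly like the sketch below. Everything in it is hypothetical: the names `WIDTH_OVERRIDES` and `char_width` are made up for illustration, and the category rules are a simplification built on Python's `unicodedata`, not the actual tables of wcwidth, glibc, musl, or utf8proc.

```python
import unicodedata

# Hypothetical override table: these widths take precedence over category rules.
WIDTH_OVERRIDES = {0x00AD: 1}  # SOFT HYPHEN, to match terminals and libc

# Simplified assumption: categories treated as zero-width in this sketch.
ZERO_WIDTH_CATEGORIES = {"Cf", "Mn", "Me", "Zl", "Zp"}


def char_width(ch):
    """Return -1, 0, 1 or 2 for a single character, overrides first."""
    cp = ord(ch)
    if cp in WIDTH_OVERRIDES:
        return WIDTH_OVERRIDES[cp]
    if cp == 0:
        return 0
    category = unicodedata.category(ch)
    if category == "Cc":
        return -1  # control characters have no printable width
    if category in ZERO_WIDTH_CATEGORIES:
        return 0
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1


if __name__ == "__main__":
    for ch in ("a", "\u00ad", "\u200e", "\u4e2d", "\n"):
        print(f"U+{ord(ch):04X}", char_width(ch))
```

Checking the override before the category lookup keeps the exception local: U+00AD gets width 1 while the rest of Cf, such as the left-to-right mark U+200E, stays at 0.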
utf8proc now returns 1 as well (https://github.com/JuliaStrings/utf8proc/pull/135).
Hi, I was looking at your wcwidth library for comparison, since in the utf8proc library we are also implementing a similar feature (see JuliaLang/utf8proc#2). The first disagreement that I came across between your implementation and ours was for U+00AD (soft hyphen), where you seem to give 1 and we give zero (a soft hyphen is used for line breaking, but is ordinarily not printed). In general, we return 0 for most characters in category Cf (formatting control characters). The wcwidth function on MacOS 10.10.2 also returns -1 (not printable) for this code point.
Am I calling your implementation incorrectly? This is for git master of wcwidth.