jquast / wcwidth

Python library that measures the width of unicode strings rendered to a terminal

wrong width for U+00AD #8

Open stevengj opened 9 years ago

stevengj commented 9 years ago

Hi, I was looking at your wcwidth library for comparison, since in the utf8proc library we are also implementing a similar feature (see JuliaLang/utf8proc#2). The first disagreement that I came across between your implementation and ours was for U+00AD (soft hyphen), where you seem to give 1

>>> from wcwidth import wcwidth
>>> wcwidth(unichr(173))
1

and we give zero (a soft hyphen is used for line breaking, but is ordinarily not printed). In general, we return 0 for most characters in category Cf (formatting control characters). The wcwidth function on MacOS 10.10.2 also returns -1 (not printable) for this code point.

Am I calling your implementation incorrectly? This is for git master of wcwidth.
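
For readers on Python 3, where `unichr` no longer exists, the category of U+00AD can be checked with the standard library alone. This is a minimal sketch: `naive_cf_width` is a hypothetical helper that illustrates the "most Cf characters get width 0" rule described above. It is not the actual utf8proc or wcwidth logic.

```python
import unicodedata

SOFT_HYPHEN = chr(0x00AD)  # Python 3 spelling of unichr(173)


def naive_cf_width(ch):
    """Hypothetical rule: width 0 for format characters (Cf), else 1.

    Wide, combining, and control characters are deliberately ignored.
    """
    return 0 if unicodedata.category(ch) == "Cf" else 1


# Name and general category as recorded in Python's Unicode database.
print(unicodedata.name(SOFT_HYPHEN), unicodedata.category(SOFT_HYPHEN))
print(naive_cf_width(SOFT_HYPHEN), naive_cf_width("-"))
```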

stevengj commented 9 years ago

In case it is helpful, the draft table of character widths that we are currently planning to use can be found in this CharWidths.txt gist (each line of which is codepoints; width) where non-printing characters are assigned a width of 0. This is generated automatically from the unicode 7 tables combined with font metrics from GNU unifont, as described in JuliaLang/utf8proc#27

jquast commented 9 years ago

Interesting, what terminal are you testing the "character cells consumed when printed" on OSX? I too am using OSX, and on iTerm2 it displays as "a-b", consuming 3 characters, so wcwidth would be correct, here... I would need to see evidence of it not forwarding the cell when printed on at least some terminal emulators, and file bugs for the others. Just to be very clear, the purpose of wcwidth is "printable width on a terminal", and not firefox or anything else (for which such character is hidden).

Also, I don't necessarily trust the OS-provided 'wcwidth', they are typically based on very old (5-10 years old) unicode specifications. I have a program I've tested on osx and linux, both are wildly different, and in each case my version was correct: https://github.com/jquast/wcwidth/blob/master/bin/wcwidth-libc-comparator.py
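
As a rough illustration of such a comparison (not the linked comparator script itself), the C library's `wcwidth` can be called from Python through `ctypes`. The sketch assumes a POSIX system where `ctypes.CDLL(None)` exposes the process's libc symbols. The results depend on the platform and the active locale, so no particular output is implied.

```python
import ctypes
import locale


def libc_wcwidth(ch):
    """Ask the C library loaded into this process for wcwidth(ch)."""
    libc = ctypes.CDLL(None)  # assumption: POSIX system exposing wcwidth()
    libc.wcwidth.argtypes = [ctypes.c_wchar]
    libc.wcwidth.restype = ctypes.c_int
    return libc.wcwidth(ch)


try:
    # wcwidth() is locale-dependent; adopt the user's locale if available.
    locale.setlocale(locale.LC_ALL, "")
except locale.Error:
    pass  # fall back to whatever locale is already active

for cp in (0x00AD, 0x200B, 0x4E2D):
    print(f"U+{cp:04X}: libc wcwidth = {libc_wcwidth(chr(cp))}")
```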

The combining and wide character tables are programmatically updated by "python setup.py update", which is similar to your https://github.com/JuliaLang/utf8proc/pull/27/files#diff-3832b9cfe2fc10d35ac5c63d9b7b8133R20

I don't know of any Unicode specification reference table for 0-width characters, so they are just hardcoded here: https://github.com/jquast/wcwidth/blob/master/wcwidth/wcwidth.py#L161-171

jquast commented 9 years ago

Using the 'Cf' category listings on iTerm2, it appears the following all consume 1 character cell, some with symbols, some simply by blanks

00AD
0600
0601
0602
0603
0604
0605
061C
06DD
070F
200E
200F
202A
202B
202C
202D
202E
2060
2061
2062
2063
2064
2066
2067
2068
2069
206A
206B
206C
206D
206E
206F
FFF9
FFFA
FFFB

And the following consume 0 cells:

180E
200B
200C
200D
FEFF

which may indeed need to be supported by wcwidth once I test a few more terminals.
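
A list like the ones above can be regenerated for other terminals with a short script. The following is a sketch using only the standard library; `cf_codepoints`, `probe_line`, and `print_probes` are hypothetical names, and the set of Cf code points returned depends on the Unicode version bundled with the Python interpreter.

```python
import unicodedata


def cf_codepoints(limit=0x10000):
    """Return code points below `limit` whose general category is Cf."""
    return [cp for cp in range(limit)
            if unicodedata.category(chr(cp)) == "Cf"]


def probe_line(cp):
    """Build a line like 'U+00AD [a<char>b]' for visual inspection."""
    return f"U+{cp:04X} [a{chr(cp)}b]"


def print_probes():
    """Print one probe line per Cf code point (run in a UTF-8 terminal)."""
    for cp in cf_codepoints():
        print(probe_line(cp))


print(len(cf_codepoints()), "Cf code points in the BMP")
```

Calling `print_probes()` in a terminal and checking whether the closing bracket lines up shows which characters consumed a cell.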

stevengj commented 9 years ago

(We don't trust the system-provided wcwidth either, for the same reason as you, which is why we compute the widths independently. However, the OSX 10.10.2 wcwidth agrees with our results when it returns a nonnegative value, so it mostly seems to have errors of omission—it returns -1 for many valid printable characters from recent Unicode standards. Moreover, U+00AD has been part of Unicode since 1993, so I would think that most wcwidth implementations would handle it properly.)

There is an interesting article on the soft hyphen, which apparently has had a controversial history, and is rendered in different ways depending on the font and the rendering system. I'm not sure what the right answer is here, but the Unicode standard seems to somewhat favor the viewpoint that it should be invisible, although it leaves it up to the implementation. However, the article mentions that the Unicode FAQ does say "In a terminal emulation environment, particularly in ISO-8859-1 contexts, one could display the soft hyphen as a hyphen in all circumstances", and maybe that is what is done in practice.

cc: @jiahao and @StefanKarpinski.

stevengj commented 9 years ago

Note that the Arabic characters U+0601 etc. are defined by the Unicode standard as exceptions to the usual rule that Cf characters are invisible:

In contrast, e.g. U+200E is a left-to-right mark, and in my understanding is defined as an invisible formatting character that controls the direction of the text. Some terminals may give it a nonzero width (although the MacOS Terminal with the default font gives it zero width on my machine), but that seems like a bug in the terminal (or the font); it seems like it is better to return what the Unicode standard says rather than propagating a particular buggy implementation.

jiahao commented 9 years ago

I remember that article about the soft hyphen. Under "Modern Unicode semantics" it references UAX 14 for Unicode 7.0.0, §5.4, which says:

Unlike U+2010 hyphen, which always has a visible rendition, the character U+00AD soft hyphen (shy) is an invisible format character that merely indicates a preferred intraword line break position. If the line is broken at that point, then whatever mechanism is appropriate for intraword line breaks should be invoked, just as if the line break had been triggered by another hyphenation mechanism, such as a dictionary lookup.

The description in the following paragraphs suggests that the rendering of a soft hyphen is accomplished not by printing the soft hyphen itself, but rather by inserting an additional, printable hyphen glyph:

The inserted hyphen glyph can take a wide variety of shapes, as appropriate for the situation. Examples include shapes like U+2010 hyphen, U+058A armenian hyphen, U+180A mongolian nirugu, or U+1806 mongolian todo soft hyphen.

Based on this description it would seem that the character U+00AD by itself is nonprintable and should have a width of 0 or -1.

stevengj commented 9 years ago

Interestingly, the Unicode FAQ entry that the SHY article quoted seems to no longer exist. From the passage in the Unicode 7.0.0 standard that @jiahao quoted, it seems the Unicode Consortium decided to put its foot down and declare that the soft hyphen is definitely invisible, ISO 8859-1 be damned.

jquast commented 9 years ago

I really appreciate all of the research, @stevengj and @jiahao.

My decision is to use the common denominator across the most popular terminal emulators for wcwidth. I might make a note of it in the readme that it deviates from the standard, as the primary purpose of this project is how text is displayed by the most common (utf-8 capable) terminal emulators.

I've made a checklist:

Then, test the following and report:

I'm not sure how to gauge the "popular terminal emulators", this is just from memory.

Sidenote: more importantly, how should their weight be factored into wcwidth for any given difference? Perhaps there could be some way to configure how the printable width of such discrepancies is reported when the consumer of wcwidth knows their target audience's emulator. Unfortunately, all such terminals borrow the common value "xterm" or "xterm-256color" for the TERM environment variable. One could use the response to the "answerback sequence" (^E), which at least PuTTY replies to, but I'm afraid that's far out of scope for wcwidth: it would require interaction with a terminal driver.

Finally, we can make a PR and release any update.

jiahao commented 9 years ago

@jquast thanks for your detailed consideration. As you had stated above, iTerm seems to have different needs from us at this point.

jiahao commented 9 years ago

However, I don't think it is possible to provide consistency across terminal environments without considering also the interactions with the choice of users' fonts. Many fonts simply have wrong advance widths for some code points.

Here is a simple rendering test for the fixed-width fonts on my system. Consider

U+003C9 U+00302 = \omega\hat = ω̂

should render with the hat combining character on the omega.

U+00302 U+003C9 = \hat\omega =  ̂ω

should render with a hat to the left of omega.
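
To see how a table-driven width count treats these two sequences, here is a minimal sketch using the standard library's combining-class data. `naive_cells` is a hypothetical helper and assumes every non-combining character occupies exactly one cell, which holds for the letters used here but not for wide characters.

```python
import unicodedata


def naive_cells(text):
    """Count one cell per code point, skipping combining marks.

    Assumption: every non-combining character is one cell wide.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


omega_hat = "\u03c9\u0302"  # omega followed by combining circumflex
hat_omega = "\u0302\u03c9"  # combining circumflex with no base before it

print(naive_cells(omega_hat), naive_cells(hat_omega))
```

Both sequences count as one cell under this rule, so a count of this kind cannot distinguish them. Any difference in how they appear on screen would come from the font and renderer.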

[Screenshots (3) of the rendering test, taken 2015-04-21]

jquast commented 9 years ago

You are correct, but terminal emulators don't typically care, they're the ones who handle the width of "printable cells" -- What is your system, is it a terminal emulator?

jiahao commented 9 years ago

The screenshots I pasted were taken from an IPython notebook rendering test HTML using those fonts. I can see the same spacing issues if I manually change the font in OSX Terminal and generate these characters in the Julia console REPL.

jquast commented 9 years ago

Version wcwidth 0.1.5 which includes better combining character width determination by PR #11 is available on pypi.

A terminal sequence may be emitted to elicit a cursor position report from the terminal emulator.

This can be used to display all questionable characters across different popular font profiles and terminal emulators, and to programmatically determine whether each combination treats such characters as 0 width. A report of the most common discrepancies could then be made, weighing on the side of "most correct" to resolve each one.
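
The cursor position query is the standard Device Status Report sequence `ESC [ 6 n`, to which a terminal replies `ESC [ row ; col R`. A minimal sketch of the parsing side follows; `parse_cpr` and `measured_width` are hypothetical helper names, and the raw-mode terminal I/O needed to actually send the query and read the reply is left out.

```python
import re

CPR_QUERY = b"\x1b[6n"  # Device Status Report: request cursor position
CPR_REPLY = re.compile(rb"\x1b\[(\d+);(\d+)R")


def parse_cpr(reply):
    """Parse a reply of the form ESC [ row ; col R into (row, col)."""
    match = CPR_REPLY.search(reply)
    if match is None:
        raise ValueError(f"not a cursor position report: {reply!r}")
    return int(match.group(1)), int(match.group(2))


def measured_width(col_before, col_after):
    """Cells the cursor advanced while a test string was printed."""
    return col_after - col_before


# Example with canned replies instead of a live terminal.
_, before = parse_cpr(b"\x1b[12;5R")
_, after = parse_cpr(b"\x1b[12;6R")
print(measured_width(before, after))
```

A real probe would write `CPR_QUERY`, print the character under test, write `CPR_QUERY` again, and compare the two reported columns.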

jquast commented 1 year ago

This is closed by https://github.com/jquast/wcwidth/pull/91

avih commented 7 months ago

For reference, in glibc wcwidth(0xad) appears to be 1, judging by this discussion, which concluded that it should be 1: https://sourceware.org/bugzilla/show_bug.cgi?id=22073

That discussion took place in 2017, after the main discussion in this issue but before the latest comment here (https://github.com/jquast/wcwidth/issues/8#issuecomment-1785907194).

In musl libc, wcwidth(0xad) is also 1.

jquast commented 7 months ago

It's a bit ambiguous isn't it? From https://codepoints.net/U+00AD,

is a code point reserved in some coded character sets for the purpose of breaking words across lines by inserting visible hyphens if they are fall on the line end but remain invisible within the line.

I will add a test to ucs-detect and whichever measured width (0 or 1) that is used among the most popular and compliant terminals will be used in this library.

avih commented 7 months ago

For what it's worth, the musl-libc maintainer, @richfelker said on IRC that he thinks it should be 1 because historically it was 1 (in most/all implementations?), and, quoting, "(dalias) unless there's widespread agreement between terminals and wcwidth implementations, all you get by changing it is screen corruption".

Additionally, it was not discussed on the musl mailing lists, possibly because that was acceptable (or no one noticed or cared?).

Additionally, he noted that if anything, it should have probably been -1 and not 0, because if applied, then it affects formatting, not unlike carriage-return or newline or form-feed etc.

And finally, he mentions that "it's widely unused anyway", which is probably true, hence probably not too important overall, though agreement between wcwidth implementations would still be nice.
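
To make the practical difference between 1 and -1 concrete, here is a small sketch of a string-width function in the style of POSIX `wcswidth`, which reports -1 as soon as any character is non-printable. The two per-character rules are hypothetical and differ only in how they treat U+00AD.

```python
def string_width(text, char_width):
    """Sum per-character widths; return -1 if any character reports -1.

    `char_width` is any callable mapping one character to -1, 0, 1 or 2.
    """
    total = 0
    for ch in text:
        width = char_width(ch)
        if width < 0:
            return -1
        total += width
    return total


# Hypothetical per-character rules, differing only for U+00AD.
def shy_as_one(ch):
    return 1


def shy_as_unknown(ch):
    return -1 if ch == "\u00ad" else 1


word = "xxx\u00adyyy"
print(string_width(word, shy_as_one), string_width(word, shy_as_unknown))
```

With -1, a caller can no longer compute a width for any string containing the character and has to handle it separately, much as it would a newline or tab.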

jquast commented 7 months ago

Thanks for relaying @richfelker's thoughts. I'm in full agreement with all of them, especially for -1, as this kind of character is meant to be managed by the terminal emulator and its width is indeterminate (like \n, \t, etc.). But if the most popular terminal emulators measure it as width 1, then I'd like to match.

avih commented 7 months ago

> if the most popular terminal emulators measure it as width of 1 then I’d like to match

Right.

I would guess that terminals measure its width according to the wcwidth implementation which they use? And I would also guess that typically that would be whatever libc provides? (Not including Windows Terminal, which brings its own implementation, because on Windows there's no system wcwidth.)

And so ultimately, I would think the goal should be agreement between wcwidth implementations, rather than between this implementation and the behavior of popular terminal emulators?

stevengj commented 7 months ago

Ultimately, the utf8proc library also decided to report a width of 1 for U+00AD, in order to agree with other wcwidth implementations and with typical terminal programs, which display a soft hyphen as a visible "-" glyph.

avih commented 7 months ago

> I would guess that terminals measure its width according to the wcwidth implementation which they use? And I would also guess that typically that would be whatever libc provides?

Well, that was not a good argument, and I would agree that if this was the only or main wcwidth implementation, then it should try to match the common terminal emulators behavior.

But because this is one of several wcwidth implementations, its goal should be to agree with other wcwidth implementations rather than the terminals.

That being said, it would still be nice to know how terminals handle it.

At which case, the test should be dual:

I would guess that most terminals don't handle it dually as the Unicode semantics suggest (which would imply a -1 wcwidth value), hence they probably treat it as always 1 or always 0, though that's a guess.

avih commented 7 months ago

> At which case, the test should be dual...

So, I tested it in the following terminals on Alpine Linux 3.19.1, and all the tested terminal emulators treat it either as hard 0 or hard 1. I.e. no terminal handles it dually, as 0 in the middle of the line and as hyphen+wordbreak in a word which spills over the end of the line.

Specifically, I tested using this script and observed the result on-screen (not automated). The SHY byte is always in this word: xxx<SHY>yyy.

EDITED: THIS SCRIPT IS BROKEN AND THE RESULTS ARE INVALID. See fixed script at the next post.

test-shy.sh (broken)

```sh
#!/bin/sh

dots() {
    R=
    while [ ${#R} -lt $1 ]; do R=$R.; done
    echo "$R"
}

has() { command -v "$1" >/dev/null; }
nth() { shift $1; printf %s\\n "$1"; }

cols() {
    if [ "${COLUMNS-}" ]; then echo $COLUMNS
    elif has stty; then nth 2 $(stty size)
    elif has ttysize; then nth 1 $(ttysize)
    else echo 80; fi
}

cols=$(cols)

printf "$(dots $cols)\n\n"
printf "SHY mid line: aaa xxx\255yyy bbb\n\n"
printf "no SHY: $(dots $((cols - 16))) aaa xxxyyy bbb\n\n"
printf "SHY before last column: $(dots $((cols - 34))) aaa xxx\255yyy bbb\n\n"
printf "SHY at the last column: $(dots $((cols - 33))) aaa xxx\255yyy bbb\n\n"
```

All the terminals were invoked with UTF-8 locale, e.g.:

LC_ALL=en_US.UTF-8 xterm

Results:

xterm 388, VTE (tested {gnome,xfce4,lx}-terminal), konsole 23.08.4, and st 0.9: always display it as U+FFFD REPLACEMENT CHARACTER, as if wcwidth(0xad) == 1:

[Screenshot: shy-xterm]

urxvt: similar to xterm etc. above, but always displays it as a hyphen, as if wcwidth(0xad) == 1.

alacritty 0.12.3 and kitty 0.31.0: seem to ignore it at the input, as if wcwidth(0xad) == 0:

[Screenshot: shy-alacritty]

So while 1 is common, I don't think it's black and white.

So I would think the goal should be to match other wcwidth implementations, where the value appears to be 1 at least in glibc, musl, and utf8proc.

avih commented 7 months ago

Actually, the test script above is wrong. It printed the byte 0xad (which is invalid UTF-8 sequence) rather than the UTF-8 sequence for U+00AD - which is 0xc2 0xad.
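
The difference between the two byte sequences can be checked quickly in Python. This sketch only illustrates the encoding point made above: a lenient UTF-8 decoder turns the lone byte 0xad into U+FFFD, which is consistent with the replacement character some terminals displayed for the broken script.

```python
SOFT_HYPHEN = "\u00ad"

utf8 = SOFT_HYPHEN.encode("utf-8")
print(utf8)                    # the two bytes the fixed script must emit
print([oct(b) for b in utf8])  # same bytes in the octal form printf uses

lone = b"\xad"                 # what the broken script actually sent
try:
    lone.decode("utf-8")
except UnicodeDecodeError as exc:
    print("invalid UTF-8:", exc.reason)

# A lenient decoder substitutes U+FFFD REPLACEMENT CHARACTER instead.
replaced = lone.decode("utf-8", errors="replace")
print(f"U+{ord(replaced):04X}")
```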

This is the revised script:

fixed test-shy.sh

```sh
#!/bin/sh

sf="\302\255"  # printf fmt of UTF-8 of U+00AD SOFT-HYPHEN

dots() {
    R=
    while [ ${#R} -lt $1 ]; do R=$R.; done
    echo "$R"
}

has() { command -v "$1" >/dev/null; }
nth() { shift $1; printf %s\\n "$1"; }

cols() {
    if [ "$COLUMNS" ]; then echo $COLUMNS
    elif has stty; then nth 2 $(stty size)
    elif has ttysize; then nth 1 $(ttysize)
    else echo 80; fi
}

cols=$(cols)

printf "$(dots $cols)\n\n"
printf "SHY mid line: aaa xxx${sf}yyy bbb\n\n"
printf "no SHY: $(dots $((cols - 16))) aaa xxxyyy bbb\n\n"
printf "SHY before last column: $(dots $((cols - 34))) aaa xxx${sf}yyy bbb\n\n"
printf "SHY at the last column: $(dots $((cols - 33))) aaa xxx${sf}yyy bbb\n\n"
```

And these are the results in the various terminals (kitty doesn't have "kitty" in the title, and xfce4-terminal and gnome-terminal have the same result as lxterminal, as all are VTE-based):

[Screenshot: soft-hyphen-terminals]

Like before, this is on Alpine Linux 3.19.1 with the terminals installed from the distro packages repository, and all terminals were invoked after exporting LC_ALL=en_US.UTF-8.

Results:

avih commented 1 month ago

Here's a summary of the U+00AD SOFT HYPHEN behavior:

Therefore I think it should be added/restored as an overriding exception (return 1 for 0x00ad) to reflect terminals' behavior and align with other wcwidth implementations.

original comment by Markus Kuhn from the linked file:

```c
/* The following two functions define the column width of an ISO 10646
 * character as follows:
 *
 *    - The null character (U+0000) has a column width of 0.
 *
 *    - Other C0/C1 control characters and DEL will lead to a return
 *      value of -1.
 *
 *    - Non-spacing and enclosing combining characters (general
 *      category code Mn or Me in the Unicode database) have a
 *      column width of 0.
 *
 *    - SOFT HYPHEN (U+00AD) has a column width of 1.
 *
 *    - Other format characters (general category code Cf in the Unicode
 *      database) and ZERO WIDTH SPACE (U+200B) have a column width of 0.
 *
 *    - Hangul Jamo medial vowels and final consonants (U+1160-U+11FF)
 *      have a column width of 0.
 *
 *    - Spacing characters in the East Asian Wide (W) or East Asian
 *      Full-width (F) category as defined in Unicode Technical
 *      Report #11 have a column width of 2.
 *
 *    - All remaining characters (including all printable
 *      ISO 8859-1 and WGL4 characters, Unicode control characters,
 *      etc.) have a column width of 1.
...
```
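
The rules in that comment can be approximated in a few lines of Python. This is a sketch only: `kuhn_style_width` is a hypothetical name, and it relies on the Unicode database bundled with the Python interpreter rather than the tables in the original C file, so individual results may differ from that implementation.

```python
import unicodedata


def kuhn_style_width(ch):
    """Approximate the column-width rules listed above for one character."""
    cp = ord(ch)
    if cp == 0:
        return 0                            # null character
    if cp < 0x20 or 0x7F <= cp < 0xA0:
        return -1                           # C0/C1 controls and DEL
    if cp == 0x00AD:
        return 1                            # SOFT HYPHEN exception
    category = unicodedata.category(ch)
    if category in ("Mn", "Me", "Cf") or cp == 0x200B:
        return 0                            # combining and format characters
    if 0x1160 <= cp <= 0x11FF:
        return 0                            # Hangul Jamo medial/final
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2                            # East Asian Wide / Full-width
    return 1


for sample in ("a", "\u00ad", "\u200b", "\u4e2d"):
    print(f"U+{ord(sample):04X}", kuhn_style_width(sample))
```

Note that the U+00AD check has to come before the general Cf check, since the soft hyphen is itself a Cf character.
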
stevengj commented 4 weeks ago

utf8proc now returns 1 as well (https://github.com/JuliaStrings/utf8proc/pull/135).