jquast / wcwidth

Python library that measures the width of unicode strings rendered to a terminal

Should wcwidth have a "Treat ambiguous-width as wide" option? #123

Open keatonLiu opened 5 months ago

keatonLiu commented 5 months ago
import wcwidth

if __name__ == '__main__':
    print(wcwidth.wcswidth("①你好"))
    print(wcwidth.wcswidth("你好啊"))

Results in: image. But ① displays as 2 characters wide in a monospace font: image image

jquast commented 5 months ago

Which terminal emulator are you using in this example?

For iTerm2, this is correct,

image

As well as WezTerm,

image

And also Kitty,

image

keatonLiu commented 5 months ago

Maybe wcwidth only focuses on terminal fonts? I'm using PrettyTable, which depends on wcwidth, to generate a table, and I want to display the table text in a browser. I'm using Chrome, and I found that monospace fonts work fine most of the time. But some Unicode characters display with a different width.

keatonLiu commented 5 months ago

It would be helpful if I could provide the font family and get a more general result. Is that possible?

jquast commented 5 months ago

wcwidth is primarily focused on terminals; that is, if browsers and terminals disagree, we would rather match terminals. Although I would expect a JavaScript or browser-based library that is more focused on browser width to exist, I cannot find one at the moment. Please suggest one if you do.

Browsers are able to communicate directly with the font engine of the operating system, while wcwidth in Python and other languages cannot, so we generally take a more naive approach. And this is probably why most terminals are also wrong in this case while browsers are not.

In this case, the problem with ① (https://codepoints.net/U+2460) is that it is Ambiguous width (https://unicode.org/reports/tr11/#Ambiguous) and,

They have a “resolved” width of either narrow or wide depending on the context of their use.

In the following code blocks I use the same character, one with english letters on the same line,

①2345
12345

and another of your example with your Mandarin Chinese "hello",

①你好
12345

Although they render differently sized, at least on my browser (Firefox 120.0.1), they have approximately the same width. I will say that monospace fonts do not always align vertically in browsers (note how the number '5' does not align in the first example), while they always do in terminals.

Screenshot of the above, image (End screenshot)

It would require more experimentation, but maybe for a page of Chinese locale it would render differently, such as in your original screenshot, I'm not really sure.

In any case, many terminals have an option to display ambiguous-width characters as 2 cells.

I'm not certain, but maybe this option is more frequently used for east-asian language users in terminals?

But it is very problematic: the entire software stack needs to agree to "treat ambiguous width as wide". For example, here is an "$LD_PRELOAD-able library and a wrapper script" that patches POSIX wcwidth for this option and references many issues and bugs about it: https://github.com/fumiyas/wcwidth-cjk

The "Terminal Working Group" tried to come to a consensus about this and other issues, https://gitlab.freedesktop.org/terminal-wg/specifications/-/issues/9#note_406682 -- there was a great deal of discussion but this "Working Group" specifications project has failed to come to any consensus at all on any single issue (the "accepted" folder is empty, 31 open issues)

Maybe this library could also provide such an option, to "treat ambiguous width as wide". I will rewrite this GitHub issue to match that request.
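A minimal sketch of what such an option could look like, using only Python's standard `unicodedata` module. The names `cell_width`, `string_width`, and `ambiguous_is_wide` are hypothetical and are not part of wcwidth's API. The width rules are deliberately simplified (no handling of control characters, emoji sequences, or zero-width joiners), so this is not the algorithm wcwidth uses.

```python
import unicodedata


def cell_width(ch: str, ambiguous_is_wide: bool = False) -> int:
    """Rough width of one character in terminal cells (simplified rules)."""
    # Combining marks attach to the previous character and take no cell.
    if unicodedata.combining(ch):
        return 0
    eaw = unicodedata.east_asian_width(ch)
    # Wide (W) and Fullwidth (F) characters always occupy two cells.
    if eaw in ("W", "F"):
        return 2
    # Ambiguous (A) characters are wide only when the caller asks for it.
    if eaw == "A" and ambiguous_is_wide:
        return 2
    return 1


def string_width(text: str, ambiguous_is_wide: bool = False) -> int:
    """Sum of the per-character widths of text."""
    return sum(cell_width(ch, ambiguous_is_wide) for ch in text)


print(string_width("①你好"))                          # 5
print(string_width("①你好", ambiguous_is_wide=True))  # 6
print(string_width("你好啊"))                          # 6
```

With the flag off, "①你好" measures one cell narrower than "你好啊"; with the flag on, both measure the same, which is the alignment the original report asks for. As noted above, the result is only useful if the display that renders the text also treats ambiguous characters as wide.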

GalaxySnail commented 5 months ago

It's also rendered with a width of 1 in Windows Terminal.

Moreover, it's rendered with a width of 1 in my web browser (Chromium).

I personally agree that "①" should be East Asian Wide, but unfortunately it is East Asian Ambiguous (and a similar character U+2780 is East Asian Neutral). In my opinion, it may need to be addressed in Unicode, but I'm not sure. Unicode is a bit chaotic. ¯\_(ツ)_/¯
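The East Asian Width classes mentioned here can be inspected with Python's standard `unicodedata` module. This is a small sketch for checking the property values; the exact results depend on the Unicode data bundled with the Python build in use.

```python
import unicodedata

# Characters discussed in this thread, plus an ASCII letter for comparison.
samples = ["①", "\u2780", "你", "a"]

# Map each character to its East Asian Width class:
# A = Ambiguous, N = Neutral, W = Wide, Na = Narrow.
classes = {ch: unicodedata.east_asian_width(ch) for ch in samples}

for ch, eaw in classes.items():
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {eaw}")
```

Here U+2460 (①) is reported as `A` and U+2780 as `N`, matching the classifications described above, while 你 is `W`.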

keatonLiu commented 5 months ago

Thank you for so much work! You are very helpful. I tested it in Windows Terminal and got the same result. image

I understand this is because ① is an East Asian Ambiguous character, which is given a different width in different contexts. I agree that wcwidth could have a "treat ambiguous width as wide" option, because in most cases the character displays at the same size as an East Asian character in my locale. You can visit this website for an intuitive demo: https://www.zhonghuazidian.com/zi/%E2%91%A0 In my browser, Chrome: image Even in Word: image I think it will be a wide character if you use a monospace font-family in a browser.

keatonLiu commented 5 months ago

Interesting. Regarding the Firefox rendering comparison: I'm using Chrome, and it displays differently: image