jquast / wcwidth

Python library that measures the width of unicode strings rendered to a terminal

Wrong width for Hindi on macOS, but correct width on Linux #25

Closed tleonhardt closed 11 months ago

tleonhardt commented 6 years ago

I tried using wcwidth to calculate the length of the name for the city of Mumbai in Hindi (बॉम्बे हिंदी)

>>> from wcwidth import wcswidth
>>> wcswidth('बॉम्बे हिंदी')
9

On macOS 10.13.5 using Python 3.6.5, I see a visual width of 5 characters and a calculated width of 9 characters.

On Ubuntu 18.04 using Python 3.6.5, I see a visual width of 9 characters and a calculated width of 9 characters.

Thank you by the way for creating a very useful module!

jquast commented 6 years ago

Thank you very much for the specific example, definitely a bug to try to address!

tleonhardt commented 6 years ago

In case it is important for future troubleshooting, I was using iTerm 2 version 3.1.6 with Bash version 4.4.23

shreevatsa commented 5 years ago

Almost surely, the "correct" width on Linux is because of a broken terminal that does not display the characters correctly. On macOS, the terminal uses the system libraries for text rendering, and properly renders the combining characters. On Linux, the terminal does something horrible.

Here's a much simpler test case:

>>> from wcwidth import wcswidth
>>> wcswidth('का')
2

That's U+0915 DEVANAGARI LETTER KA followed by U+093E DEVANAGARI VOWEL SIGN AA.

It is supposed to be displayed as a single grapheme (e.g. you should not be able to place the cursor between them to type, and definitely you should not see a dotted circle), but on Linux terminals I get very weird results, with क in one cell, and ा in another cell (with the dotted circle).

The wcswidth result of "2" is consistent with this weird result, but obviously incorrect.
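For anyone following along, here is a minimal standard-library sketch that lists the code points in this test string. It only illustrates why counting code points gives 2 for what a reader sees as one grapheme; it is not how wcwidth itself measures width.

```python
import unicodedata

text = '\u0915\u093e'  # का: letter KA followed by vowel sign AA

# One (code point, name, general category) row per character in the string.
info = [
    ('U+%04X' % ord(ch), unicodedata.name(ch), unicodedata.category(ch))
    for ch in text
]

for row in info:
    print(*row)

# Two code points, although a reader sees a single grapheme.
print(len(text))
```

The second code point has category `Mc` (spacing combining mark), which is the kind of character a width function has to decide how to count.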

I imagine it's similarly broken for all Indic scripts. In fact I cannot imagine how something like wcwidth, which only returns integers, is going to work for Indic, Arabic, etc scripts. Searching for [wcwidth indic] brings up some results like https://github.com/mintty/mintty/issues/553 https://github.com/xtermjs/xterm.js/issues/1468 https://github.com/xtermjs/xterm.js/issues/72 and see the mentions of "Indic" at https://www.cl.cam.ac.uk/~mgk25/unicode.html -- looks like a lot of software is broken.

jquast commented 5 years ago

Hello new conversationalist :)

I know the details pretty well, but re-reading the last link, the mentions of “Indic” in there paint a pretty dire picture of “no support”. The terminal landscape is very rich today, but it is mostly based on libc's wcwidth(3), with little font intelligence. Support for combining characters goes further than what has been written here in past years, and it is improving, I suppose because of the emoticon situation: folks want their silly icons aligned.

I personally see Hindi as the go-to test characters. I have a combining-character browser in the repo, which you’re welcome to use or modify to view the effects on “Indic” languages.

looks like a lot of software is broken

Well, we can fix it! This library has seen some light use in the terminal landscape for Python programs, and Python is very readable, so folks may already be copying from us, and I hope they do, into the next wcwidth(). So please join in if you can help :)

I have a scheme for automatically detecting the Unicode support level by introspection, using the terminal's "report cursor position" query.
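A rough sketch of what such a probe could look like, assuming a Unix terminal that answers the standard cursor position request (`ESC [ 6 n`) with a report of the form `ESC [ row ; col R`. The function names here are hypothetical and this is not necessarily the scheme described above: print a string, ask the terminal where the cursor ended up, and compare that with the computed width.

```python
import os
import re
import sys

# A Cursor Position Report looks like ESC [ <row> ; <col> R
CPR = re.compile(r'\x1b\[(\d+);(\d+)R')


def parse_cursor_report(response):
    """Return (row, column) parsed from a cursor position report."""
    match = CPR.search(response)
    if match is None:
        raise ValueError('not a cursor position report: %r' % (response,))
    return int(match.group(1)), int(match.group(2))


def query_cursor_column():
    """Ask the terminal for the cursor column (Unix only, needs a real tty)."""
    import termios
    import tty
    fd = sys.stdin.fileno()
    saved = termios.tcgetattr(fd)
    response = b''
    try:
        tty.setcbreak(fd)            # read the reply without waiting for Enter
        sys.stdout.write('\x1b[6n')  # request a cursor position report
        sys.stdout.flush()
        while not response.endswith(b'R'):
            byte = os.read(fd, 1)
            if not byte:             # end of input: give up
                break
            response += byte
    finally:
        termios.tcsetattr(fd, termios.TCSADRAIN, saved)
    return parse_cursor_report(response.decode('ascii', 'replace'))[1]


def measure_rendered_width(text):
    """Number of columns the cursor advanced while printing text."""
    sys.stdout.write('\r')
    sys.stdout.flush()
    start = query_cursor_column()
    sys.stdout.write(text)
    sys.stdout.flush()
    end = query_cursor_column()
    sys.stdout.write('\r\x1b[K')     # erase the probe text
    sys.stdout.flush()
    return end - start
```

In an interactive terminal, comparing `measure_rendered_width('का')` with `wcswidth('का')` would show whether the terminal and the library agree. A terminal that never answers the query would block this sketch, so a real tool would also need a timeout.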

Anyway, if anyone is interested in taking on the bulk of the work, I will transfer knowledge and always accept test-passing PRs that make sense.

Otherwise I’m FOSS retired and unable to put in the hours, sorry and good luck

jquast commented 4 years ago

Wow, this really screws up in iTerm2, which is one of the better terminals for these things, especially when navigating a cursor around the text. In iTerm2, it is displayed in 1 cell, not 2, but wcwidth determines 2 for this example.

Zarainia commented 3 years ago

I am having similar problems with various WSL terminals. Gnome Terminal under an X server or WSLg displays (Latin letter a + combining accent) as width 1, which is what wcwidth returns. When entering WSL from Windows, Windows Terminal and the default bash terminal display it with width 2, while Hyper and Terminus display it as width 1. Yet ucs-detect detects the Unicode version as 13.0.0 for all of them, despite the differing results.

dscrofts commented 2 years ago

I too am having issues with this. Here's another specific example:

>>> from wcwidth import wcswidth
>>> wcswidth("चाइनीज")
6

When viewing this using the Kitty terminal, the text only occupies 4 columns, not 6. This seems to be a problem in general with most Hindi text that I've encountered.

Tested on macOS Big Sur 11.6.5 and Kitty 0.24.4

jquast commented 11 months ago

The issue with zero-width characters used in the Hindi language has been resolved in today's release by #91.

I created a testing tool that verifies this: at least for the Hindi text of the "Universal Declaration of Human Rights" from https://unicode.org/udhr/, wcwidth now agrees with the "kitty" and "mlterm" terminals on the measurement of every word.

dscrofts commented 10 months ago

Just found another issue (perhaps a corner-case?) with Hindi:

>>> from wcwidth import wcswidth
>>> wcswidth("गीत")
3

This sequence should only occupy 2 cells.

jquast commented 10 months ago

@dscrofts thank you for your persistence, I really do appreciate your help with Hindi!

Can you please check that your version of wcwidth is the latest, 0.2.12? This is measured as 2 in my tests.

Just to be sure, here is my test session:

>>> import unicodedata, wcwidth
>>> wcwidth.__version__
'0.2.12'
>>> l='गीत'
>>> wcwidth.wcswidth(l)
2
>>> ', '.join([unicodedata.name(x).title() for x in l])
'Devanagari Letter Ga, Devanagari Vowel Sign Ii, Devanagari Letter Ta'
>>> [unicodedata.category(x) for x in l]
['Lo', 'Mc', 'Lo']

dscrofts commented 9 months ago

@jquast Thanks for the quick response!

So I ran your code and got identical output. I think there was an issue copy/pasting the characters in my terminal that led to the wrong output!

After further investigation, it looks like my issue is rather with truncating the text. Specifically, if it is in the middle of a Hindi sequence. I try to left justify the text to a given width, but it seems to be breaking. This might be a good candidate for https://github.com/jquast/wcwidth/issues/93 to implement. I did see there was support for this previously? Any hints as to how I might go about handling/implementing this?

jquast commented 9 months ago

it looks like my issue is rather with truncating the text. Specifically, if it is in the middle of a Hindi sequence

Yes, there would be problems with breaking up a sequence that contains combining characters. It sounds like you are not writing a "left justify" function, but maybe a text wrapping function?

If I write a "wc_ljust" function modeled on the one in the readme, there is no opportunity for truncation. It only appends spaces to fill to the given width; like the built-in str.ljust() or string formatting such as f'{var:<10}', it does not truncate text:

        from wcwidth import wcswidth

        def wc_ljust(text, length, padding=' '):
            # pad on the right up to `length` cells; never truncates `text`
            return text + padding * max(0, (length - wcswidth(text)))

The Python textwrap module tries to break strings only at whitespace, but its default argument break_long_words=True also allows it to break very long words into pieces. Python's textwrap makes no effort to take combining characters into account; it does not understand wide or zero-width characters at all.
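A small illustration of that behaviour using only the standard library. The width of 1 is deliberately tiny so that the word counts as "long"; it is a contrived setting, not something taken from the report above.

```python
import textwrap

word = '\u0915\u093e'  # का: letter KA followed by vowel sign AA

# Default break_long_words=True: textwrap counts code points, so the
# vowel sign is separated from the letter it belongs to.
split = textwrap.wrap(word, width=1)

# break_long_words=False: the word is kept whole, even though it is
# "longer" than the requested width.
kept = textwrap.wrap(word, width=1, break_long_words=False)

print(len(split), len(kept))
```

With the default setting the result has two pieces, the second being a lone vowel sign, which is what a terminal typically draws with a dotted circle.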

I did see there was support for this previously?

We have only ever provided the "wc_rjust" example in the readme file for this project.

In issue #93 I am referring to a terminal library of mine, blessed, that has these functions (ljust, rjust, center, and wrap). I think all of them would handle Hindi correctly, provided you pass break_long_words=False to wrap() so that it does not break sequences at combining characters, if you just want to try it or copy from it. Please be aware that it will fail with ZWJ emojis and emojis with VS-16 sequences. An example:

>>> import blessed, wcwidth
>>> inp='क़ानून की निग़ाह में सभी समान हैं और सभी बिना भेदभाव के समान क़ानूनी सुरक्षा के अधिकारी हैं । यदि इस घोषणा का अतिक्रमण करके कोई भी भेद-भाव किया जाया उस प्रकार के भेद-भाव को किसी प्रकार से उकसाया जाया, तो उसके विरुद्ध समान संरक्षण का अधिकार सभी को प्राप्त है ।'
>>> lines=blessed.Terminal().wrap(inp, 4, break_long_words=False)
>>> print('-|-'.join(lines))  # display word break locations with '-|-'
क़ानून-|-की-|-निग़ाह-|-में सभी-|-समान-|-हैं और-|-सभी-|-बिना-|-भेदभाव-|-के-|-समान-|-क़ानूनी-|-सुरक्षा-|-के-|-अधिकारी-|-हैं ।-|-यदि-|-इस-|-घोषणा-|-का-|-अतिक्रमण-|-करके-|-कोई भी-|-भेद-भाव-|-किया-|-जाया-|-उस-|-प्रकार-|-के-|-भेद-भाव-|-को किसी-|-प्रकार-|-से-|-उकसाया-|-जाया,-|-तो-|-उसके-|-विरुद्ध-|-समान-|-संरक्षण-|-का-|-अधिकार-|-सभी को-|-प्राप्त-|-है ।
>>> print(list(map(wcwidth.wcswidth, lines))) # display length of each line
[3, 1, 3, 4, 3, 4, 2, 2, 4, 1, 3, 3, 4, 1, 4, 3, 2, 2, 3, 1, 6, 3, 4, 5, 2, 2, 2, 4, 1, 5, 4, 4, 1, 4, 3, 1, 3, 4, 3, 5, 1, 4, 4, 4, 3]

When asked to wrap at a width of 4 with break_long_words=False, the blessed.Terminal.wrap() function will not attempt to break those words any shorter, so it never truncates a word, in particular not at a combining mark.

dscrofts commented 9 months ago

@jquast you are absolutely correct, I should be using text wrapping instead of ljust. In fact, a combination of both is what I need to have things line up correctly. Funnily enough I am already using blessed in my project, so now I ljust the wrap()'d text and all is working great. Thanks for the help and all your hard work with wcwidth and blessed :)
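For readers who want to see the shape of "wrap, then ljust" without installing anything, here is a rough standard-library sketch. `approx_width` and `wrap_and_ljust` are hypothetical names, and `approx_width` is only an approximation (combining marks and format characters count as 0 cells, East Asian wide and fullwidth characters as 2); it is not the wcwidth or blessed implementation, which should be preferred in real code.

```python
import textwrap
import unicodedata


def approx_width(text):
    """Approximate number of terminal cells occupied by text."""
    width = 0
    for ch in text:
        if unicodedata.category(ch) in ('Mn', 'Mc', 'Me', 'Cf'):
            continue  # combining marks and format characters: 0 cells
        if unicodedata.east_asian_width(ch) in ('W', 'F'):
            width += 2  # wide and fullwidth characters: 2 cells
        else:
            width += 1
    return width


def wrap_and_ljust(text, width):
    """Wrap at whitespace only, then pad every line to `width` cells."""
    lines = textwrap.wrap(text, width, break_long_words=False)
    return [line + ' ' * max(0, width - approx_width(line)) for line in lines]


padded = wrap_and_ljust('\u0917\u0940\u0924 \u0915\u093e', 4)  # 'गीत का'
for line in padded:
    print(repr(line), approx_width(line))
```

Note that textwrap still decides where to wrap by counting code points, so lines may come out shorter than necessary; the point of the sketch is that words are never cut inside a sequence and every line is padded to the same number of cells.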