jquast / wcwidth

Python library that measures the width of unicode strings rendered to a terminal

Bengali / Khmer / Gujarati / Odia / Hindi regression? #126

Closed masaccio closed 4 months ago

masaccio commented 4 months ago

I recently updated from 0.2.6 to 0.2.13 and I have some tests breaking in a package that uses wcswidth. The following test fails every check in 0.2.13 but passes in 0.2.6:

from wcwidth import wcswidth

def check(length, value):
    if wcswidth(value) == length:
        print(value, "OK")
    else:
        print(value, "FAIL")

check(10, '"আবখাজিয়া"')
check(12, '"আফগানিস্তান"')
check(10, '"আলবেনিয়া"')
check(10, '"अबख़ाज़िया"')
check(11, '"ඇෆ්ගනිස්ථානය"')
check(10, '"ඇල්බේනියාව"')
check(12, '"अफ़ग़ानिस्तान"')
check(12, '"ஆக்கானித்தான்"')
check(14, '"អាហ្វហ្គានីស្ថាន"')
check(12, '"અફગાનિસ્તાન"')
check(9, '"અલ્બેનિયા"')
check(12, '"ଆଫାଗାନିସ୍ତାନ୍"')
check(13, '"गन्धार, अश्वक"')
check(10, '"अल्बानिया"')

Aligning some ASCII text in my terminal, I believe that the check lengths are correct:

[Screenshot 2024-05-15 at 15 13 55]

masaccio commented 4 months ago

Looks like 0.2.9 is where the change happened.

jquast commented 4 months ago

@masaccio which terminal?

jquast commented 4 months ago

https://github.com/jquast/wcwidth/pull/91#issuecomment-1785693243 related issue and comment about the change

masaccio commented 4 months ago

This was iTerm2 on a Mac with TERM=xterm-256color

masaccio commented 4 months ago

This is the text I expect to see aligned - https://raw.githubusercontent.com/masaccio/compact-json/main/tests/data/test-issue-4.ref-1.json

Though in my browser it's not aligned so I don't know what the right answer is.

jquast commented 4 months ago

I will say that I also use iTerm2, and that it is not a great indicator of multilanguage support. I have since authored a testing and reporting tool, ucs-detect, and have published results for ~27 terminals.

The following terminals match this library's measurements for Hindi:

The other ~23 terminals, including iTerm2, do not. iTerm2 gets an overall "B" rating for its LANG score, while the ones listed above get A's.

Some of these are systematic errors, and I may file bug reports with the respective projects. However, scripts such as Devanagari (used for Hindi) make heavy use of combining characters (category codes Mc and Mn), and strictly following the Unicode specification, as these 4 terminals and this library do, may "squeeze" the text so much that it becomes totally illegible!
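
To see how many combining characters such a word contains, the standard `unicodedata` module can report each character's general category. This is only a sketch: `category_counts` is a hypothetical helper name, not part of wcwidth, and it says nothing about how any terminal actually renders the text.

```python
import unicodedata
from collections import Counter


def category_counts(text):
    """Count the characters of `text` by Unicode general category (Lo, Mn, Mc, ...)."""
    return Counter(unicodedata.category(ch) for ch in text)


if __name__ == "__main__":
    word = "अल्बानिया"  # one of the Hindi strings from the report above
    for ch in word:
        # Code point, general category, and official character name.
        print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {unicodedata.name(ch, '?')}")
    print(dict(category_counts(word)))
```

Characters reported as Mn (nonspacing mark) or Mc (spacing mark) are the combining characters in question; Lo marks the base letters.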

As for what you saw in the browser: I have found that browsers do not make the effort to align by column as a terminal is expected to (see screenshots in https://github.com/jquast/wcwidth/issues/123#issuecomment-2028115594).

I have authored a dummy "check" function to display a sequence where the '|' characters should align:

import wcwidth

def check(n, phrase):
    print('|' + (' ' * wcwidth.wcswidth(phrase)) + '|\n' + '|' + phrase + '|\n')

And these are the results for iTerm (left) and WezTerm (right)

image

I don't know Devanagari well enough to say for sure, but I would say that iTerm2 appears to fail to correctly combine characters of category Mc and Mn, while WezTerm does combine them but sometimes also reduces the font size to accommodate their expected width, and some combining characters may also be poorly aligned.
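
The disagreement can be sketched with two simplified counting models, using only the standard library. `naive_width` is a hypothetical helper for illustration, not the wcwidth implementation: it assumes one column per character unless the character's category is in the given zero-width set, and it ignores wide characters, control characters, and emoji sequences.

```python
import unicodedata


def naive_width(text, zero_width_categories):
    """Count one column per character, skipping the given Unicode categories.

    Illustration only: real width tables also handle wide (East Asian)
    characters, control characters, and zero-width joiner sequences.
    """
    return sum(
        0 if unicodedata.category(ch) in zero_width_categories else 1
        for ch in text
    )


if __name__ == "__main__":
    phrase = '"अल्बानिया"'  # quoted Hindi string from the original report
    # Model A: only nonspacing marks (Mn) occupy no column.
    print(naive_width(phrase, {"Mn"}))
    # Model B: both nonspacing (Mn) and spacing (Mc) marks occupy no column.
    print(naive_width(phrase, {"Mn", "Mc"}))
```

For this phrase (five base letters, one Mn, three Mc, and two ASCII quotes), model A counts 10 columns, which is the length expected in the original report, while model B counts 7. Which model a given terminal or library version follows is not something this sketch can establish.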

masaccio commented 4 months ago

Thanks for the comprehensive debug. I can see I'm staring down a large rabbit hole of encodings I don't understand, so I'll step away! WezTerm does indeed agree with your library (though not when editing in vim), and that is enough for me.