term.ljust, center etc. incorrect for sequences containing U+FE0F (Variation Selector-16)

jquast / blessed

Blessed is an easy, practical library for making python terminal apps

http://pypi.python.org/pypi/blessed

MIT License


term.ljust, center etc. incorrect for sequences containing U+FE0F (Variation Selector-16) #267

Open dscrofts opened 8 months ago

dscrofts commented 8 months ago

Example:

from blessed import Terminal

term = Terminal()
strings = ["123", "456", "🗣️  "]

print("with term.ljust:")
for string in strings:
    print(f"{term.ljust(string, 5)} 1")

print("without term.ljust:")
for string in strings:
    print(f"{string:<5} 1")

Output (term.ljust adds one additional cell):

with term.ljust:
123   1
456   1
🗣️     1
without term.ljust:
123   1
456   1
🗣️    1

However, this is not consistent across all Unicode sequences. For example, changing strings to ["123", "456", "🤔 "] gives:

Output (term.ljust padding is correct):

with term.ljust:
123   1
456   1
🤔    1
without term.ljust:
123   1
456   1
🤔     1

jquast commented 8 months ago

Hello, thanks for the report.

I was aware of this issue but there was no bug to track it. I could probably add a simple workaround here in blessed so I will try to do that soon.

I recently added support for Variation Selector-16 (U+FE0F) to wcwidth. But the way blessed uses this library still gets the calculation wrong: it calls wcwidth.wcwidth() on each individual codepoint and adds the results together.
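To illustrate why summing per-codepoint widths miscounts this string, here is a minimal sketch using only the standard library. The helper `codepoint_width` is a hypothetical, simplified stand-in for wcwidth.wcwidth() (it does not use wcwidth's tables), and treating any narrow "Symbol, other" character followed by U+FE0F as 2 cells is an assumption made for illustration, not how blessed or wcwidth is implemented.

```python
import unicodedata

VS16 = "\ufe0f"  # VARIATION SELECTOR-16

def codepoint_width(ch):
    """Hypothetical, simplified stand-in for wcwidth.wcwidth()."""
    # Non-spacing marks, enclosing marks and format characters take no cells.
    if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0
    # East Asian Wide and Fullwidth characters take two cells.
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1

def naive_width(text):
    """Add up the width of each codepoint in isolation."""
    return sum(codepoint_width(ch) for ch in text)

def vs16_aware_width(text):
    """Like naive_width, but VS-16 widens a preceding narrow symbol.

    Assumption: any narrow character of category "So" followed by U+FE0F
    is shown as a 2-cell emoji. A real implementation would check the
    Unicode list of valid emoji variation sequences instead.
    """
    total = 0
    prev = ""
    for ch in text:
        if (ch == VS16 and prev
                and unicodedata.category(prev) == "So"
                and codepoint_width(prev) == 1):
            total += 1  # promote the previous symbol from 1 cell to 2
        else:
            total += codepoint_width(ch)
        prev = ch
    return total

speaking_head = "\U0001F5E3\uFE0F  "  # the string from the report
thinking_face = "\U0001F914 "         # no VS-16, already a wide character

print(naive_width(speaking_head), vs16_aware_width(speaking_head))
print(naive_width(thinking_face), vs16_aware_width(thinking_face))
```

Under these assumptions the per-codepoint sum for the reported string is one cell short, so padding to 5 cells adds one space too many. The 🤔 string has the same width either way, which matches the report.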

I might:

- add the ability to interpret terminal sequences directly to the wcwidth library, so that blessed can offload this work to it, see https://github.com/jquast/wcwidth/issues/93
- or add "grapheme clustering" functionality to wcwidth for blessed to use
- or just make blessed do the "grapheme clustering" necessary to account for these correctly

Correct accounting for emoji that include U+FE0F is difficult: only 7 terminals supported it at last check. I wrote more about it at https://www.jeffquast.com/post/ucs-detect-test-results/ and I've also gotten pushback from the author of libvte, which is used in terminals like GNOME's: they refuse to support it at all, see https://gitlab.gnome.org/GNOME/vte/-/issues/2580 for details. So I've been a bit distracted trying to get terminal emulators to support it rather than having blessed support it, but I will definitely get to it soon.

jquast commented 8 months ago

Also, I could tell that this string included U+FE0F by running the following commands:

>>> import unicodedata
>>> list(map(unicodedata.name, '🗣️  '))
['SPEAKING HEAD IN SILHOUETTE', 'VARIATION SELECTOR-16', 'SPACE', 'SPACE']
>>> list(map(hex, map(ord, '🗣️  ')))
['0x1f5e3', '0xfe0f', '0x20', '0x20']

jquast commented 8 months ago

I should also add that Python's built-in formatting gets this horribly wrong. It isn't aware of emoji, terminal sequences, or even basic East Asian characters like Chinese or Japanese, but in your case it just happens to get it right by accident :)
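As a rough illustration of that point, the sketch below compares Python's built-in `<` format padding, which counts codepoints, with a hypothetical `ljust_cells` helper that pads by terminal cells. The `cell_width` function is a simplified assumption built on the standard library's `unicodedata` module; it is not the algorithm used by blessed or wcwidth.

```python
import unicodedata

def cell_width(text):
    """Approximate the number of terminal cells text occupies (simplified)."""
    total = 0
    for ch in text:
        # Zero-width: non-spacing marks, enclosing marks, format characters.
        if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue
        # East Asian Wide and Fullwidth characters occupy two cells.
        total += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return total

def ljust_cells(text, width, fillchar=" "):
    """Hypothetical helper: left-justify by cells rather than codepoints."""
    return text + fillchar * max(0, width - cell_width(text))

cjk = "\u4e2d\u6587"          # two Chinese characters, 2 cells each
builtin = f"{cjk:<5}"         # pads to 5 codepoints: 2 characters + 3 spaces
aware = ljust_cells(cjk, 5)   # pads to 5 cells: 2 characters + 1 space

print(len(builtin), cell_width(builtin))
print(len(aware), cell_width(aware))
```

The built-in result is 5 codepoints long but spans 7 cells, so columns drift to the right. The cell-aware version spans exactly 5.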

I wrote an issue about what it might take to get Python's built-in formatting to account for emoji correctly: https://github.com/jquast/wcwidth/issues/94

jquast commented 5 months ago

Just to add, I added some tests around ZWJ in #275, which show that blessed gets those wrong too. I will continue to work towards a solution. I think the wcwidth library needs a kind of iterative parser to solve this correctly in a way that can be integrated into blessed.
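For readers unfamiliar with the ZWJ case, the following sketch shows one possible shape of such an iterative pass. It is an assumption made for illustration, not the design proposed for wcwidth: it walks the string once and treats the codepoint after each ZERO WIDTH JOINER (U+200D) as part of the previous glyph. `codepoint_width` is again a hypothetical, simplified stand-in for wcwidth.wcwidth().

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER

def codepoint_width(ch):
    """Hypothetical, simplified stand-in for wcwidth.wcwidth()."""
    if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1

def naive_width(text):
    """Add up the width of each codepoint in isolation."""
    return sum(codepoint_width(ch) for ch in text)

def zwj_aware_width(text):
    """Single pass that folds ZWJ-joined codepoints into one glyph.

    Assumption: a terminal that supports ZWJ sequences draws the whole
    sequence in the cells of its first codepoint.
    """
    total = 0
    joined = False  # True when the previous codepoint was a ZWJ
    for ch in text:
        if joined:
            joined = False  # this codepoint merges into the previous glyph
            continue
        if ch == ZWJ:
            joined = True
            continue
        total += codepoint_width(ch)
    return total

# MAN + ZWJ + WOMAN + ZWJ + GIRL, a single "family" glyph where supported
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"

print(naive_width(family), zwj_aware_width(family))
```

Summing per codepoint gives 6 cells for this sequence, while the joined glyph occupies 2 on a terminal that supports it. A production parser would also need to handle terminal escape sequences and variation selectors in the same pass.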