selectel / pyte

Simple VTXXX-compatible linux terminal emulator
http://pyte.readthedocs.org/
GNU Lesser General Public License v3.0

Strange regression with some unicode characters (e.g. with the Russian Н) #65

Closed (chubin closed this issue 7 years ago)

chubin commented 7 years ago

pyte 0.6 has a strange regression with some Unicode characters, particularly with the Russian "Н" character:

That works:

$ cat regression.py
# vim: encoding=utf-8
import sys
import pyte

text = "Русский текст"
screen = pyte.screens.Screen(20, 1)
stream = pyte.streams.ByteStream()
stream.attach(screen)
stream.feed(text)

for line in screen.buffer:
    for x in line:
        sys.stdout.write(x.data)
    sys.stdout.write("\n")

$ python regression.py
Русский текст

That does not work:

$ cat regression.py
# vim: encoding=utf-8
import sys
import pyte

text = "Нерусский текст"
screen = pyte.screens.Screen(20, 1)
stream = pyte.streams.ByteStream()
stream.attach(screen)
stream.feed(text)

for line in screen.buffer:
    for x in line:
        sys.stdout.write(x.data)
    sys.stdout.write("\n")

$ python regression.py

$

As you can see, the output is empty in the second example (where the printed text contains "Н").

Everything works fine with the 0.5.x version of the module.

Another problematic character: the Greek letter Ν

Some other broken characters:

ț \u021b
ȝ \u021d
ɛ \u025b
ɝ \u025d
ʛ \u029b
ʝ \u029d
̛ \u031b
̝ \u031d
͛ \u035b
͝ \u035d
Λ \u039b
Ν \u039d
Л \u041b
Н \u041d
ћ \u045b
ѝ \u045d
қ \u049b
ҝ \u049d
ԛ \u051b
ԝ \u051d
՛ \u055b
՝ \u055d
֛ \u059b
֝ \u059d

Code points ending in 1b, 1d, 5b, 5d, 9b, or 9d seem to be the root of the problem.

superbobry commented 7 years ago

Thanks for reporting! This could be related to #62. Will investigate further.

chubin commented 7 years ago

I have found a new group of evil characters. Unfortunately, this group seems to have nothing in common with the former group:

҃ \u0483
҄ \u0484
҅ \u0485
҆ \u0486
҇ \u0487

superbobry commented 7 years ago

The issue indeed has the same cause as #62. All of the characters you've listed contain some control bytes when UTF-8 encoded, e.g.

>>> "Н".encode("utf-8")
b'\xd0\x9d'  # \x9d is OSC
>>> "қ".encode("utf-8")
b'\xd2\x9b'  # \x9b is CSI
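
The same check can be run over any character. The snippet below is a minimal sketch, not pyte code: `C1_CONTROLS` and `c1_bytes` are hypothetical names, and only the two control bytes mentioned above are covered. It also shows why code points ending in 1b, 5b, and 9b behave alike: in a two-byte UTF-8 sequence the last byte is `0x80` plus the low six bits of the code point, and those low six bits are `0x1b` in all three cases.

```python
# The two 8-bit control bytes named in this thread (illustrative, not exhaustive).
C1_CONTROLS = {0x9B: "CSI", 0x9D: "OSC"}


def c1_bytes(char):
    """Return the names of the control bytes found in char's UTF-8 encoding."""
    return [C1_CONTROLS[b] for b in char.encode("utf-8") if b in C1_CONTROLS]


# U+041D (Н), U+049B (қ), U+039B (Λ) are reported broken; U+0420 (Р) works.
for ch in "\u041d\u049b\u039b\u0420":
    print("U+%04X" % ord(ch), ch.encode("utf-8"), c1_bytes(ch))
```

None of the characters in "Русский текст" trip the check, which matches the first example in the report printing correctly.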

chubin commented 7 years ago

Of course they do. I listed some of them with their codes, and they indeed contain 9d and 9b, as you can see. On the other hand, the last block lists another group of characters that contain neither 9d nor 9b. That seems to be a separate problem.

superbobry commented 7 years ago

The new "unprintable" group seems to be related to the way we do Unicode normalization as all of them (I think) are combining characters.
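
Whether the second group consists of combining characters can be checked with the standard `unicodedata` module. This is only a quick sketch to confirm the "I think" above; `is_combining` is a hypothetical helper, and it says nothing about how pyte itself normalizes input.

```python
import unicodedata


def is_combining(char):
    """True if char has a non-zero canonical combining class."""
    return unicodedata.combining(char) != 0


# The second group from the report: U+0483 through U+0487.
for cp in range(0x0483, 0x0488):
    ch = chr(cp)
    print("U+%04X" % cp, unicodedata.category(ch), is_combining(ch))
```

A combining mark has no cell of its own on screen, so an emulator that stores one character per cell has to decide where to put it, which is a different problem from the control-byte collision.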

chubin commented 7 years ago

What do you think: is there any chance the bug will be fixed in the next few weeks? Or should I downgrade pyte and use 0.5.2? Can I help somehow?

superbobry commented 7 years ago

The bug is a consequence of delegating input decoding to Screen (see febdad70ba4b0eec509e1cf10d9ed2d9fb284e85). I am currently thinking about how best to approach this, and I can't guarantee a fix will arrive soon.

If you have any ideas, feel free to share them here.
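
To illustrate why the order of decoding and control detection matters, here is a minimal sketch. It is not pyte's code and not necessarily the approach the fix takes; both function names are hypothetical. It compares scanning raw bytes for `0x9B` with decoding UTF-8 first and then looking for the code point U+009B.

```python
import codecs

CSI_BYTE = 0x9B  # CSI as a single 8-bit control byte


def controls_in_raw_bytes(data):
    """Naive scan: report every 0x9B byte, without decoding first."""
    return [i for i, b in enumerate(data) if b == CSI_BYTE]


def controls_after_decoding(data):
    """Decode UTF-8 first, then look for the CSI code point U+009B."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    text = decoder.decode(data, final=True)
    return [i for i, ch in enumerate(text) if ch == "\u009b"]


data = "\u049b".encode("utf-8")       # қ, encoded as b'\xd2\x9b'
print(controls_in_raw_bytes(data))    # [1]: the continuation byte looks like CSI
print(controls_after_decoding(data))  # []: no control character after decoding
```

An incremental decoder is used because a terminal stream arrives in chunks that may split a multi-byte sequence; in a real stream the decoder would be kept between calls and fed with `final=False`.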

chubin commented 7 years ago

I can try to find some other broken characters if that would help.

superbobry commented 7 years ago

Don't worry, the ones you have already come up with are enough.

chubin commented 7 years ago

Is there any news about the issue? The problem is that many Japanese/Chinese characters are also corrupted. There are some simple workarounds for Cyrillic/Greek, but things get worse with the East Asian languages. So the issue is a real blocker for pyte 0.6 usage in a multilingual environment.
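
A rough way to gauge how much of the CJK range could be hit is sketched below. It assumes the corruption has the same cause as the Cyrillic and Greek cases, a `0x9B` or `0x9D` byte inside the UTF-8 encoding, which this thread does not confirm for CJK. `has_c1_byte` is a hypothetical helper.

```python
C1_BYTES = {0x9B, 0x9D}  # CSI and OSC as single 8-bit control bytes


def has_c1_byte(char):
    """True if char's UTF-8 encoding contains a 0x9B or 0x9D byte."""
    return any(b in C1_BYTES for b in char.encode("utf-8"))


# CJK Unified Ideographs block, U+4E00..U+9FFF.
cjk = [chr(cp) for cp in range(0x4E00, 0xA000)]
affected = [ch for ch in cjk if has_c1_byte(ch)]

print(len(affected), "of", len(cjk), "code points contain a 0x9B or 0x9D byte")
print(["U+%04X" % ord(ch) for ch in affected[:5]])
```

In three-byte sequences either of the two continuation bytes can collide with a control byte, so there is no simple "last two hex digits" rule as there is for the two-byte Cyrillic and Greek cases.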

superbobry commented 7 years ago

I am still thinking about how to implement this without making the code too much of a nightmare. I have a prototype in a local branch, but it is not finished yet. Most likely I won't have much time to work on this further until next weekend, so if you have any ideas, feel free to post them here or submit a PR.

> So the issue is a real blocker for pyte 0.6 usage in a multilingual environment

Yes, I understand it is critical, but 0.6.0 has not been released, so I'd suggest using the latest stable version if you're after correctness.

chubin commented 7 years ago

I confirm the problem is fixed now! @superbobry you are a genius! Thank you very much!

superbobry commented 7 years ago

Haha, thanks! Glad it works for you :)