Weirdness When Reviewing Non-ASCII Unicode Characters

This was discovered on Linux; I have no idea if it also affects Mac.

Sometimes tdsr or pyte seems to interpret console output as something other than UTF-8. To reproduce, start Python and print a non-ASCII character:

$ python3

print('\xe4')

You'll hear tdsr say "a umlaut", and that's what was printed. Now use the review keys to review the line of output, and you'll hear something completely different: sigma. It turns out that the byte 0xe4 is capital sigma (Σ) in the old CP-437 character set.

print('\u0134')

You'll hear "j circumflex", which is what was printed. Review the line of output by character, and capital J circumflex is indeed what is there.

So it's as though, for Unicode code points below 0x100, pyte (or something else) treats the code point's single byte as a character in CP-437 and then converts that to UTF-8 to be spoken, whereas code points >= 0x100 are handled properly.
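To make the guess concrete, here is a minimal sketch of the mapping I suspect is happening. It is only a model of the symptom, not code from tdsr or pyte; the function name suspected_review_char is made up for illustration, and I haven't confirmed where the real conversion takes place.

```python
import unicodedata


def suspected_review_char(ch):
    """Hypothetical model of the symptom; not taken from tdsr or pyte."""
    code_point = ord(ch)
    if code_point < 0x100:
        # Assumption: the code point is treated as one byte and decoded as CP-437.
        return bytes([code_point]).decode("cp437")
    # Code points >= 0x100 appear to pass through unchanged.
    return ch


for ch in ("\xe4", "\u0134"):
    seen = suspected_review_char(ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> "
          f"U+{ord(seen):04X} {unicodedata.name(seen)}")
```

Under this model, a umlaut (U+00E4) comes back as Greek capital sigma (U+03A3), while capital J circumflex (U+0134) is left alone, which matches what I hear when reviewing.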
