microsoft / terminal

The new Windows Terminal and the original Windows console host, all in the same place!
MIT License

Incorrect encoding although code page is correct #16708

Open cktgh opened 8 months ago

cktgh commented 8 months ago

Windows Terminal version

1.18.10301.0

Windows build number

10.0.19045.0

Other Software

PowerShell 7.4.1
Diagon (diagon.exe from diagon-1.1.156-win64.zip)

Steps to reproduce

This is an issue that occurs on Terminal but not Conhost.

Executing the command diagon Math -- "1+1/2 + sum(i,0,10) = 112/2" would output:

        10         
        ___        
    1   ╲       112
1 + ─ + ╱   i = ───
    2   ‾‾‾      2 
         0         

The output is correct on UTF-8 (65001) and would be broken on "Multilingual (Latin I)" (850).

With the incorrect encoding, the output looks like this:

        10         
        ___        
    1   Ôò▓       112
1 + ÔöÇ + Ôò▒   i = ÔöÇÔöÇÔöÇ
    2   ÔÇ¥ÔÇ¥ÔÇ¥      2 
         0         
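The garbled characters above are consistent with diagon's UTF-8 bytes being interpreted as code page 850. A minimal Python sketch of that effect follows; it is not part of the original report, and it assumes Python's built-in `cp850` codec matches the Windows code page closely enough for these characters. The helper name `misread_as_cp850` is made up for illustration.

```python
def misread_as_cp850(text: str) -> str:
    """Encode text as UTF-8, then decode those bytes as code page 850."""
    return text.encode("utf-8").decode("cp850")


# Box-drawing characters taken from the correct output above.
samples = ["╲", "─", "╱", "‾"]

for ch in samples:
    utf8_hex = ch.encode("utf-8").hex(" ")
    print(f"{ch} -> {utf8_hex} -> {misread_as_cp850(ch)}")
```

Each three-byte UTF-8 sequence turns into three unrelated code page 850 characters, which is the pattern visible in the broken output.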

Expected Behavior

Output should be correct when encoding is UTF-8 (setting code page to 65001)

Actual Behavior

The output is incorrect on Terminal even when the encoding is UTF-8. I have verified with chcp that the code page is 65001.

Screenshots for reference

Conhost (correct behaviour)

cmd_NLvESW58q5

Windows Terminal (incorrect behaviour)

WindowsTerminal_U13vuiXMJk

cktgh commented 8 months ago

I am unsure whether this would help identify the issue, but here are a few things I noticed:

lhecker commented 8 months ago

I must be doing something wrong: [image]

lhecker commented 8 months ago

I noticed that you're using Windows 10. Thinking about this some more, this may be an issue with Windows 10, since its CRT (C/C++ stdlib) version is older, and the CRT never had excellent Unicode support. It's much better now, but it's still somewhat broken for surrogate pairs and such. I believe this is the case, because this issue doesn't reproduce for me on Windows 11, and it doesn't make any sense why chcp would have no effect for you. chcp definitely does work correctly, because I know that it changes the result of GetConsoleOutputCP and we haven't touched either of the two in a decade.

In other words, I think this may be a CRT issue. You could potentially fix it by compiling diagon yourself with the latest version of the Windows SDK. You may also need to statically link the CRT.

I'll leave this issue open because I can't really prove that it's due to the CRT.

cktgh commented 8 months ago

I noticed in your screenshot that the output is correct on code page 437, but that is not the case on my machine:

WindowsTerminal_Ij7jWir1Hh

cktgh commented 8 months ago

I know very little about the CRT and I don't quite understand how it could explain the different behavior on Conhost and Terminal.

Thanks for the help again!

lhecker commented 7 months ago

The CRT is Microsoft's implementation of C's standard library. It implements C functions like malloc and free, but also printf, which is what diagon uses to print text to stdout.

You seem to be aware what code pages are so I won't explain that part.

You might be familiar with this Region control panel: [image]

This control panel selects the value of the special CP_ACP code page (the system default code page as it says). For instance, on my PC the CP_ACP stands for code page 437 and on your PC it's 850. All narrow Windows APIs (the ones with the A at the end) use the CP_ACP. So if you call CreateFileA with a path, that path needs to be encoded in the 437 code page on my system and in 850 on yours.
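To illustrate why the active code page matters for narrow APIs, here is a small Python sketch. It only uses Python's built-in `cp437` and `cp850` codecs as stand-ins for the two code pages mentioned above; it does not call any Windows API, and the sample character is an arbitrary choice.

```python
# The same byte means different characters under the two code pages.
raw = "ø".encode("cp850")  # a single byte under code page 850

print(raw.hex())            # the byte value
print(raw.decode("cp850"))  # read back with the code page it was written in
print(raw.decode("cp437"))  # read with the other code page: a different character
```

A byte string is only meaningful together with the code page it was encoded in, which is why a narrow API has to assume one.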

The problem now is that when I say "all narrow Windows APIs" what I really meant to say is: All narrow Windows APIs except for the console APIs, because the original console API designers unfortunately often had the foresight of a brick wall.

A console application on Windows can simultaneously (!) read input in US OEM 437, write output in Latin1 850, and also read and write files in UTF8 65001. All at the same time! I can see why these 2 additional code pages were added, since they add a ton of flexibility, but the problem you're seeing is a direct consequence of this flexibility.

This is because the CRT now needs to sort of guess what code page you actually want. Given your screenshots, your version seems to always be using the CP_ACP. As explained above, this means that calling chcp won't ever have any effect on it.
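For anyone who wants to see the separate code pages on their own machine, the following is a rough Python sketch that queries them through `ctypes`. `GetACP`, `GetConsoleCP` and `GetConsoleOutputCP` are real kernel32 functions; the wrapper name `console_code_pages` and the dictionary keys are made up here, and the function simply returns `None` on non-Windows systems. It has not been tested against the setup in this issue.

```python
import ctypes
import sys


def console_code_pages():
    """Return the process's code pages, or None when not running on Windows."""
    if sys.platform != "win32":
        return None
    kernel32 = ctypes.windll.kernel32
    return {
        "ansi": kernel32.GetACP(),                      # CP_ACP, set via the Region control panel
        "console_input": kernel32.GetConsoleCP(),        # input code page, changed by chcp
        "console_output": kernel32.GetConsoleOutputCP(),  # output code page, changed by chcp
    }


print(console_code_pages())
```

Running it before and after `chcp 65001` should show the two console values change while the ANSI value stays the same.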

I can recommend 2 solutions for this that diagon can implement. To be clear, I have not tested any of these yet: