Open cktgh opened 8 months ago
I am unsure whether this will help identify the issue, but here are a few things I noticed:
diagon Math -- "1+1/2 + sum(i,0,10) = 112/2" | scb
would fill the clipboard with incorrectly encoded data.
diagon Math -- "1+1/2 + sum(i,0,10) = 112/2" > temp.txt
would fill temp.txt with correctly encoded data.
I must be doing something wrong
I noticed that you're using Windows 10. Thinking about this some more, this may be an issue with Windows 10, since its CRT (C/C++ stdlib) version is older, and the CRT never had excellent Unicode support. It's much better now, but it's still somewhat broken for surrogate pairs and such.
I believe this is the case, because this issue doesn't reproduce for me on Windows 11, and it doesn't make any sense why chcp would have no effect for you. chcp definitely does work correctly, because I know that it changes the result of GetConsoleOutputCP and we haven't touched either of the two in a decade.
In other words, I think this may be a CRT issue. You could potentially fix it by compiling diagon yourself with the latest version of the Windows SDK. You may also need to statically link the CRT.
I'll leave this issue open because I can't really prove that it's due to the CRT.
I noticed in your screenshot that the output is correct on code page 437, but it is not the case on my machine...
I know very little about the CRT and I don't quite understand how it could explain the different behavior on Conhost and Terminal.
Thanks for the help again!
The CRT is Microsoft's implementation of C's standard library. It implements C functions like malloc, free but also printf which is what diagon uses to print text to stdout.
You seem to be aware of what code pages are so I won't explain that part.
You might be familiar with this Region control panel:
This control panel selects the value of the special CP_ACP code page (the system default code page as it says). For instance, on my PC the CP_ACP stands for code page 437 and on your PC it's 850. All narrow Windows APIs (the ones with the A at the end) use the CP_ACP. So if you call CreateFileA with a path, that path needs to be encoded in the 437 code page on my system and in 850 on yours.
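To make the code page dependence concrete, here is a small Python sketch. Python is used only for illustration, and the file name is hypothetical. It shows that one byte maps to different characters under 437 and 850, and that one file name maps to different bytes depending on which code page is assumed.

```python
# One byte, two meanings: the result depends on the assumed code page.
raw = b"\xb5"
as_437 = raw.decode("cp437")  # how a system using code page 437 reads it
as_850 = raw.decode("cp850")  # how a system using code page 850 reads it
print(repr(as_437), repr(as_850))

# One (hypothetical) file name, two byte sequences.
name = "Øl.txt"
bytes_850 = name.encode("cp850")
# Code page 437 has no "Ø", so the character cannot be represented at all;
# errors="replace" substitutes "?" instead of raising an exception.
bytes_437 = name.encode("cp437", errors="replace")
print(bytes_850, bytes_437)
```

A narrow API that receives these bytes has no way of telling which code page produced them, which is why the caller and the system have to agree on one.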
The problem now is that when I say "all narrow Windows APIs" what I really meant to say is: All narrow Windows APIs except for the console APIs, because the original console API designers unfortunately often had the foresight of a brick wall.
A console application on Windows can simultaneously (!) read input in US OEM 437, write output in Latin1 850, and also read and write files in UTF8 65001. All at the same time! I can see why these 2 additional code pages were added, since it adds a ton of flexibility, but the problem you're seeing is a direct consequence of this flexibility.
Because the CRT now needs to sort of guess what code page you actually want. Given your screenshots, your version seems to always be using the CP_ACP. As explained above, this means that calling chcp won't ever have any effect on it.
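The symptom of such a mismatch can be imitated with a short Python sketch. It assumes the program emits UTF-8 bytes and that some layer in between interprets them under a legacy code page. Whether this is exactly what happens inside the CRT in this case is not proven in this thread. The sample characters are arbitrary choices for illustration.

```python
# Box-drawing and math characters of the kind diagon prints (arbitrary sample).
text = "─│∑"
utf8_bytes = text.encode("utf-8")

# Decode the same bytes under each candidate code page.
results = {}
for code_page in ("utf-8", "cp437", "cp850"):
    results[code_page] = utf8_bytes.decode(code_page)
    print(f"{code_page:>6}: {results[code_page]}")
```

Only the UTF-8 interpretation gives the original text back. Under 437 or 850, each multi-byte UTF-8 sequence turns into several unrelated characters.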
I can recommend 2 solutions for this that diagon can implement. To be clear, I have not tested any of these yet:
CP_ACP to UTF-8. BUT it will not change the console code page(s). That requires an explicit call to SetConsoleOutputCP at the start of the program.
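As a rough, untested sketch of what an explicit call at program start could look like, here is a Python version that goes through ctypes. The helper names use_utf8_console_output and console_code_pages are hypothetical. GetACP, GetConsoleCP, GetConsoleOutputCP and SetConsoleOutputCP are the real Win32 functions, and the same calls are available to a C or C++ program such as diagon. Outside Windows the helpers do nothing.

```python
import ctypes
import sys

CP_UTF8 = 65001  # Windows code page identifier for UTF-8


def console_code_pages():
    """Return (CP_ACP, console input CP, console output CP), or None off Windows."""
    if sys.platform != "win32":
        return None
    kernel32 = ctypes.windll.kernel32
    return (kernel32.GetACP(), kernel32.GetConsoleCP(), kernel32.GetConsoleOutputCP())


def use_utf8_console_output() -> bool:
    """Set the console output code page to UTF-8.

    Returns True on success, False on failure or on non-Windows platforms.
    """
    if sys.platform != "win32":
        return False
    return bool(ctypes.windll.kernel32.SetConsoleOutputCP(CP_UTF8))


if __name__ == "__main__":
    print("before:", console_code_pages())
    print("switched:", use_utf8_console_output())
    print("after:", console_code_pages())
```

Printing the three values before and after the switch makes it visible that the console output code page changes while CP_ACP stays the same, matching the point above that the two are independent settings.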
Windows Terminal version
1.18.10301.0
Windows build number
10.0.19045.0
Other Software
PowerShell 7.4.1
Diagon (diagon.exe from diagon-1.1.156-win64.zip)
Steps to reproduce
This is an issue that occurs on Terminal but not Conhost.
Executing the command
diagon Math -- "1+1/2 + sum(i,0,10) = 112/2"
would output:
The output is correct on UTF-8 (65001) and would be broken on "Multilingual (Latin I)" (850).
When the encoding is incorrect:
Expected Behavior
Output should be correct when encoding is UTF-8 (setting code page to 65001)
Actual Behavior
The output is incorrect on Terminal even when the encoding is UTF-8. I have verified this with chcp and the code page is 65001.
Screenshots for reference
Conhost (correct behaviour)
Windows Terminal (incorrect behaviour)