Output Windows Encoding Problem

oTnTh commented 2 years ago

SciTEUser.properties:

code.page=65001
output.code.page=936

t.au3:

$s = 'test中文测试test'
ConsoleWrite($s & @CRLF)

I think it's an encoding problem. AutoIt-VSCode should convert the encoding of the output strings before writing them to the Output panel of VSCode.

Thanks.

vanowm commented 2 years ago

Unable to reproduce (I copied the code from here, so it may have been saved as Unicode).

AutoIt-VSCode v1.0.9
Visual Studio Code v1.71.2

As far as I can tell AutoIt-VSCode doesn't use SciTE configurations.

oTnTh commented 2 years ago

The default code page depends on the Windows language settings; for Chinese it's cp936.

So I have to put these in my SciTEUser.properties to get correct output in the Output panel of SciTE.

code.page=65001
output.code.page=936

VSCode doesn't have anything similar, and that causes my problem.

Please take a look at this script:

Func _StringToCodepage($sStr, $iCodepage)
    Local $aResult = DllCall("kernel32.dll", "int", "WideCharToMultiByte", "uint", $iCodepage, "dword", 0, "wstr", $sStr, _
            "int", StringLen($sStr), "ptr", 0, "int", 0, "ptr", 0, "ptr", 0)
    Local $tCP = DllStructCreate("char[" & $aResult[0] & "]")
    $aResult = DllCall("Kernel32.dll", "int", "WideCharToMultiByte", "uint", $iCodepage, "dword", 0, "wstr", $sStr, _
            "int", StringLen($sStr), "struct*", $tCP, "int", $aResult[0], "ptr", 0, "ptr", 0)
    Return DllStructGetData($tCP, 1)
EndFunc   ;==>_StringToCodepage

$cp = DllCall("kernel32.dll", "int", "GetACP")
ConsoleWrite("Default Codepage: " & $cp[0] & @CRLF)
ConsoleWrite('----------------' & @CRLF)

; Unicode: U+4E2D U+6587
$strA = "中文"
ConsoleWrite("$strA: " & $strA & @CRLF)
ConsoleWrite(String(StringToBinary($strA)) & @CRLF)
ConsoleWrite('----------------' & @CRLF)

$strB = _StringToCodepage($strA, 65001)
ConsoleWrite("$strB: " & $strB & @CRLF)
ConsoleWrite(String(StringToBinary($strB)) & @CRLF)
ConsoleWrite('----------------' & @CRLF)

In SciTE, with output.code.page=936, everything works as expected.

VSCode assumes the output is encoded as UTF-8, which it is not.
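
To make the mismatch concrete, here is a minimal Node.js sketch. Node is assumed only because VSCode extensions run on it; nothing here is taken from the AutoIt-VSCode source, and `isValidUtf8` is a hypothetical helper name. It takes the bytes that 中文 has in GBK (cp936) and decodes them as UTF-8.

```javascript
// "中文" encoded as GBK (cp936): 中 = D6 D0, 文 = CE C4.
const gbkBytes = Buffer.from([0xd6, 0xd0, 0xce, 0xc4]);

// Decoding these bytes as UTF-8 cannot succeed: they do not form valid
// UTF-8 sequences, so they are replaced with U+FFFD (replacement character).
const misread = gbkBytes.toString('utf8');

// Hypothetical helper: a strict decoder throws on invalid input
// instead of silently substituting U+FFFD.
function isValidUtf8(bytes) {
  try {
    new TextDecoder('utf-8', { fatal: true }).decode(bytes);
    return true;
  } catch (err) {
    return false;
  }
}

console.log(JSON.stringify(misread));
console.log(isValidUtf8(gbkBytes));
console.log(isValidUtf8(Buffer.from('中文', 'utf8')));
```

The strict check can only tell that the bytes are not UTF-8; it cannot tell which code page they actually use.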

vanowm commented 2 years ago

VSCode doesn't seem to have cp936. Just copying and pasting the example code works fine... Can you attach a sample file?

oTnTh commented 2 years ago

cp936 is GBK, a superset of GB2312.

GB18030 is a superset of GBK, but it adds 4-byte sequences, so it has its own identifier, cp54936.

I don't know anything about the VSCode Extension API. If it has nothing like GetACP(), a setting such as autoit.outputCodePage would be good enough for me.

If the output of AutoIt is converted from autoit.outputCodePage to UTF-8 before it is written to the Output panel of VSCode, the problem should be solved.
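
A minimal sketch of that conversion, assuming a Node.js build with full ICU (so that `TextDecoder` knows the `gbk` label). `createOutputDecoder` is a hypothetical name, not something from AutoIt-VSCode, and the built-in `TextDecoder` is used here only to keep the example dependency-free.

```javascript
// Hypothetical helper: returns a function that turns raw stdout chunks
// from the AutoIt process into JavaScript strings.
function createOutputDecoder(outputCodePage) {
  // TextDecoder accepts WHATWG labels such as 'gbk' or 'utf-8'.
  const decoder = new TextDecoder(outputCodePage);
  // { stream: true } keeps an incomplete multi-byte character buffered
  // until the next chunk arrives, so characters split across chunks survive.
  return (chunk) => decoder.decode(chunk, { stream: true });
}

const decode = createOutputDecoder('gbk');

// "中文" in GBK is D6 D0 CE C4; these chunks split both characters.
const chunks = [
  Buffer.from([0xd6]),
  Buffer.from([0xd0, 0xce]),
  Buffer.from([0xc4]),
];
const text = chunks.map((chunk) => decode(chunk)).join('');
console.log(text);
```

In an extension, the returned function would presumably be called from the child process's stdout `data` handler, with the decoded string passed on to the Output panel.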

oTnTh commented 2 years ago

The encoding of script file is not relevant to this problem.

Can you show me the output of my script in VSCode, please?

t.au3.txt

vanowm commented 2 years ago

I guess we are out of luck on this one. Almost 7 years since it was requested...

oTnTh commented 2 years ago

WOW, a text editor (sort of) that cannot handle text encoding. I didn't expect that.

Seems there's nothing we can do now.

Thanks for your time.

vanowm commented 2 years ago

Well, technically, if you can see text of your code properly - it handles encoding properly...it's the output of another application that it's having issues with...

oTnTh commented 2 years ago

Even now (Win11 22H2), PowerShell and CMD use ANSI (i.e. cp936 for Chinese) as the default code page.

If I compile my script as a CUI EXE, here is the output:

Same as the output in Scite.

ConsoleWrite is intended to write to STDOUT, and the default code page of STDOUT is ANSI.

As a user, I would love to have a solution, but I can't say that Autoit is wrong.

Also, I think it's not fair to you. You did a great job, but CJK users have to choose.

vanowm commented 2 years ago

Maybe as a work around you could use this for now: https://www.autoitscript.com/forum/topic/208189--

vanowm commented 2 years ago

Proposed #123 adds a new option, Output Code Page. In this particular case I had to set it to gbk to get the proper result:

With cp936 I get a different $strB result:

oTnTh commented 2 years ago

WOW, thank you for continuing to work on this.

strB is not a valid GBK string, so when we try to encode strB from GBK to UTF-8, the result is meaningless.

I think you can ignore the difference in AutoIt-VSCode.

However, there are some differences between GBK and CP936.

You could think of GBK as ECMAScript 7 (the standard) and CP936 as Chrome's V8 (one implementation).

If a code point is undefined in the standard, the author of the charmap decides how to handle the conversion.

Take a look at this:

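// Requires the `encoding` npm package: convert(text, toCharset, fromCharset)
const encoding = require('encoding');

// UTF-8 bytes of "中文" (the content of strB)
const buf = Buffer.from([0xe4, 0xb8, 0xad, 0xe6, 0x96, 0x87]);
// Misread the UTF-8 bytes as GBK / CP936 and convert them to UTF-8
const resultB1 = encoding.convert(buf, 'utf-8', 'gbk');
const resultB2 = encoding.convert(buf, 'utf-8', 'cp936');
console.log(resultB1);
console.log(resultB2);
console.log('-----------------------------------');

// Convert the results back from UTF-8 to GBK / CP936
const resultC1 = encoding.convert(resultB1, 'gbk', 'utf-8');
const resultC2 = encoding.convert(resultB2, 'cp936', 'utf-8');
console.log(resultC1);
console.log(resultC2);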

Output:

<Buffer e6 b6 93 ee 85 9f e6 9e 83>
<Buffer e6 b6 93 ef bf bd e9 8f 82 ef bf bd>
-----------------------------------
<Buffer e4 b8 ad e6 96 87>
<Buffer e4 b8 3f e6 96 3f>

Even though strB is not a valid GBK string, after two conversions, with GBK argument, we didn't lose any data.

I'm not sure, but I guess that's why the GBK charmap of iconv-lite is not compatible with CP936.
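
The general principle behind the two results can be shown with Node built-ins alone. This is only an analogy: `utf8` and `latin1` stand in for a charmap that substitutes a replacement character and one that assigns a code point to every input, respectively. It does not reproduce the GBK or CP936 tables of iconv-lite.

```javascript
// Two bytes that are not valid UTF-8.
const original = Buffer.from([0xd6, 0xd0]);

// Lossy: the UTF-8 decoder replaces undecodable bytes with U+FFFD,
// so re-encoding cannot recover the original bytes.
const lossy = Buffer.from(original.toString('utf8'), 'utf8');

// Lossless: latin1 assigns a code point to every byte value,
// so decoding and re-encoding returns exactly what went in.
const lossless = Buffer.from(original.toString('latin1'), 'latin1');

console.log(lossy.equals(original));
console.log(lossless.equals(original));
```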

vanowm commented 2 years ago

It's all Chinese to me (pun intended)

Maybe it would be more suitable to report it at iconv-lite. If the PR goes forward, it will use the iconv-lite library instead of encoding.

oTnTh commented 2 years ago

It's not a bug in iconv-lite. Chinese people recognize the differences between GBK and CP936; they have to.

Text encodings are a real pain in the ass, really. Problems can pop up anywhere.

But native English speakers don't use them, and it's hard to explain to them. Like you said, it's all Chinese.

So I'm very grateful to you, I really am.

vanowm commented 2 years ago

So, the question is, does SciTE have the same issue (because in your screenshot it looks exactly like VSCode after conversion), or should I suspend the PR until we find a 100% working solution?

oTnTh commented 2 years ago

Generally, when we see garbled text, it just means "Something is wrong here".

As long as iconv-lite handles normal text correctly, I think we can ignore the details.

loganch / AutoIt-VSCode

Output Windows Encoding Problem #67