mingw-w64 / mingw-w64.github.io

mingw-w64.net web page contents (The new web page)

gdb has commandline encoding problem #47

Closed TsXor closed 1 year ago

TsXor commented 1 year ago
>gdb --version
GNU gdb (GDB) 11.2
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

I used the following code to get a wargv in a GUI application.

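#include <clocale>     // setlocale
#include <windows.h>   // GetCommandLineW, LocalFree
#include <shellapi.h>  // CommandLineToArgvW

// Parses the process command line into a wide-character argv during static initialization.
static struct MyArgsW {
    wchar_t* basestr;
    int argc;
    wchar_t** argv;
    MyArgsW() {
        setlocale(LC_ALL, "");
        this->basestr = GetCommandLineW();
        this->argv = CommandLineToArgvW(this->basestr, &this->argc);
    }
    ~MyArgsW() {
        LocalFree(this->argv);  // the array returned by CommandLineToArgvW is one LocalAlloc block
    }
} wargs;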

However, when I launched the program with the following arguments (under the VS Code debugger), it misbehaved and reported the wrong argc. When run without gdb, it works as expected.

            "args": [
                "--userdata-path", "E:\\Users\\23Xor\\Desktop\\OpenWebview2Window-ng\\dist\\bin\\x64\\webdata",
                "--navigate-url", "https://www.baidu.com",
                "--window-title", "百度一下",
                "--tray-control"
            ],

Here is part of the command line in memory, as returned by GetCommandLineW. [screenshot 21:53:59]

We can see that "百度一下" was encoded as follows: [screenshot 21:57:21]

It seems to be an encoding problem, so I ran an experiment to make sure.

百度一下
(encode with UTF-8)--->
E7 99 BE  E5 BA A6  E4 B8 80  E4 B8 8B
(decode with GB2312, the ANSI codepage of my environment)--->
鐧惧害涓€涓?
(encode with UTF-16LE, shown as bytes in memory)--->
27 94  E7 60  B3 5B  93 6D  AC 20  93 6D  3F 00

So my guess is that gdb passes the command line in UTF-8, but the command line is actually interpreted in the ANSI encoding and then converted to UTF-16.
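
The byte values in the experiment above can be checked mechanically. The following is a minimal sketch, not code from gdb or mingw-w64: `utf8_encode`, `utf16le_bytes`, and `hex` are hypothetical helper names, only BMP code points are handled, and the mis-decoded string is written out by hand rather than produced by a real code page decoder.

```cpp
#include <cassert>
#include <string>

// Encode BMP code points (no surrogate pairs) as UTF-8 bytes.
std::string utf8_encode(const std::u16string& text) {
    std::string out;
    for (char16_t c : text) {
        if (c < 0x80) {
            out += static_cast<char>(c);
        } else if (c < 0x800) {
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        } else {
            out += static_cast<char>(0xE0 | (c >> 12));
            out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}

// Lay out UTF-16 code units as little-endian bytes, the order in which
// a memory view on Windows shows them.
std::string utf16le_bytes(const std::u16string& text) {
    std::string out;
    for (char16_t c : text) {
        out += static_cast<char>(c & 0xFF);  // low byte first
        out += static_cast<char>(c >> 8);    // then high byte
    }
    return out;
}

// Format bytes as space-separated upper-case hex.
std::string hex(const std::string& bytes) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char b : bytes) {
        if (!out.empty()) out += ' ';
        out += digits[b >> 4];
        out += digits[b & 0x0F];
    }
    return out;
}
```

With these helpers, `hex(utf8_encode(u"\u767E\u5EA6\u4E00\u4E0B"))` gives `E7 99 BE E5 BA A6 E4 B8 80 E4 B8 8B`, and `hex(utf16le_bytes(u"\u9427\u60E7\u5BB3\u6D93\u20AC\u6D93?"))` gives `27 94 E7 60 B3 5B 93 6D AC 20 93 6D 3F 00`, which matches the pattern described in the experiment.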

Biswa96 commented 1 year ago

Why did you report a gdb issue in the mingw-w64 repository? Also, try the latest gdb version, 13.2.

lhmouse commented 1 year ago

Not our bug. Read this first: https://github.com/msys2/MINGW-packages/issues/17398

TsXor commented 1 year ago

> Not our bug. Read this first: msys2/MINGW-packages#17398

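As I understand it, the problem in the issue you linked is that gdb writes to the console in UTF-8, while the console, unless you run chcp 65001 first, interprets and displays the output using the ANSI code page. The fix mentioned there, SetConsoleOutputCP(65001), sets the console output code page to UTF-8, so the output is interpreted correctly and displays properly. In practice, running chcp 65001 first is almost mandatory when using Windows ports of Linux terminal programs.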
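
The console-side fix described here can be sketched as follows. This is an illustration only, not the change made in the linked issue: `enable_utf8_console` is a hypothetical helper name, and on non-Windows builds it is assumed to be a no-op that reports success.

```cpp
#include <cassert>
#ifdef _WIN32
#include <windows.h>
#endif

// Switch the console output code page to UTF-8 (65001). This covers only the
// output side; it does not change how the process command line is encoded.
// Returns true on success.
bool enable_utf8_console() {
#ifdef _WIN32
    return SetConsoleOutputCP(CP_UTF8) != 0;
#else
    return true;  // assumption: nothing to do outside Windows
#endif
}
```

A program would call `enable_utf8_console()` once at the start of `main`, before printing any UTF-8 text.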

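However, according to another experiment of mine, when Windows itself launches a program normally, the program's command line is always passed in the ANSI encoding, and chcp cannot change this behavior.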

lhmouse commented 1 year ago

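> However, according to another experiment of mine, when Windows itself launches a program normally, the program's command line is always passed in the ANSI encoding, and chcp cannot change this behavior.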

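Parsing argv is not mingw-w64's responsibility. In the MSVCRT era it was done by a library function ( https://github.com/mingw-w64/mingw-w64/blob/master/mingw-w64-crt/crt/crtexe.c#L141 ); starting with UCRT it seems to have become a variable that is, for whatever reason, already parsed in advance, with an exported function that returns its address ( https://github.com/mingw-w64/mingw-w64/blob/024035cfd9059ebd39d95ab22b009ef3b88b4040/mingw-w64-crt/crt/ucrtbase_compat.c#L67 ). Naturally, none of this is affected by the console code page, because no I/O is involved, and all of it is a black box. There is nothing we can do; if you think it is a bug, please report it to Microsoft.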

TsXor commented 1 year ago

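> However, according to another experiment of mine, when Windows itself launches a program normally, the program's command line is always passed in the ANSI encoding, and chcp cannot change this behavior.
>
> Parsing argv is not mingw-w64's responsibility. In the MSVCRT era it was done by a library function ( https://github.com/mingw-w64/mingw-w64/blob/master/mingw-w64-crt/crt/crtexe.c#L141 ); starting with UCRT it seems to have become a variable that is, for whatever reason, already parsed in advance, with an exported function that returns its address ( https://github.com/mingw-w64/mingw-w64/blob/024035cfd9059ebd39d95ab22b009ef3b88b4040/mingw-w64-crt/crt/ucrtbase_compat.c#L67 ). Naturally, none of this is affected by the console code page, because no I/O is involved, and all of it is a black box. There is nothing we can do; if you think it is a bug, please report it to Microsoft.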

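I raised this issue (albeit in the wrong place) precisely because it does have a solution in the end. In fact, gdb only needs to use the Win32 W-family functions when creating the process, because the W functions are guaranteed to accept only UTF-16. Even if gcc uses spawn, there are W versions such as _wspawn.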

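I just tried again: if I rename the program to a Chinese name, gdb fails to launch it outright. This suggests that the command-line encoding is most likely wrong when gdb creates the child process.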
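
The proposal above amounts to converting the command line to UTF-16 and handing it to a W-family API. Here is a minimal sketch under stated assumptions: this is not gdb's actual code, `utf8_to_utf16` and `launch_utf8` are hypothetical helpers, the input is assumed to be well-formed UTF-8, and the process-creation part is compiled only when `_WIN32` is defined.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Decode UTF-8 into UTF-16 code units. Assumes well-formed input; real code
// would validate it (or call MultiByteToWideChar with CP_UTF8 on Windows).
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t len;
        if (b < 0x80)      { cp = b;        len = 1; }
        else if (b < 0xE0) { cp = b & 0x1F; len = 2; }
        else if (b < 0xF0) { cp = b & 0x0F; len = 3; }
        else               { cp = b & 0x07; len = 4; }
        // Fold in the 6 payload bits of each continuation byte.
        for (std::size_t k = 1; k < len && i + k < in.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        i += len;
        if (cp >= 0x10000) {  // outside the BMP: emit a surrogate pair
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        } else {
            out += static_cast<char16_t>(cp);
        }
    }
    return out;
}

#ifdef _WIN32
// Start a child process from a UTF-8 command line via CreateProcessW, so the
// child's GetCommandLineW() sees the intended UTF-16 text.
bool launch_utf8(const std::string& cmdline_utf8) {
    std::u16string u16 = utf8_to_utf16(cmdline_utf8);
    std::wstring cmd(u16.begin(), u16.end());  // wchar_t is 16 bits on Windows
    STARTUPINFOW si = {};
    si.cb = sizeof(si);
    PROCESS_INFORMATION pi = {};
    // CreateProcessW may modify the command-line buffer, so pass writable storage.
    if (!CreateProcessW(nullptr, &cmd[0], nullptr, nullptr, FALSE, 0,
                        nullptr, nullptr, &si, &pi))
        return false;
    CloseHandle(pi.hThread);
    CloseHandle(pi.hProcess);
    return true;
}
#endif
```

The conversion is kept separate from the launch so it can be checked on any platform; for example, the UTF-8 bytes of "百度一下" decode to the four code units U+767E U+5EA6 U+4E00 U+4E0B.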

TsXor commented 1 year ago

Closing this and taking it to gdb.

lhmouse commented 1 year ago

Also see this: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108865