Open waitlamp opened 2 years ago
unrelated to MIEngine#1025
I've also encountered this problem and found it a bit weird.
To begin with, the Chinese message file of gcc/g++ is `<GCC install dir>\share\locale\zh_CN\LC_MESSAGES\gcc.mo`, whose character encoding is UTF-8.
It seems that gcc/g++ first converts the Chinese message into a multibyte string in your system encoding and then always tries to decode it with GBK encoding when outputting it to the console, even if your system encoding is UTF-8.
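The kind of garbling this produces can be reproduced outside gcc. Below is a minimal Python sketch (Python is used only for illustration, and the full-width colon in the sample string is an assumption about the catalog text) that decodes the UTF-8 bytes of a translated string as GBK:

```python
# Illustrative sketch only: reproduce UTF-8 text being misread as GBK.
# Assumption: the catalog string ends with a full-width colon (U+FF1A).
message = "选项\uff1a"  # "Options:" in the zh_CN translation

utf8_bytes = message.encode("utf-8")

# GBK pairs the bytes up differently from UTF-8, so every character changes.
# The odd trailing byte has no partner and is replaced by U+FFFD.
garbled = utf8_bytes.decode("gbk", errors="replace")

print(utf8_bytes.hex())
print(garbled)
```

The first three garbled characters match the `閫夐」` that shows up in the debugger transcript further down, which is consistent with UTF-8 bytes being displayed through a GBK code page.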
Both `dumpbin /imports gcc.exe` and `objdump -x gcc.exe` suggest that gcc/g++ imports functions like `libiconv` from `libiconv-2.dll`, which is probably how gcc/g++ does encoding conversions.
Besides, I have been debugging gcc/g++ with `cdb` and `gdb` these days, which is quite an awkward experience since I have to debug it instruction by instruction.
Fortunately, I've worked out how gcc/g++ outputs messages.
It uses `fputc` to output:
用法:g++.exe [选项] 文件...
以 -g、-f、-m、-O、-W 或 --param 开头的选项将由 g++.exe 自动传递给其调用的
不同子进程。若要向这些进程传递其他选项,必须使用 -W<字母> 选项。
报告程序缺陷的步骤请参见:
<https://github.com/jmeubank/tdm-gcc/issues>.
It uses `fputs` to output:
选项:
-pass-exit-codes 在某一阶段退出时返回其中最高的错误码。
--help 显示此帮助说明。
--target-help 显示目标机器特定的命令行选项。
--help={common|optimizers|params|target|warnings|[^]{joined|separate|undocumented}}[,...]。
显示特定类型的命令行选项。
(使用‘-v --help’显示子进程的命令行参数)。
...
The debugging process:
C:\Users\ivan>"C:\Program Files (x86)\Windows Kits\10\Debuggers\x86\cdb.exe" "C:\Program Files\TDM-GCC-64\bin\g++.exe" --help
...
0:000> bm msvcrt!*put*
1: 772adc81 @!"msvcrt!_cprinput_l"
2: 772b38be @!"msvcrt!_cprinput_l"
...
0:000> bm msvcrt!*print*
55: 772b6d70 @!"msvcrt!_fwprintf_l"
56: 772b8f10 @!"msvcrt!_vfprintf_s_l"
...
0:000> g
...
0:000> g
g
Breakpoint 39 hit
eax=00006000 ebx=0073fd34 ecx=772f9620 edx=0000002b esi=01318d71 edi=01318d76
eip=772bbd70 esp=0073fcac ebp=00968075 iopl=0 nv up ei pl nz na po nc
cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00000202
msvcrt!fputc:
772bbd70 6a10 push 10h
0:000> g
Breakpoint 39 hit
eax=00006000 ebx=0073fd34 ecx=772f9620 edx=0000002b esi=01318d72 edi=01318d76
eip=772bbd70 esp=0073fcac ebp=00968075 iopl=0 nv up ei pl nz na po nc
cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00000202
msvcrt!fputc:
772bbd70 6a10 push 10h
0:000> g
Breakpoint 39 hit
eax=00006000 ebx=0073fd34 ecx=772f9620 edx=0000002e esi=01318d73 edi=01318d76
eip=772bbd70 esp=0073fcac ebp=00968075 iopl=0 nv up ei pl nz na po nc
cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00000202
msvcrt!fputc:
772bbd70 6a10 push 10h
0:000> g
...
0:000> g
Breakpoint 48 hit
eax=0096459d ebx=004a4410 ecx=00000000 edx=00000000 esi=772f9620 edi=00000000
eip=772bbeb0 esp=0073fdcc ebp=009839f0 iopl=0 nv up ei pl nz na pe nc
cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00000206
msvcrt!fputs:
772bbeb0 6a10 push 10h
0:000> g
閫夐」锛?
Breakpoint 48 hit
eax=00955d85 ebx=004a4410 ecx=00000000 edx=00000000 esi=772f9620 edi=00000000
eip=772bbeb0 esp=0073fdcc ebp=009839f0 iopl=0 nv up ei pl nz na pe nc
cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00000206
msvcrt!fputs:
772bbeb0 6a10 push 10h
0:000> g
-pass-exit-codes 鍦ㄦ煇涓€闃舵閫€鍑烘椂杩斿洖鍏朵腑鏈€楂樼殑閿欒鐮併€?
Breakpoint 48 hit
eax=0095539c ebx=004a4410 ecx=00000000 edx=00000000 esi=772f9620 edi=00000000
eip=772bbeb0 esp=0073fdcc ebp=009839f0 iopl=0 nv up ei pl nz na pe nc
cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00000206
msvcrt!fputs:
772bbeb0 6a10 push 10h
0:000> g
...
Good news! I've almost worked it out. Let me walk through my debugging process and the solution.
First, launch the debugger: `cdb g++ --help`.
Second, set breakpoints on some crucial functions:
0:000> bp msvcrt!setlocale
0:000> bp libiconv_2!libiconv_open
*** WARNING: Unable to verify timestamp for C:\Program Files\TDM-GCC-64\bin\libiconv-2.dll
0:000> bl
0 e 75587640 0001 (0001) 0:**** msvcrt!setlocale
1 e 7cad4590 0001 (0001) 0:**** libiconv_2!libiconv_open
Then we should inspect these functions.
`setlocale` is `char *__cdecl setlocale(int _Category, const char *_Locale)`. On entry to a `__cdecl` function the return address sits at `@esp`, so the address of its first parameter `int _Category` is `@esp + 4` and its value is `0x00000000` (i.e. `LC_ALL`). The address of its second parameter `const char *_Locale` is `@esp + 8` and its value is `0x00504c2c`, which points to the string `""`. So what g++ does here is `setlocale(LC_ALL, "")`.
0:000> g
Breakpoint 0 hit
*** WARNING: Unable to verify timestamp for C:\Program Files\TDM-GCC-64\bin\g++.exe
eax=755e9640 ebx=0073fea0 ecx=0073fea0 edx=00000002 esi=007e4528 edi=007e4570
eip=75587640 esp=0073fdec ebp=0073fec8 iopl=0 nv up ei pl nz na pe nc
cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00000206
msvcrt!setlocale:
75587640 6a18 push 18h
0:000> dd @esp
0073fdec 00438a77 00000000 00504c2c 75577e92
0073fdfc 0073fea4 007e4528 007e4570 0073fea0
0073fe0c 0040c68d 00000002 007e4528 00080000
0073fe1c 0073fea4 0073fea8 7558823f 00000080
0073fe2c 0073fea0 007e4528 007e4570 0073fea0
0073fe3c 004aa04d 00000002 007e4528 007e3ab4
0073fe4c 0073fe58 75598775 755e931c 0073fe98
0073fe5c 75597c3d 00000008 7558817d 75588163
0:000> da 504c2c
00504c2c ""
`libiconv_open` is `iconv_t libiconv_open(const char* tocode, const char* fromcode)`. The address of its first parameter `const char* tocode` is `@esp + 4` and its value is `0x0073faf0`, which points to the string `"CP65001//TRANSLIT"`. The address of its second parameter `const char* fromcode` is `@esp + 8` and its value is `0x0073fb10`, which points to the string `"UTF-8"`. So what g++ does here is `libiconv_open("CP65001//TRANSLIT", "UTF-8")`.
0:000> g
Breakpoint 1 hit
eax=0073fb10 ebx=01073f76 ecx=0073faf7 edx=0073faf0 esi=005211ac edi=007e2110
eip=7cad4590 esp=0073fadc ebp=0073fb68 iopl=0 nv up ei pl nz na po nc
cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00000202
libiconv_2!libiconv_open:
7cad4590 55 push ebp
0:000> dd @esp
0073fadc 0045f628 0073faf0 0073fb10 00000007
0073faec 7558e170 35365043 2f313030 4152542f
0073fafc 494c534e 00520054 0000002f 00000005
0073fb0c 00000000 2d465455 00000038 00000005
0073fb1c 0045f52f 010740c9 0051480e 00514804
0073fb2c 0073fb4c 0073fb48 00000007 0073fb10
0073fb3c 0073faf0 0000004c 0849f830 0073fb68
0073fb4c 000001f8 007e0000 00000000 0000004c
0:000> da 73faf0
0073faf0 "CP65001//TRANSLIT"
0:000> da 73fb10
0073fb10 "UTF-8"
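The same arguments can be read straight out of the `dd @esp` dump without `da`. The following is a small Python sketch of that decoding; the helper names `parse_dd`, `read_u32` and `read_cstr` are hypothetical and exist only for this illustration, and the dump text is copied from the transcript above:

```python
import struct

# First four rows of the `dd @esp` output at the libiconv_open breakpoint.
DUMP = """\
0073fadc 0045f628 0073faf0 0073fb10 00000007
0073faec 7558e170 35365043 2f313030 4152542f
0073fafc 494c534e 00520054 0000002f 00000005
0073fb0c 00000000 2d465455 00000038 00000005
"""

def parse_dd(dump):
    """Turn `dd` output into an {address: byte} map (x86 is little-endian)."""
    mem = {}
    for line in dump.splitlines():
        fields = line.split()
        base = int(fields[0], 16)
        for i, word in enumerate(fields[1:]):
            for j, byte in enumerate(struct.pack("<I", int(word, 16))):
                mem[base + 4 * i + j] = byte
    return mem

def read_u32(mem, addr):
    """Read one little-endian 32-bit value."""
    return struct.unpack("<I", bytes(mem[addr + k] for k in range(4)))[0]

def read_cstr(mem, addr):
    """Read a NUL-terminated ASCII string."""
    out = bytearray()
    while mem[addr] != 0:
        out.append(mem[addr])
        addr += 1
    return out.decode("ascii")

mem = parse_dd(DUMP)
esp = 0x0073FADC

ret_addr = read_u32(mem, esp)                        # [esp]     return address
tocode = read_cstr(mem, read_u32(mem, esp + 4))      # [esp + 4] first argument
fromcode = read_cstr(mem, read_u32(mem, esp + 8))    # [esp + 8] second argument

print(hex(ret_addr), tocode, fromcode)
```

Both strings happen to live on the stack just above the arguments, which is why four rows of the dump are enough here.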
As you can see, `libiconv` converts the encodings correctly. The actual problem is that `setlocale` conflicts with `fputc` and `fputs`. Therefore, we can replace the `call` instruction that invokes `setlocale` with `nop` instructions to solve the problem.
0:000> .restart
...
0:000> bl
0:000> bp msvcrt!setlocale
0:000> bl
0 e 75587640 0001 (0001) 0:**** msvcrt!setlocale
0:000> g
Breakpoint 0 hit
*** WARNING: Unable to verify timestamp for C:\Program Files\TDM-GCC-64\bin\g++.exe
eax=755e9640 ebx=0073fea0 ecx=0073fea0 edx=00000002 esi=01104528 edi=01104570
eip=75587640 esp=0073fdec ebp=0073fec8 iopl=0 nv up ei pl nz na pe nc
cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00000206
msvcrt!setlocale:
75587640 6a18 push 18h
0:000> k
ChildEBP RetAddr
0073fde8 00438a77 msvcrt!setlocale
WARNING: Stack unwind information not available. Following frames may be wrong.
0073fec8 00401396 g__+0x38a77
0073ff68 75ef7ba9 g__+0x1396
0073ff84 772fb79b KERNEL32!BaseThreadInitThunk+0x19
0073ffdc 772fb71f ntdll!__RtlUserThreadStart+0x2b
0073ffec 00000000 ntdll!_RtlUserThreadStart+0x1b
The `RetAddr` of `setlocale` is `0x00438a77`, which is the address of the instruction right after the call. A near `call` with a 32-bit relative operand is 5 bytes long, so the call instruction itself is at `0x00438a72`.
0:000> u 438a77 L-5
g__+0x38a72:
00438a72 e889b10600 call g__+0xa3c00 (004a3c00)
00438a77 c74424042d4c5000 mov dword ptr [esp+4],offset g__+0x104c2d (00504c2d)
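As a cross-check of the address arithmetic, here is a short Python sketch (not part of the original debugging session) that derives the call site from the return address and decodes the call target from the bytes `e889b10600` shown in the disassembly:

```python
import struct

ret_addr = 0x00438A77                     # RetAddr reported by `k`
call_bytes = bytes.fromhex("e889b10600")  # bytes shown by `u 438a77 L-5`

assert call_bytes[0] == 0xE8              # opcode of `call rel32`
call_addr = ret_addr - len(call_bytes)    # the call ends where RetAddr begins

# rel32 is a signed little-endian displacement from the next instruction.
(rel32,) = struct.unpack("<i", call_bytes[1:])
target = ret_addr + rel32

# One 0x90 (nop) per byte of the call keeps the following code in place.
nop_patch = b"\x90" * len(call_bytes)

print(hex(call_addr), hex(target), nop_patch.hex())
```

The decoded target agrees with the `004a3c00` printed by the debugger, so the five bytes at `0x438a72` are exactly the call that needs to be replaced.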
Now replace that call with `nop` instructions:
0:000> .restart
...
0:000> u 438a72
*** WARNING: Unable to verify timestamp for C:\Program Files\TDM-GCC-64\bin\g++.exe
g__+0x38a72:
00438a72 e889b10600 call g__+0xa3c00 (004a3c00)
00438a77 c74424042d4c5000 mov dword ptr [esp+4],offset g__+0x104c2d (00504c2d)
00438a7f c70424484c5000 mov dword ptr [esp],offset g__+0x104c48 (00504c48)
00438a86 e8a5680200 call g__+0x5f330 (0045f330)
00438a8b c70424484c5000 mov dword ptr [esp],offset g__+0x104c48 (00504c48)
00438a92 e8097c0200 call g__+0x606a0 (004606a0)
00438a97 c704244c4c5000 mov dword ptr [esp],offset g__+0x104c4c (00504c4c)
00438a9e e8cd690200 call g__+0x5f470 (0045f470)
0:000> eb 438a72 90 90 90 90 90
0:000> u 438a72
g__+0x38a72:
00438a72 90 nop
00438a73 90 nop
00438a74 90 nop
00438a75 90 nop
00438a76 90 nop
00438a77 c74424042d4c5000 mov dword ptr [esp+4],offset g__+0x104c2d (00504c2d)
00438a7f c70424484c5000 mov dword ptr [esp],offset g__+0x104c48 (00504c48)
00438a86 e8a5680200 call g__+0x5f330 (0045f330)
0:000> g
用法:g++ [选项] 文件...
选项:
-pass-exit-codes 在某一阶段退出时返回其中最高的错误码。
--help 显示此帮助说明。
--target-help 显示目标机器特定的命令行选项。
--help={common|optimizers|params|target|warnings|[^]{joined|separate|undocumented}}[,...]。
显示特定类型的命令行选项。
(使用‘-v --help’显示子进程的命令行参数)。
...
Finally, it's time to modify `g++.exe`!
1. Run `objdump -h g++.exe`. The `VMA` of `.text` is `0x00401000`, and its `File off` is `0x00000400`.
g++.exe:     file format pei-i386

Sections:
Idx Name          Size      VMA       LMA       File off  Algn
  0 .text         000adbd8  00401000  00401000  00000400  2**4
                  CONTENTS, ALLOC, LOAD, READONLY, CODE, DATA
  1 .data         00000804  004af000  004af000  000ae000  2**5
                  CONTENTS, ALLOC, LOAD, DATA
  2 .rdata        0006cde0  004b0000  004b0000  000aea00  2**5
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  3 .bss          000051d4  0051d000  0051d000  00000000  2**5
                  ALLOC
  4 .idata        00001688  00523000  00523000  0011b800  2**2
                  CONTENTS, ALLOC, LOAD, DATA
  5 .CRT          00000038  00525000  00525000  0011d000  2**2
                  CONTENTS, ALLOC, LOAD, DATA
  6 .tls          00000008  00526000  00526000  0011d200  2**2
                  CONTENTS, ALLOC, LOAD, DATA
  7 .rsrc         000004e8  00527000  00527000  0011d400  2**2
                  CONTENTS, ALLOC, LOAD, DATA
  8 .reloc        0000d4f0  00528000  00528000  0011da00  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
2. Calculate the `File off` of the call instruction mentioned before.
0x438a72 - 0x401000 + 0x400 = 0x37e72
3. Replace the 5 bytes at file offsets `0x37e72` to `0x37e76` of `g++.exe` with `0x90`.
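The steps above can be sketched in Python as follows. This is a hedged illustration rather than a ready-made patcher: the function names `va_to_file_offset` and `patch_with_nops` are hypothetical, and the code operates on a stand-in buffer instead of the real `g++.exe`.

```python
def va_to_file_offset(va, section_vma, section_file_off):
    """Map a virtual address inside a section to its offset in the file."""
    return va - section_vma + section_file_off

def patch_with_nops(image, offset, expected):
    """Overwrite len(expected) bytes with 0x90, but only if they match."""
    end = offset + len(expected)
    if bytes(image[offset:end]) != expected:
        raise ValueError("unexpected bytes at patch site; refusing to patch")
    image[offset:end] = b"\x90" * len(expected)

# Values taken from the objdump -h output for .text.
TEXT_VMA, TEXT_FILE_OFF = 0x00401000, 0x00000400
offset = va_to_file_offset(0x00438A72, TEXT_VMA, TEXT_FILE_OFF)

# Stand-in buffer: zeros with the 5 call bytes placed at the computed offset.
call_bytes = bytes.fromhex("e889b10600")
image = bytearray(offset + 16)
image[offset:offset + len(call_bytes)] = call_bytes

patch_with_nops(image, offset, call_bytes)
print(hex(offset), image[offset:offset + 5].hex())
```

Checking the existing bytes before writing guards against patching a different build of the binary, where the call may sit at another address. For a real run one would presumably load a backup copy of the executable into the `bytearray` and write the result to a new file.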
# Remaining problem(s)
1. See the picture below. Maybe the problem is related to escape sequence strings.
![image](https://github.com/jmeubank/tdm-gcc/assets/110970449/2eb58455-8dc6-45bf-92b8-6a32edcdbaf7)
The encoding issue is still there. This is what it looks like in PowerShell (the situation is similar in cmd):
PS D:\> gcc --help
ó?·¨£ogcc.exe [????] ???t...
????£o
-pass-exit-codes ?ú?3ò??×??í?3?ê±·μ?????D×???μ?′í?ó???£
--help ??ê?′?°??ú?μ?÷?£
--target-help ??ê???±ê?ú?÷ì??¨μ??üá?DD?????£
--help={common|optimizers|params|target|warnings|[^]{joined|separate|undocumented}}[,...]?£
??ê?ì??¨ààDíμ??üá?DD?????£
£¨ê1ó???-v --help?ˉ??ê?×ó??3ìμ??üá?DD2?êy£??£
--version ??ê?±àò??÷°?±?D??¢?£
-dumpspecs ??ê??ùóD?ú?¨ spec ×?·?′??£
-dumpversion ??ê?±àò??÷μ?°?±?o??£
-dumpmachine ??ê?±àò??÷μ???±ê′|àí?÷?£
-print-search-dirs ??ê?±àò??÷μ????÷?·???£
-print-libgcc-file-name ??ê?±àò??÷°é???aμ???3??£
-print-file-name=<?a> ??ê? <?a> μ?íê???·???£
-print-prog-name=<3ìDò> ??ê?±àò??÷×é?t <3ìDò> μ?íê???·???£
-print-multiarch ??ê???±êμ?±ê×? GNU èy?a×飨±?ó?óú?a?·??μ?ò?2?·?£??£
-print-multi-directory ??ê?2?í?°?±? libgcc μ??ù?????£
-print-multi-lib ??ê??üá?DD????oí?à??°?±??a???÷?·????μ?ó3é??£
-print-multi-os-directory ??ê?2ù×÷?μí3?aμ??à???·???£
-print-sysroot ??ê???±ê?a?????£
-print-sysroot-headers-suffix ??ê?ó?óú?°?òí·???tμ? sysroot oó×o?£
-Wa,<????> ???oo?·???μ? <????> ′?μY????±à?÷?£
-Wp,<????> ???oo?·???μ? <????> ′?μY???¤′|àí?÷?£
-Wl,<????> ???oo?·???μ? <????> ′?μY??á′?ó?÷?£
-Xassembler <2?êy> ?? <2?êy> ′?μY????±à?÷?£
-Xpreprocessor <2?êy> ?? <2?êy> ′?μY???¤′|àí?÷?£
-Xlinker <2?êy> ?? <2?êy> ′?μY??á′?ó?÷?£
-save-temps 2?é?3y?D?????t?£
-save-temps=<2?êy> 2?é?3y?D?????t?£
-no-canonical-prefixes éú3é???? gcc ×é?tμ??à???·??ê±2?éú3é1?·??ˉμ?
?°×o?£
-pipe ê1ó?1üμà′úì?áùê±???t?£
-time ?a????×ó??3ì??ê±?£
-specs=<???t> ó? <???t> μ??úèY?2???ú?¨μ? specs ???t?£
-std=<±ê×?> ?ù?¨ê?è??′???t×??-???¨μ?±ê×??£
--sysroot=<????> ?? <????> ×÷?aí·???toí?a???tμ??ù?????£
-B <????> ?? <????> ìí?óμ?±àò??÷μ????÷?·???D?£
-v ??ê?±àò??÷μ÷ó?μ?3ìDò?£
-### ó? -v àà??£?μ?????±?òyo?à¨×?£?2¢?ò2??′DD?üá??£
-E ??×÷?¤′|àí£?2???DD±àò??¢??±à?òá′?ó?£
-S ±àò?μ???±àó???£?2???DD??±àoíá′?ó£?
-c ±àò??¢??±àμ???±ê′ú??£?2???DDá′?ó?£
-o <???t> ê?3?μ? <???t>?£
-pie éú3é?ˉì?á′?óμ??????T1??é?′DD???t?£
-shared éú3éò???12?í?a?£
-x <ó???> ???¨??oóê?è????tμ?ó????£
?êDíμ?ó???°üਣoc?¢c++?¢assembler?¢none
??none?ˉòa??×????′??è?DD?a£??′?ù?Y???tμ?à??1??2?2a
?′???tμ?ó????£
ò? -g?¢-f?¢-m?¢-O?¢-W ?ò --param ?aí·μ???????óé gcc.exe ×??ˉ′?μY????μ÷ó?μ?
2?í?×ó??3ì?£è?òa?ò?aD???3ì′?μY????????£?±?D?ê1ó? -W<×???> ?????£
±¨??3ìDòè±?Yμ?2??è??2???£o
<https://github.com/jmeubank/tdm-gcc/issues>.
ChatGPT gave a temporary workaround: redirect the output to a file:
PS D:\> gcc --help > output.txt
And this is what it looks like in Notepad:
用法:gcc.exe [选项] 文件...
选项:
-pass-exit-codes 在某一阶段退出时返回其中最高的错误码。
--help 显示此帮助说明。
--target-help 显示目标机器特定的命令行选项。
--help={common|optimizers|params|target|warnings|[^]{joined|separate|undocumented}}[,...]。
显示特定类型的命令行选项。
(使用‘-v --help’显示子进程的命令行参数)。
--version 显示编译器版本信息。
-dumpspecs 显示所有内建 spec 字符串。
-dumpversion 显示编译器的版本号。
-dumpmachine 显示编译器的目标处理器。
-print-search-dirs 显示编译器的搜索路径。
-print-libgcc-file-name 显示编译器伴随库的名称。
-print-file-name=<库> 显示 <库> 的完整路径。
-print-prog-name=<程序> 显示编译器组件 <程序> 的完整路径。
-print-multiarch 显示目标的标准 GNU 三元组(被用于库路径的一部分)。
-print-multi-directory 显示不同版本 libgcc 的根目录。
-print-multi-lib 显示命令行选项和多个版本库搜索路径间的映射。
-print-multi-os-directory 显示操作系统库的相对路径。
-print-sysroot 显示目标库目录。
-print-sysroot-headers-suffix 显示用于寻找头文件的 sysroot 后缀。
-Wa,<选项> 将逗号分隔的 <选项> 传递给汇编器。
-Wp,<选项> 将逗号分隔的 <选项> 传递给预处理器。
-Wl,<选项> 将逗号分隔的 <选项> 传递给链接器。
-Xassembler <参数> 将 <参数> 传递给汇编器。
-Xpreprocessor <参数> 将 <参数> 传递给预处理器。
-Xlinker <参数> 将 <参数> 传递给链接器。
-save-temps 不删除中间文件。
-save-temps=<参数> 不删除中间文件。
-no-canonical-prefixes 生成其他 gcc 组件的相对路径时不生成规范化的
前缀。
-pipe 使用管道代替临时文件。
-time 为每个子进程计时。
-specs=<文件> 用 <文件> 的内容覆盖内建的 specs 文件。
-std=<标准> 假定输入源文件遵循给定的标准。
--sysroot=<目录> 将 <目录> 作为头文件和库文件的根目录。
-B <目录> 将 <目录> 添加到编译器的搜索路径中。
-v 显示编译器调用的程序。
-### 与 -v 类似,但选项被引号括住,并且不执行命令。
-E 仅作预处理,不进行编译、汇编或链接。
-S 编译到汇编语言,不进行汇编和链接,
-c 编译、汇编到目标代码,不进行链接。
-o <文件> 输出到 <文件>。
-pie 生成动态链接的位置无关可执行文件。
-shared 生成一个共享库。
-x <语言> 指定其后输入文件的语言。
允许的语言包括:c、c++、assembler、none
‘none’意味着恢复默认行为,即根据文件的扩展名猜测
源文件的语言。
以 -g、-f、-m、-O、-W 或 --param 开头的选项将由 gcc.exe 自动传递给其调用的
不同子进程。若要向这些进程传递其他选项,必须使用 -W<字母> 选项。
报告程序缺陷的步骤请参见:
<https://github.com/jmeubank/tdm-gcc/issues>.
I've also solved the aforementioned remaining problem recently.
> Remaining problem(s)
> - See the picture below. Maybe the problem is related to escape sequence strings.
It turns out that gcc/g++ outputs this message using `KERNEL32!WriteFile`. After looking into the assembly near the call to this function, I've found out that we just need to `nop` one instruction here:
*** 56877,56883 ****
435283: 89 df mov %ebx,%edi
435285: 83 c3 01 add $0x1,%ebx
435288: 83 f8 1f cmp $0x1f,%eax // probably determine whether a chunk ends?
! 43528b: 0f 86 e7 03 00 00 jbe 0x435678 // nop this instruction
435291: 0f b6 2b movzbl (%ebx),%ebp
435294: 83 f9 1b cmp $0x1b,%ecx // whether the character is ESC
435297: 0f 94 c2 sete %dl
--- 56877,56888 ----
435283: 89 df mov %ebx,%edi
435285: 83 c3 01 add $0x1,%ebx
435288: 83 f8 1f cmp $0x1f,%eax
! 43528b: 90 nop
! 43528c: 90 nop
! 43528d: 90 nop
! 43528e: 90 nop
! 43528f: 90 nop
! 435290: 90 nop
435291: 0f b6 2b movzbl (%ebx),%ebp
435294: 83 f9 1b cmp $0x1b,%ecx
435297: 0f 94 c2 sete %dl
The `File off` of the instruction:
0x43528b - 0x401000 + 0x400 = 0x3468b
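A quick Python sanity check of this second patch, using only the bytes and addresses from the listing above (the variable names are illustrative): it decodes the `jbe`, confirms that six `nop` bytes cover exactly that instruction, and recomputes the file offset.

```python
import struct

JBE = bytes.fromhex("0f86e7030000")   # bytes at 0x43528b in the listing
ADDR = 0x0043528B

assert JBE[:2] == b"\x0f\x86"         # two-byte opcode of `jbe rel32`
(rel32,) = struct.unpack("<i", JBE[2:])

fallthrough = ADDR + len(JBE)         # next instruction if the jump is not taken
target = fallthrough + rel32          # destination if the jump is taken

# Replacing the whole instruction with nops means execution always falls through.
patch = b"\x90" * len(JBE)

TEXT_VMA, TEXT_FILE_OFF = 0x00401000, 0x00000400
file_off = ADDR - TEXT_VMA + TEXT_FILE_OFF

print(hex(fallthrough), hex(target), hex(file_off), len(patch))
```

The decoded target matches the `0x435678` in the disassembly, and the fall-through address is the `movzbl` at `0x435291`, so the patched code always continues there.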
Meanwhile, let me answer another participant's question here.
@CFSO6459 What are your `[console]::InputEncoding`, `[console]::OutputEncoding` and `$OutputEncoding` in PowerShell? It seems that the problem is related to the encoding settings of PowerShell rather than gcc/g++.
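One possible reading of the garbled PowerShell output earlier in the thread, offered as an assumption rather than a confirmed diagnosis, is that gcc emits GBK bytes and the console interprets them with a Western single-byte code page. A minimal Python sketch of that effect, using Latin-1 as a stand-in for whatever code page is actually active:

```python
# Assumption for illustration: the console reads GBK bytes as Latin-1.
text = "用法"  # "Usage", the first word of the help text

gbk_bytes = text.encode("gbk")

# Each GBK byte becomes one accented Latin character or symbol.
misread = gbk_bytes.decode("latin-1")

# The mapping is lossless, so reversing it recovers the original text.
roundtrip = misread.encode("latin-1").decode("gbk")

print(gbk_bytes.hex(), misread, roundtrip)
```

The second half of the misread string is `·¨`, the same pair that appears near the start of the garbled `gcc --help` output, which fits this explanation. It would also explain why redirecting to a file looks fine in Notepad: the bytes themselves are valid GBK.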
Hello, thank you for providing this software, which is useful for beginners in C.
I'm a beginner who is just starting to learn C, and I have run into a lot of problems with Unicode encoding on Windows.
As #7 says, GDB cannot recognize non-ASCII characters.
One solution is to enable the option "Beta: Use Unicode UTF-8 for worldwide language support" in the Windows locale settings. In short, Windows used many code pages (e.g. GBK for Chinese) to support languages worldwide before Unicode was widely used, and they have been retained until now for compatibility.
As the MinGW mailing list and https://github.com/Microsoft/vscode-cpptools/issues/3444 mention, ASCII characters almost always work properly, but non-ASCII characters may become garbled.
Comparing GBK (`chcp` displays 936) with the "Beta: Use Unicode UTF-8 for worldwide language support" setting (`chcp` displays 65001), TDM-GCC displays garbled text like #36.
So is it possible for TDM-GCC to display messages correctly when "Beta: Use Unicode UTF-8 for worldwide language support" is enabled? Or is it a Windows console issue that can't be fixed by TDM-GCC?
Thanks.