jmeubank / tdm-gcc

TDM-GCC is a cleverly disguised GCC compiler for Windows!
https://jmeubank.github.io/tdm-gcc/

TDM-GCC UTF-8 Compatibility and non-ASCII character in windows #48

Open waitlamp opened 2 years ago

waitlamp commented 2 years ago

Hello, thank you for providing this software; it is very useful for beginners in C.

I'm a beginner who is just starting to learn C, and I have run into a lot of problems with Unicode encoding on Windows.

As #7 says, GDB cannot recognize non-ASCII characters. This is also mentioned in https://github.com/microsoft/vscode-cpptools/issues/602 (screenshot: gdberror).

One solution is to enable the option "Beta: Use Unicode UTF-8 for worldwide language support" in the Windows locale settings.

Microsoft has discussed the UTF-8 problem on their blog. I also found that the Windows console may not recognize non-ASCII input correctly (https://github.com/microsoft/WSL/issues/7675). I'm not sure whether this is related to MIEngine#1025.

In short, Windows used many code pages (e.g. GBK for Chinese) to support languages worldwide before Unicode was widely adopted, and they have been retained until now for compatibility.

As the mingw mailing list and https://github.com/Microsoft/vscode-cpptools/issues/3444 mention, ASCII characters almost always work properly, but non-ASCII characters may be garbled.

When using GBK (chcp displays 936), the messages are displayed correctly (screenshot: correcttips).

When "Beta: Use Unicode UTF-8 for worldwide language support" is enabled (chcp displays 65001), the messages are garbled (screenshots: errortips, gcc--help).

TDM-GCC displays garbled text, as in #36.

So is it possible for TDM-GCC to display its messages correctly when "Beta: Use Unicode UTF-8 for worldwide language support" is enabled? Or is this a Windows console issue that cannot be fixed by TDM-GCC?

Thanks.

imba-tjd commented 2 years ago

This is unrelated to MIEngine#1025.

Shuangcheng-Ni commented 1 year ago

I've also encountered this problem and found it a bit weird.

To begin with, the Chinese message file of gcc/g++ is <GCC install dir>\share\locale\zh_CN\LC_MESSAGES\gcc.mo, whose character encoding is UTF-8.

It seems that gcc/g++ first converts the Chinese message into a multibyte string of your system encoding and then always tries to decode it with GBK encoding when outputting it to console even if your system encoding is UTF-8.

Both dumpbin /imports gcc.exe and objdump -x gcc.exe suggest that gcc/g++ imports functions like libiconv from libiconv-2.dll, which is probably how gcc/g++ does encoding conversions.

Besides, I am debugging gcc/g++ with cdb and gdb these days, which is quite an awkward experience since I have to debug it instruction by instruction.

Fortunately, I've worked out the way gcc/g++ outputs messages:

Breakpoint 48 hit
eax=0096459d ebx=004a4410 ecx=00000000 edx=00000000 esi=772f9620 edi=00000000
eip=772bbeb0 esp=0073fdcc ebp=009839f0 iopl=0         nv up ei pl nz na pe nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000206
msvcrt!fputs:
772bbeb0 6a10            push    10h
0:000> g
閫夐」锛?
Breakpoint 48 hit
eax=00955d85 ebx=004a4410 ecx=00000000 edx=00000000 esi=772f9620 edi=00000000
eip=772bbeb0 esp=0073fdcc ebp=009839f0 iopl=0         nv up ei pl nz na pe nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000206
msvcrt!fputs:
772bbeb0 6a10            push    10h
0:000> g
  -pass-exit-codes         鍦ㄦ煇涓€闃舵閫€鍑烘椂杩斿洖鍏朵腑鏈€楂樼殑閿欒鐮併€?
Breakpoint 48 hit
eax=0095539c ebx=004a4410 ecx=00000000 edx=00000000 esi=772f9620 edi=00000000
eip=772bbeb0 esp=0073fdcc ebp=009839f0 iopl=0         nv up ei pl nz na pe nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000206
msvcrt!fputs:
772bbeb0 6a10            push    10h
0:000> g
...
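
The garbled strings in this trace look like UTF-8 bytes that were interpreted as GBK. The following Python sketch reproduces the first one. It assumes the message being printed is the translated "Options:" string (选项 followed by a fullwidth colon, U+FF1A); the fullwidth colon is inferred from the garbled text, not read out of gcc.mo. It only illustrates the symptom and says nothing about how gcc produces its output internally.

```python
# Illustration only: decode UTF-8 bytes with the GBK code page (CP936)
# to reproduce the garbled console text.
original = "选项\uff1a"  # assumed message: "Options" + fullwidth colon

utf8_bytes = original.encode("utf-8")

# The final byte (0x9a) is left without a partner in GBK, so it is
# replaced with U+FFFD, which many consoles render as "?".
garbled = utf8_bytes.decode("gbk", errors="replace")

print(utf8_bytes.hex())
print(garbled)
```

If the printed text starts with the same characters as the first garbled line in the trace, that supports the "UTF-8 read as GBK" explanation for that line.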

Shuangcheng-Ni commented 1 year ago

Good news! I've almost worked it out. Let me describe my debugging process and the solution.

Debugging process

First, launch the debugger: cdb g++ --help.

Second, set breakpoints on some crucial functions:

0:000> bp msvcrt!setlocale
0:000> bp libiconv_2!libiconv_open
*** WARNING: Unable to verify timestamp for C:\Program Files\TDM-GCC-64\bin\libiconv-2.dll
0:000> bl
 0 e 75587640     0001 (0001)  0:**** msvcrt!setlocale
 1 e 7cad4590     0001 (0001)  0:**** libiconv_2!libiconv_open

Then we should try to inspect these functions:

  1. The signature of setlocale is char *__cdecl setlocale(int _Category, const char *_Locale). The address of its first parameter int _Category is @esp + 4 and its value is 0x00000000 (i.e. LC_ALL). The address of its second parameter const char *_Locale is @esp + 8 and its value is 0x00504c2c, which points to the string "". So what g++ does here is setlocale(LC_ALL, "").
    0:000> g
    Breakpoint 0 hit
    *** WARNING: Unable to verify timestamp for C:\Program Files\TDM-GCC-64\bin\g++.exe
    eax=755e9640 ebx=0073fea0 ecx=0073fea0 edx=00000002 esi=007e4528 edi=007e4570
    eip=75587640 esp=0073fdec ebp=0073fec8 iopl=0         nv up ei pl nz na pe nc
    cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000206
    msvcrt!setlocale:
    75587640 6a18            push    18h
    0:000> dd @esp
    0073fdec  00438a77 00000000 00504c2c 75577e92
    0073fdfc  0073fea4 007e4528 007e4570 0073fea0
    0073fe0c  0040c68d 00000002 007e4528 00080000
    0073fe1c  0073fea4 0073fea8 7558823f 00000080
    0073fe2c  0073fea0 007e4528 007e4570 0073fea0
    0073fe3c  004aa04d 00000002 007e4528 007e3ab4
    0073fe4c  0073fe58 75598775 755e931c 0073fe98
    0073fe5c  75597c3d 00000008 7558817d 75588163
    0:000> da 504c2c
    00504c2c  ""
  2. The signature of libiconv_open is iconv_t libiconv_open(const char* tocode, const char* fromcode). The address of its first parameter const char* tocode is @esp + 4 and its value is 0x0073faf0, which points to the string "CP65001//TRANSLIT". The address of its second parameter const char* fromcode is @esp + 8 and its value is 0x0073fb10, which points to the string "UTF-8". So what g++ does here is libiconv_open("CP65001//TRANSLIT", "UTF-8").
    0:000> g
    Breakpoint 1 hit
    eax=0073fb10 ebx=01073f76 ecx=0073faf7 edx=0073faf0 esi=005211ac edi=007e2110
    eip=7cad4590 esp=0073fadc ebp=0073fb68 iopl=0         nv up ei pl nz na po nc
    cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000202
    libiconv_2!libiconv_open:
    7cad4590 55              push    ebp
    0:000> dd @esp
    0073fadc  0045f628 0073faf0 0073fb10 00000007
    0073faec  7558e170 35365043 2f313030 4152542f
    0073fafc  494c534e 00520054 0000002f 00000005
    0073fb0c  00000000 2d465455 00000038 00000005
    0073fb1c  0045f52f 010740c9 0051480e 00514804
    0073fb2c  0073fb4c 0073fb48 00000007 0073fb10
    0073fb3c  0073faf0 0000004c 0849f830 0073fb68
    0073fb4c  000001f8 007e0000 00000000 0000004c
    0:000> da 73faf0
    0073faf0  "CP65001//TRANSLIT"
    0:000> da 73fb10
    0073fb10  "UTF-8"
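
The two string arguments can also be read directly out of the dd @esp dump, without da. The sketch below is a small Python helper (the name dwords_to_cstring is made up for this illustration) that packs the dumped little-endian dwords back into bytes and cuts at the first NUL. The dword values are copied from the dump above, starting at 0x0073faf0 and 0x0073fb10.

```python
import struct


def dwords_to_cstring(dwords):
    """Pack little-endian 32-bit words and return the ASCII text before the first NUL."""
    raw = b"".join(struct.pack("<I", d) for d in dwords)
    return raw.split(b"\x00", 1)[0].decode("ascii")


# Dwords at 0x0073faf0 (first argument, tocode)
tocode = dwords_to_cstring(
    [0x35365043, 0x2F313030, 0x4152542F, 0x494C534E, 0x00520054]
)

# Dwords at 0x0073fb10 (second argument, fromcode)
fromcode = dwords_to_cstring([0x2D465455, 0x00000038])

print(tocode)    # CP65001//TRANSLIT
print(fromcode)  # UTF-8
```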

As you can see, libiconv converts the encodings correctly. The actual problem is that setlocale conflicts with fputc and fputs. Therefore, we can replace the call to setlocale with nop instructions to solve the problem.

  1. Find the address of the call instruction by backtracing the stack frames:
    0:000> .restart
    ...
    0:000> bl
    0:000> bp msvcrt!setlocale
    0:000> bl
    0 e 75587640     0001 (0001)  0:**** msvcrt!setlocale
    0:000> g
    Breakpoint 0 hit
    *** WARNING: Unable to verify timestamp for C:\Program Files\TDM-GCC-64\bin\g++.exe
    eax=755e9640 ebx=0073fea0 ecx=0073fea0 edx=00000002 esi=01104528 edi=01104570
    eip=75587640 esp=0073fdec ebp=0073fec8 iopl=0         nv up ei pl nz na pe nc
    cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000206
    msvcrt!setlocale:
    75587640 6a18            push    18h
    0:000> k
    ChildEBP RetAddr
    0073fde8 00438a77     msvcrt!setlocale
    WARNING: Stack unwind information not available. Following frames may be wrong.
    0073fec8 00401396     g__+0x38a77
    0073ff68 75ef7ba9     g__+0x1396
    0073ff84 772fb79b     KERNEL32!BaseThreadInitThunk+0x19
    0073ffdc 772fb71f     ntdll!__RtlUserThreadStart+0x2b
    0073ffec 00000000     ntdll!_RtlUserThreadStart+0x1b
  2. The RetAddr of setlocale is 0x00438a77, which is the address of the instruction right after the call. The call instruction (opcode e8 plus a 32-bit displacement) is 5 bytes long, so its address is 0x00438a77 - 5 = 0x00438a72.
    0:000> u 438a77 L-5
    g__+0x38a72:
    00438a72 e889b10600      call    g__+0xa3c00 (004a3c00)
    00438a77 c74424042d4c5000 mov     dword ptr [esp+4],offset g__+0x104c2d (00504c2d)
  3. Replace the call instruction with nop instructions:
    0:000> .restart
    ...
    0:000> u 438a72
    *** WARNING: Unable to verify timestamp for C:\Program Files\TDM-GCC-64\bin\g++.exe
    g__+0x38a72:
    00438a72 e889b10600      call    g__+0xa3c00 (004a3c00)
    00438a77 c74424042d4c5000 mov     dword ptr [esp+4],offset g__+0x104c2d (00504c2d)
    00438a7f c70424484c5000  mov     dword ptr [esp],offset g__+0x104c48 (00504c48)
    00438a86 e8a5680200      call    g__+0x5f330 (0045f330)
    00438a8b c70424484c5000  mov     dword ptr [esp],offset g__+0x104c48 (00504c48)
    00438a92 e8097c0200      call    g__+0x606a0 (004606a0)
    00438a97 c704244c4c5000  mov     dword ptr [esp],offset g__+0x104c4c (00504c4c)
    00438a9e e8cd690200      call    g__+0x5f470 (0045f470)
    0:000> eb 438a72 90 90 90 90 90
    0:000> u 438a72
    g__+0x38a72:
    00438a72 90              nop
    00438a73 90              nop
    00438a74 90              nop
    00438a75 90              nop
    00438a76 90              nop
    00438a77 c74424042d4c5000 mov     dword ptr [esp+4],offset g__+0x104c2d (00504c2d)
    00438a7f c70424484c5000  mov     dword ptr [esp],offset g__+0x104c48 (00504c48)
    00438a86 e8a5680200      call    g__+0x5f330 (0045f330)
    0:000> g
    用法:g++ [选项] 文件...
    选项:
    -pass-exit-codes         在某一阶段退出时返回其中最高的错误码。
    --help                   显示此帮助说明。
    --target-help            显示目标机器特定的命令行选项。
    --help={common|optimizers|params|target|warnings|[^]{joined|separate|undocumented}}[,...]。
                           显示特定类型的命令行选项。
    (使用‘-v --help’显示子进程的命令行参数)。
    ...
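
The address arithmetic in the steps above can be double-checked with a few lines of Python. This is only a sanity check of the numbers shown in the debugger output; the helper names are made up for this sketch. It relies on the x86 near call encoding (opcode e8 followed by a signed 32-bit displacement relative to the next instruction).

```python
import struct

CALL_REL32_LEN = 5  # opcode 0xE8 + 4-byte signed displacement


def call_site_from_return(ret_addr):
    """Address of a 5-byte near call, given the return address it pushed."""
    return ret_addr - CALL_REL32_LEN


def call_target(call_addr, insn_bytes):
    """Decode the destination of an E8 rel32 call located at call_addr."""
    if len(insn_bytes) != CALL_REL32_LEN or insn_bytes[0] != 0xE8:
        raise ValueError("not a 5-byte E8 call instruction")
    (rel32,) = struct.unpack("<i", insn_bytes[1:5])
    return call_addr + CALL_REL32_LEN + rel32


ret_addr = 0x00438A77  # RetAddr from the k output
call_addr = call_site_from_return(ret_addr)
target = call_target(call_addr, bytes.fromhex("e889b10600"))

print(hex(call_addr))  # 0x438a72
print(hex(target))     # 0x4a3c00
```

Both values agree with the disassembly: the call sits at 00438a72 and goes to 004a3c00.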

Solution

Finally, it's time to modify g++.exe!

  1. Run objdump -h g++.exe. The VMA of .text is 0x00401000, and its File off is 0x00000400.
    g++.exe:     file format pei-i386

    Sections:
    Idx Name          Size      VMA       LMA       File off  Algn
      0 .text         000adbd8  00401000  00401000  00000400  2**4
                      CONTENTS, ALLOC, LOAD, READONLY, CODE, DATA
      1 .data         00000804  004af000  004af000  000ae000  2**5
                      CONTENTS, ALLOC, LOAD, DATA
      2 .rdata        0006cde0  004b0000  004b0000  000aea00  2**5
                      CONTENTS, ALLOC, LOAD, READONLY, DATA
      3 .bss          000051d4  0051d000  0051d000  00000000  2**5
                      ALLOC
      4 .idata        00001688  00523000  00523000  0011b800  2**2
                      CONTENTS, ALLOC, LOAD, DATA
      5 .CRT          00000038  00525000  00525000  0011d000  2**2
                      CONTENTS, ALLOC, LOAD, DATA
      6 .tls          00000008  00526000  00526000  0011d200  2**2
                      CONTENTS, ALLOC, LOAD, DATA
      7 .rsrc         000004e8  00527000  00527000  0011d400  2**2
                      CONTENTS, ALLOC, LOAD, DATA
      8 .reloc        0000d4f0  00528000  00528000  0011da00  2**2
                      CONTENTS, ALLOC, LOAD, READONLY, DATA
  2. Calculate the File off of the call instruction mentioned before:
    0x438a72 - 0x401000 + 0x400 = 0x37e72
  3. Replace the bytes at file offsets 0x37e72 through 0x37e76 of g++.exe with 0x90.
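
For anyone who prefers a script to a hex editor, here is a minimal Python sketch of the same patch. The function names and the output file name are hypothetical. It assumes the .text VMA and File off reported by objdump -h above, and it refuses to patch unless the bytes at the computed offset match the call instruction seen in the debugger, since other TDM-GCC builds will have different addresses.

```python
def vma_to_file_offset(vma, section_vma, section_file_off):
    """Translate an address inside a section to its offset in the file."""
    return vma - section_vma + section_file_off


def nop_out(image, offset, expected):
    """Return a copy of image with the expected bytes at offset replaced by NOPs."""
    end = offset + len(expected)
    if image[offset:end] != expected:
        raise ValueError("unexpected bytes at offset; wrong build or already patched")
    patched = bytearray(image)
    patched[offset:end] = b"\x90" * len(expected)
    return bytes(patched)


TEXT_VMA = 0x00401000       # from objdump -h
TEXT_FILE_OFF = 0x00000400  # from objdump -h

offset = vma_to_file_offset(0x438A72, TEXT_VMA, TEXT_FILE_OFF)
print(hex(offset))  # 0x37e72

# Hypothetical usage; writes a patched copy and leaves the original untouched:
# with open("g++.exe", "rb") as f:
#     data = f.read()
# with open("g++.patched.exe", "wb") as f:
#     f.write(nop_out(data, offset, bytes.fromhex("e889b10600")))
```

Writing to a separate file keeps the original g++.exe available in case the patch misbehaves.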

Remaining problem(s)

  1. See the picture below. Maybe the problem is related to escape sequence strings.
    image: https://github.com/jmeubank/tdm-gcc/assets/110970449/2eb58455-8dc6-45bf-92b8-6a32edcdbaf7

CFSO6459 commented 11 months ago

The encoding issue is still there. This is what it looks like in PowerShell (the situation is similar in cmd):

PS D:\> gcc --help
ó?·¨£ogcc.exe [????] ???t...
????£o
  -pass-exit-codes         ?ú?3ò??×??í?3?ê±·μ?????D×???μ?′í?ó???£
  --help                   ??ê?′?°??ú?μ?÷?£
  --target-help            ??ê???±ê?ú?÷ì??¨μ??üá?DD?????£
  --help={common|optimizers|params|target|warnings|[^]{joined|separate|undocumented}}[,...]?£
                           ??ê?ì??¨ààDíμ??üá?DD?????£
 £¨ê1ó???-v --help?ˉ??ê?×ó??3ìμ??üá?DD2?êy£??£
  --version                ??ê?±àò??÷°?±?D??¢?£
  -dumpspecs               ??ê??ùóD?ú?¨ spec ×?·?′??£
  -dumpversion             ??ê?±àò??÷μ?°?±?o??£
  -dumpmachine             ??ê?±àò??÷μ???±ê′|àí?÷?£
  -print-search-dirs       ??ê?±àò??÷μ????÷?·???£
  -print-libgcc-file-name  ??ê?±àò??÷°é???aμ???3??£
  -print-file-name=<?a>    ??ê? <?a> μ?íê???·???£
  -print-prog-name=<3ìDò>  ??ê?±àò??÷×é?t <3ìDò> μ?íê???·???£
  -print-multiarch         ??ê???±êμ?±ê×? GNU èy?a×飨±?ó?óú?a?·??μ?ò?2?·?£??£
  -print-multi-directory   ??ê?2?í?°?±? libgcc μ??ù?????£
  -print-multi-lib         ??ê??üá?DD????oí?à??°?±??a???÷?·????μ?ó3é??£
  -print-multi-os-directory ??ê?2ù×÷?μí3?aμ??à???·???£
  -print-sysroot           ??ê???±ê?a?????£
  -print-sysroot-headers-suffix ??ê?ó?óú?°?òí·???tμ? sysroot oó×o?£
  -Wa,<????>               ???oo?·???μ? <????> ′?μY????±à?÷?£
  -Wp,<????>               ???oo?·???μ? <????> ′?μY???¤′|àí?÷?£
  -Wl,<????>               ???oo?·???μ? <????> ′?μY??á′?ó?÷?£
  -Xassembler <2?êy>       ?? <2?êy> ′?μY????±à?÷?£
  -Xpreprocessor <2?êy>    ?? <2?êy> ′?μY???¤′|àí?÷?£
  -Xlinker <2?êy>          ?? <2?êy> ′?μY??á′?ó?÷?£
  -save-temps              2?é?3y?D?????t?£
  -save-temps=<2?êy>       2?é?3y?D?????t?£
  -no-canonical-prefixes   éú3é???? gcc ×é?tμ??à???·??ê±2?éú3é1?·??ˉμ?
                           ?°×o?£
  -pipe                    ê1ó?1üμà′úì?áùê±???t?£
  -time                    ?a????×ó??3ì??ê±?£
  -specs=<???t>            ó? <???t> μ??úèY?2???ú?¨μ? specs ???t?£
  -std=<±ê×?>              ?ù?¨ê?è??′???t×??-???¨μ?±ê×??£
  --sysroot=<????>         ?? <????> ×÷?aí·???toí?a???tμ??ù?????£
  -B <????>                ?? <????> ìí?óμ?±àò??÷μ????÷?·???D?£
  -v                       ??ê?±àò??÷μ÷ó?μ?3ìDò?£
  -###                     ó? -v àà??£?μ?????±?òyo?à¨×?£?2¢?ò2??′DD?üá??£
  -E                       ??×÷?¤′|àí£?2???DD±àò??¢??±à?òá′?ó?£
  -S                       ±àò?μ???±àó???£?2???DD??±àoíá′?ó£?
  -c                       ±àò??¢??±àμ???±ê′ú??£?2???DDá′?ó?£
  -o <???t>                ê?3?μ? <???t>?£
  -pie                     éú3é?ˉì?á′?óμ??????T1??é?′DD???t?£
  -shared                  éú3éò???12?í?a?£
  -x <ó???>                ???¨??oóê?è????tμ?ó????£
                           ?êDíμ?ó???°üਣoc?¢c++?¢assembler?¢none
                           ??none?ˉòa??×????′??è?DD?a£??′?ù?Y???tμ?à??1??2?2a
                           ?′???tμ?ó????£

ò? -g?¢-f?¢-m?¢-O?¢-W ?ò --param ?aí·μ???????óé gcc.exe ×??ˉ′?μY????μ÷ó?μ?
 2?í?×ó??3ì?£è?òa?ò?aD???3ì′?μY????????£?±?D?ê1ó? -W<×???> ?????£

±¨??3ìDòè±?Yμ?2??è??2???£o
<https://github.com/jmeubank/tdm-gcc/issues>.

ChatGPT gave a temporary workaround: redirect the output to a file:

PS D:\> gcc --help > output.txt

And this is what it looks like in Notepad:

用法:gcc.exe [选项] 文件...
选项:
  -pass-exit-codes         在某一阶段退出时返回其中最高的错误码。
  --help                   显示此帮助说明。
  --target-help            显示目标机器特定的命令行选项。
  --help={common|optimizers|params|target|warnings|[^]{joined|separate|undocumented}}[,...]。
                           显示特定类型的命令行选项。
 (使用‘-v --help’显示子进程的命令行参数)。
  --version                显示编译器版本信息。
  -dumpspecs               显示所有内建 spec 字符串。
  -dumpversion             显示编译器的版本号。
  -dumpmachine             显示编译器的目标处理器。
  -print-search-dirs       显示编译器的搜索路径。
  -print-libgcc-file-name  显示编译器伴随库的名称。
  -print-file-name=<库>    显示 <库> 的完整路径。
  -print-prog-name=<程序>  显示编译器组件 <程序> 的完整路径。
  -print-multiarch         显示目标的标准 GNU 三元组(被用于库路径的一部分)。
  -print-multi-directory   显示不同版本 libgcc 的根目录。
  -print-multi-lib         显示命令行选项和多个版本库搜索路径间的映射。
  -print-multi-os-directory 显示操作系统库的相对路径。
  -print-sysroot           显示目标库目录。
  -print-sysroot-headers-suffix 显示用于寻找头文件的 sysroot 后缀。
  -Wa,<选项>               将逗号分隔的 <选项> 传递给汇编器。
  -Wp,<选项>               将逗号分隔的 <选项> 传递给预处理器。
  -Wl,<选项>               将逗号分隔的 <选项> 传递给链接器。
  -Xassembler <参数>       将 <参数> 传递给汇编器。
  -Xpreprocessor <参数>    将 <参数> 传递给预处理器。
  -Xlinker <参数>          将 <参数> 传递给链接器。
  -save-temps              不删除中间文件。
  -save-temps=<参数>       不删除中间文件。
  -no-canonical-prefixes   生成其他 gcc 组件的相对路径时不生成规范化的
                           前缀。
  -pipe                    使用管道代替临时文件。
  -time                    为每个子进程计时。
  -specs=<文件>            用 <文件> 的内容覆盖内建的 specs 文件。
  -std=<标准>              假定输入源文件遵循给定的标准。
  --sysroot=<目录>         将 <目录> 作为头文件和库文件的根目录。
  -B <目录>                将 <目录> 添加到编译器的搜索路径中。
  -v                       显示编译器调用的程序。
  -###                     与 -v 类似,但选项被引号括住,并且不执行命令。
  -E                       仅作预处理,不进行编译、汇编或链接。
  -S                       编译到汇编语言,不进行汇编和链接,
  -c                       编译、汇编到目标代码,不进行链接。
  -o <文件>                输出到 <文件>。
  -pie                     生成动态链接的位置无关可执行文件。
  -shared                  生成一个共享库。
  -x <语言>                指定其后输入文件的语言。
                           允许的语言包括:c、c++、assembler、none
                           ‘none’意味着恢复默认行为,即根据文件的扩展名猜测
                           源文件的语言。

以 -g、-f、-m、-O、-W 或 --param 开头的选项将由 gcc.exe 自动传递给其调用的
 不同子进程。若要向这些进程传递其他选项,必须使用 -W<字母> 选项。

报告程序缺陷的步骤请参见:
<https://github.com/jmeubank/tdm-gcc/issues>.

Shuangcheng-Ni commented 10 months ago

I've also solved the aforementioned remaining problem recently.

Remaining problem(s)

  1. See the picture below. Maybe the problem is related to escape sequence strings. (image)

It turns out that gcc/g++ outputs this message using KERNEL32!WriteFile. After looking at the assembly near the call to this function, I found that we only need to nop one instruction:

*** 56877,56883 ****
    435283:     89 df                   mov    %ebx,%edi
    435285:     83 c3 01                add    $0x1,%ebx
    435288:     83 f8 1f                cmp    $0x1f,%eax // probably determine whether a chunk ends?
!   43528b:     0f 86 e7 03 00 00       jbe    0x435678 // nop this instruction
    435291:     0f b6 2b                movzbl (%ebx),%ebp
    435294:     83 f9 1b                cmp    $0x1b,%ecx // whether the character is ESC
    435297:     0f 94 c2                sete   %dl
--- 56877,56888 ----
    435283:     89 df                   mov    %ebx,%edi
    435285:     83 c3 01                add    $0x1,%ebx
    435288:     83 f8 1f                cmp    $0x1f,%eax
!   43528b:     90                      nop
!   43528c:     90                      nop
!   43528d:     90                      nop
!   43528e:     90                      nop
!   43528f:     90                      nop
!   435290:     90                      nop
    435291:     0f b6 2b                movzbl (%ebx),%ebp
    435294:     83 f9 1b                cmp    $0x1b,%ecx
    435297:     0f 94 c2                sete   %dl

The File off of the instruction:

0x43528b - 0x401000 + 0x400 = 0x3468b

That's all! (image)

Meanwhile, let me answer another participant's question here. @CFSO6459 What are your [console]::InputEncoding, [console]::OutputEncoding and $OutputEncoding in PowerShell? It seems that the problem is related to the encoding settings of PowerShell rather than gcc/g++.
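
One way to test this hypothesis offline is a small Python sketch. It assumes the first message is the translated "Usage:" string (用法 plus a fullwidth colon) and uses Latin-1 as a stand-in for whatever single-byte code page the console actually applied, which the thread does not state. If gcc emitted GBK bytes and the shell decoded them as a Western single-byte encoding, the result would contain the same ·¨£ run that appears at the start of the garbled PowerShell output.

```python
# Illustration only: GBK-encoded text read as a single-byte Western encoding.
expected = "用法\uff1a"  # assumed message: "Usage" + fullwidth colon

gbk_bytes = expected.encode("gbk")

# Latin-1 is an assumption here; it maps every byte to one character,
# so no information is lost and the mistake can be reversed.
as_latin1 = gbk_bytes.decode("latin-1")

# Undo the wrong decoding to confirm the bytes themselves were intact.
recovered = as_latin1.encode("latin-1").decode("gbk")

print(gbk_bytes.hex())
print(as_latin1)
print(recovered == expected)
```

This would also be consistent with the redirected output.txt looking correct in Notepad, since the file would hold the GBK bytes unchanged.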