llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
28.5k stars 11.78k forks source link

llvm-rc chokes on cp1252 #57367

Open apodtele opened 2 years ago

apodtele commented 2 years ago
[build] llvm-rc: Error in VERSIONINFO statement (ID 1):
[build] Non-ASCII 8-bit codepoint (169) can't occur in a non-Unicode string

This error was reported to us while compiling ftver.rc. The file specifies cp1252 which encodes © as "\251..." or decimal 169. This is all legal and understood by rc and windres. Non-Unicode strings are not limited to ASCII.

efriedma-quic commented 2 years ago

The current behavior is intentional; see https://github.com/llvm/llvm-project/blob/4e44c22c97be99f7c3864fb9fc68b3295835d837/llvm/tools/llvm-rc/ResourceFileWriter.cpp#L194 . See also https://devblogs.microsoft.com/oldnewthing/20190607-00/?p=102569 . It should work if you explicitly specify "/c1252" on the command-line.

CC @mstorsjo

efriedma-quic commented 2 years ago

(See also https://reviews.llvm.org/D46238)

mstorsjo commented 2 years ago

Yep. Contrary to rc.exe and GNU windres, llvm-rc (and llvm-windres) don't default to the current active code page, when running on Windows, but llvm-rc on Windows behaves exactly like it would on other OSes when cross compiling (where there's no concept of an active code page).

Additionally, llvm-rc doesn't unfortunately (yet) support the code page pragmas as are mentioned in that excellent blog post, so you do need to specify it on the command line.

apodtele commented 2 years ago

The BLOCK "040904E4" means cp1252. VALUE "Translation", 0x409, 1252 also means cp1252. Again, where does it say that non-Unicode string should be ASCII only??? I do not think that active code page has anything to do with it.

apodtele commented 2 years ago

Why do I need to specify on the command line what is already clearly and unambiguously stated in the file itself?

apodtele commented 2 years ago

Basically, you are confusing VERSIONINFO and STRINGTABLE resources. The former explicitly specifies encoding, the latter is indeed ASCII or Unicode.

mstorsjo commented 2 years ago

The BLOCK "040904E4" means cp1252. VALUE "Translation", 0x409, 1252 also means cp1252. Again, where does it say that non-Unicode string should be ASCII only??? Why do I need to specify on the command line what is already clearly and unambiguously stated in the file itself?

To the tools that assemble the resource data, those are mere strings of data that will be interpreted by whoever reads the resource in the end; neither GNU windres nor MS rc.exe look at them to determine how to convert non-Unicode strings into Unicode.

I do not think that active code page has anything to do with it.

It does. I do encourage you to try building this resource file with GNU windres or MS rc.exe on Windows with the default code page set to something else - I just did; I set set the "Current language for non-Unicode programs" setting to Japanese, so that the system's active code page is 932.

In that case, the narrow string "\251" gets encoded as the UTF16 character 0xFF69. If I pass -c1252 to rc.exe, or -c 1252 to GNU windres, I get the expected UTF16 char 0x00A9 in the output resource file, despite the current active code page.

We can also peek into e.g. the GNU windres source about this. See https://github.com/bminor/binutils-gdb/blob/master/binutils/winduni.c#L179-L185, where it defaults the codepages to CP_ACP. This is then later passed to wind_MultiByteToWideChar, which calls MultiByteToWideChar on Windows, and invokes iconv on other platforms (if available): https://github.com/bminor/binutils-gdb/blob/master/binutils/winduni.c#L746-L828 (Exactly what iconv does for CP_ACP in this case, I don't really know.)

If you prefix the string with L, i.e. L"\251 ...", then the escaped character is taken unambiguously as the Unicode code point 169, 0x00A9, as intended - with all three tools (MS rc.exe, GNU windres, llvm-rc) - regardless if you override the codepage with command line parameters or run the tools in environments with a different ACP.

I guess one could consider defaulting to CP 1252 instead of our current default (which only allows plain ASCII), since we don't do the lookup from ACP - that would probably be less disruptive than the current behaviour; I set this as default in the llvm-windres frontend when that was added.

apodtele commented 2 years ago

neither GNU windres nor MS rc.exe look at them to determine how to convert non-Unicode strings into Unicode.

In the case of VERSIONINFO the strings should be taken verbatim without any conversion. At least this is how I read https://docs.microsoft.com/en-us/windows/win32/menurc/stringtable with szKey also taken from BLOCK.

I am pretty sure that VERSIONINFO predates Unicode and is designed around explicit code pages. STRINGTABLE, on the other hand, has all these workarounds and pragma tricks.

mstorsjo commented 2 years ago

neither GNU windres nor MS rc.exe look at them to determine how to convert non-Unicode strings into Unicode.

In the case of VERSIONINFO the strings should be taken verbatim without any conversion. At least this is how I read https://docs.microsoft.com/en-us/windows/win32/menurc/stringtable with szKey also taken from BLOCK.

I am pretty sure that VERSIONINFO predates Unicode and is designed around explicit code pages.

That may very well be the case, but all current tools (MS rc.exe, GNU windres, llvm-rc) generate VERSIONINFO structs with all strings converted to unicode.

apodtele commented 2 years ago

Do you then suggest that we switch to BLOCK "040904B0" and prefix everything with L?

mstorsjo commented 2 years ago

Do you then suggest that we switch to BLOCK "040904B0" and prefix everything with L?

Prefixing the strings with L makes an output which is bitwise identical to what it was before (as long as the strings were interpreted as the right codepage) - I just tested it with GNU windres - so strictly speaking, doing such a change doesn't require switching the name of the versioninfo variant. I'm not entirely familiar with what logic windows uses there for resolving which versioninfo block to use etc, so I'm not sure if it's more correct or not to change the variant name, or if it makes any difference (better or worse).

apodtele commented 2 years ago

Can you check if rc alters szKey/BLOCK in the StringTable structure? If you are correct about using UTF-16 everywhere, it does not make sense to leave it L"040904E4".

mstorsjo commented 2 years ago

Can you check if rc alters szKey/BLOCK in the StringTable structure? If you are correct about using UTF-16 everywhere, it does not make sense to leave it L"040904E4".

I tested with MS rc.exe too; I added L in front of every string literal in the file, and it still produces a .res file that is bitwise identical to the one created from the original file. (GNU windres didn't like L in front of the string literals to BLOCK though, but with all other literals changed, it's also bitwise identical to the original with windres too.)

apodtele commented 2 years ago

The correct way to escape MultiByteToWideChar is L"\x00A9" which would be taken verbatim. The octals seem to be translated. I guess llvm-rc will take that too. I will also switch to "040904B0".

mstorsjo commented 2 years ago

The correct way to escape MultiByteToWideChar is L"\x00A9" which would be taken verbatim. The octals seem to be translated. I guess llvm-rc will take that too.

Yes, llvm-rc does support that kind of escaping too. However, I don't see what you mean about octals being translated - octal and hexadecimal escapes are treated exactly the same - it's just the difference between narrow and unicode string which changes whether they're interpreted according to the codepage or as unicode.

apodtele commented 2 years ago

Indeed. Going back to the original problem and having to convert narrow strings to UTF-16 anyway, I think defaulting to cp1252 makes perfect sense because it is so similar ISO8859-1.