guevara / read-it-later

read it later
231 stars 0 forks source link

Windows dynamic linking depends on the active code page #11820

Open guevara opened 1 month ago

guevara commented 1 month ago

Windows dynamic linking depends on the active code page

https://ift.tt/wC5DZVQ



nullprogram.com/blog/2024/10/07/

Windows paths have been WTF-16-encoded for decades, but module names in the import tables of Portable Executable are octets. If a name contains values beyond ASCII — technically out of spec — then the dynamic linker must somehow decode those octets into Unicode in order to construct a lookup path. There are multiple ways this could be done, and the most obvious is the process’s active code page (ACP), which is exactly what happens. As a consequence, the specific DLL loaded by the linker may depend on the system code page. In this article I’ll contrive such a situation.

LoadLibraryA is a similar situation, and potentially applies the code page to a longer portion of the module path. LoadLibraryW is unaffected, at least for the directly-named module, because it’s Unicode all the way through.

For my contrived demonstration I came up with two names that to English-reading eyes appears as two words with extraneous markings:

  • õral.dll: CP-1252="C3 B5 …"
  • õral.dll: CP-1252="F5 …"; UTF-8="C3 B5 …"

Both end with ral.dll. I’ve included the CP-1252 encoding for the differing prefixes, and the UTF-8 encoding for the second. I’m using CP-1252 because it’s the most common system code page in the world, especially the Western hemisphere. Due to case insensitivity, the actual DLL may be named ãµral.dll — i.e. to match the second library case — but the module name must be encoded as uppercase when building the import library. Alternatively the second could be Õral.dll, particularly because I won’t use it when constructing an import library.

The plan is to store the octets C3 B5 … in the import table. A process using CP-1252 decodes it to õral.dll. In the UTF-8 code page it decodes to õral.dll. For testing we can use an application manifest to control the code page for a particular PE image — a lot easier than changing the system code page. Otherwise, this trick could dynamically change the behavior of a program in response to the system code page without actually inspecting the active code page.

The libraries will have a single function get, which returns a string indicating which library was loaded:

#define X(s) #s
#define S(s) X(s)
__declspec(dllexport) char *get(void) { return S(V); }

Constructing the import library can be tricky because you must consider how the toolchain, editors, and shells decode and encode text, which may involve the build system’s code page. It’s shockingly difficult to script! Binutils dlltool cannot process these names and cannot be used at all. With bleeding edge w64devkit I could reliably construct the DLLs and import library like so, even in a script (Windows 10 and later only):

$ gcc -shared -DV=UTF-8 -o Õral.dll  detect.c
$ gcc -shared -DV=ANSI  -o õral.dll detect.c -Wl,--out-implib=detect.lib

That produces two DLLs and one import library, detect.lib, with the desired module name octets. A straightforward MSVC cl invocation also works so long as it’s not from a batch file. It will quite correctly warn about the strange name situation, which I like. My test program, main.c:

#include <stdio.h>

char *get(void);
int main(void) { puts(get()); }

I link detect.lib when I build it:

$ cc -o main.exe main.c detect.lib

I designed peports to print non-ASCII octets unambiguously (\xXX), and it’s the only tool I know that does so:

$ peports main.exe | tail -n 2
\xc3\xb5ral.dll
        1       get

The module name has the C3 B5 … prefix octets. When I run it under my system code page, CP-1252:

If I add a UTF-8 manifest, even just a “side-by-side” manifest, it loads the other library despite an identical import table:

$ cc -o main.exe main.c detect.lib libwinsane.o
$ ./main
UTF-8

Again, without the manifest, if I switched my system code page to UTF-8 then UTF-8 would still be the result.

I can’t think of much practical use for this trick outside of malware. In a real program it would be simpler to inspect code page, and there’s no benefit to avoiding such a check if it’s needed. Malware could use it to trick inspection tools and scanners that decode module names differently than the dynamic linker. Such tools often incorrectly assume UTF-8, which is what motivated this article.







    via null program

    October 12, 2024 at 03:16PM