llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
29.37k stars 12.14k forks source link

[LLD][COFF] LIB-imported symbols seem to have weaker precedence than LIB-defined symbols #82050

Open aganea opened 9 months ago

aganea commented 9 months ago

I found an interesting issue in LLD that I was chasing down for more than a year. The problem revolves around XInput1_4.dll. I wanted to share how I got there, since others might see this problem before we land a fix.

First signs

At first, we wanted to use rust-lld.exe along with our Rust monorepo. This worked locally...

(in .cargo/config.toml)

[target.x86_64-pc-windows-msvc]
linker = "rust-lld"

...but not remotely. Our local machines were on Windows 10 while the remote build system was on Windows Server 2019 cloud VMs. We got super-weird errors, and only on certain Rust crates. For example this one:

image

Another time, it was on the sqlx-macros crate. Note its dependencies:

sqlx-macros v0.6.1 (proc-macro) (C:\src\github.com\launchbadge\sqlx\sqlx-macros)
├── dotenvy v0.15.1
│   └── dirs v4.0.0
│       └── dirs-sys v0.3.7
│           └── winapi v0.3.9

Since our monorepo uses workspaces, feature unification would kick in. We also happened to be experimenting with Bevy, so the the winapi crate would be compiled with all the features used in the workspace, such as "xinput" even though sqlx-macros didn't really need it here.

Strangely the build only failed when running headless, through a SSH session. When a user was connected on the VM, the build would succeed. We managed to capture minidumps of the failure, and inspection was showing an error deep down some core Microsoft libraries. At first we suspected an ABI incompatibility on Windows Server 2019. However it seemed that it had to do with XINPUT1_4.dll. Essentially the DLL load graph was rust.exe -> sqlx_macros-XXX.dll -> XINPUT1_4.DLL -> inputhost.dll -> CoreMessaging.dll then a call to RaiseFailFastException().

image (3)

I left this aside for a while, since build times weren't that much of an issue. We were planning on upgrading the VMs to Windows Server 2022 later that year, so I postponed the rust-lld update.

Again

At some point I wanted to upgrade our C++ codebase to using clang-cl and lld-link. Things worked locally, although our unit tests failed on the preflight remotely on the VMs. They were failing on startup with code exit status 0xc0000409. At first I thought this was a crash in our code caused by UB with clang-cl (this happened to me at Ubisoft during our first clang-cl integration). Strangely I wasn't getting this error locally. I got taken with other things, so I decided to look at this later on.

We ended up upgrading our VMs to Windows Server 2022, and during my spare time between other things, I tried preflighting my clang-cl upgrade again. Same thing, same error. I was quite puzzled, and I was starting to rule out the ABI incompatibility I though I had with Rust at first. It was inconceivable that the same failure would be carried on several CRT updates and Windows OS upgrades.

Yet, again.

Months later (now) I got a new Windows 11 install. I wanted to test LLVM 18.0 and un-shelved again my clang-cl upgrade for our C++ codebase. After fixing a number of compilation errors, on the first boot our program complained about not finding XInput1_3.dll. Installing that in C:\Windows\System fixed that particular issue. Now it was crashing on XInput1_4.dll:

crash_xinput_1

I did not immediately make the connection with the past issues I had with Rust compilation. There was something strange about this callstack but I couldn't put my finger on it. While looking for clues on the Internet, I stomped on this thread. I immediately made the connection with the Rust build issue I had a year ago! However there wasn't really a resolution, just a suggestion to replace XInput1_4.dll with XInput1_3.dll.

What.

Do what??

So yeah. Duplicating XInput1_3.dll and renaming it to XInput1_4.dll fixed the issue. :unamused: No more crashes.

I just couldn't leave it to that, I decided to dig a bit more to understand the problem. Putting back XInput1_4.dll as it was, restarting and stepping through the asm code manually reveled that it had something to do with calls that XInput was doing to the Windows Event Tracing API. It failed because one of the HANDLEs in the structure used to initialize the evntprov.h API was already initialized (see the highlighted value in the Memory window).

crash_xinput_2

How could that be? Was it a memory corruption? But the HANDLE looked legit, so that structure must have been already initialized somewhere else. Maybe the DLL was loaded elsewhere? Let's put a breakpoint in that function, see if it was called before. Indeed:

crash_xinput_3

Looking at the memory window, the location where the HANDLE is hasn't been written yet. Also the callstack shows a different codepath when entering DllMain. I was able to confirm that this was happening after the call to LoadLibrary, but before the entry point of my DLL was executed. Essentially, at this point the Windows OS loader was initializing the DLL dependency graph.

Now if I hit Run again, the program will stop at the same location a second time. That's where the bug happens. Looking at the callstack again, something fishy appears to me: why is the entry point of my DLL calling XInput1_4.DLL!DllMain() ? Looking at the assembly:

crash_xinput_4

Looks legit. But then, stepping inside:

crash_xinput_5

Wait a second. Why is that an import symbol??

That location is an IAT entry that points to XInput1_4.DLL!DllMain(). At that point I was pretty sure that I was experiencing the side-effect of a bug in LLD. I recompiled LLD from source, quickly stepped through it while it was linking my DLL, and then it become clear what the problem was.

The bug and a repro

Essentially, XInput1_4.dll exports its DllMain symbol (why?). This seems to be a mistake on Microsoft's end. But luckily that triggers our bug in LLD :smile: XInput.lib appears first on the command-line, then /DEFAULTLIB:msvcrt.lib where the default (weak) DllMain function is. Due to the lazy nature of how the inputs are parsed in LLD, the XInput1_4.dll!DllMain symbol is placed first in the symbol table, while scanning all the inputs. Then comes the msvcrt.lib DllMain symbol which does not overwrite the symbol table entry, since it's already there. Later, when another function pulls on DllMain, LLD pulls on the OBJ inside XInput.lib, and replaces that symbol table entry with its (XInput import lib) import symbol.

The repro is quite straightforward:

// b1.cpp
extern "C" int f();
int b() {
    return f();
}

// b2.cpp
extern "C" int f() {
    return 0;
}

// a.cpp
extern "C" __declspec(dllexport) int f() {
    return 1;
}

// main.cpp
extern int b();
int main() {
    return b();
}

rem --- build.bat ---
cl /c /Z7 b1.cpp b2.cpp
lib /out:b.lib b1.obj b2.obj

cl /LD /Z7 a.cpp

cl /c /Z7 main.cpp

rem set LINKER=c:\git\llvm-project\stage1_msvc_debug\bin\lld-link
set LINKER=link

%LINKER% /debug /nodefaultlib /out:main.exe main.obj a.lib b.lib /entry:main
rem --- end of build.bat ---

// After building, with MSVC link.exe the output is:
> main.exe
> echo %ERRORLEVEL%
0

// With LLD the output is:
> main.exe
> echo %ERRORLEVEL%
1

MSVC somehow seems to prefer a fully defined symbol rather than an imported symbol. Another possibility is that it tends to favor a symbol inside the same .LIB rather than one defined elsewhere. At this point, if someone from the MSVC team could comment, that would be very helpful. I have a tentative fix, but I'm not sure what the "good" behavior should be.

We could categorize this as the "annoying" class of bugs with possibly a workaround. However given how annoying it was to find, it might be that other folks will see different inceptions of it, so ideally I'd prefer fixing it in LLD.

@mstorsjo @rnk @tru @sylvain-audi @compnerd @MaskRay

llvmbot commented 9 months ago

@llvm/issue-subscribers-lld-coff

Author: Alexandre Ganea (aganea)

I found an interesting issue in LLD that I was chasing down for more than a year. The problem revolves around `XInput1_4.dll`. I wanted to share how I got there, since others might see this problem before we land a fix. ### First signs At first, we wanted to use `rust-lld.exe` along with our Rust monorepo. This worked locally: ``` (in .cargo/config.toml) [target.x86_64-pc-windows-msvc] linker = "rust-lld" ``` But remotely, it wouldn't work. Our local machines were on Windows 10 while the remote build system was on Windows Server 2019 cloud VMs. We got super-weird errors, and only on certain Rust crates. For example this one: ![image](https://github.com/llvm/llvm-project/assets/37383324/49bae0ff-8ab3-4e27-9337-32a32a788002) Another time, it was on the `sqlx-macros` crate, which has the following dependencies: ``` sqlx-macros v0.6.1 (proc-macro) (C:\src\github.com\launchbadge\sqlx\sqlx-macros) ├── dotenvy v0.15.1 │ └── dirs v4.0.0 │ └── dirs-sys v0.3.7 │ └── winapi v0.3.9 ``` Since our monorepo uses workspaces, [feature unification](https://doc.rust-lang.org/cargo/reference/features.html#feature-unification) would kick in. We also happened to be experimenting with [Bevy](https://bevyengine.org/), so the the `winapi` would contain all used features in the workspace, such as [`"xinput"`](https://github.com/retep998/winapi-rs/blob/0.3/Cargo.toml#L418) even though `sqlx-macros` didn't really need it here. Strangely the build only failed when running headless, through a SSH session. When a user was connected on the VM, the build would succeed. We managed to capture minidumps of the failure, and inspection was showing an error deep down some core Microsoft libraries. At first we suspected an ABI incompatibility on Windows Server 2019. However it seemed that it had to do with `XINPUT1_4.dll`. Essentially the DLL load graph was `rust.exe` -> `sqlx_macros-XXX.dll` -> `XINPUT1_4.DLL` -> `inputhost.dll ` -> `CoreMessaging.dll` then a call to `RaiseFailFastException()`. ![image (3)](https://github.com/llvm/llvm-project/assets/37383324/8391c76d-ca0c-4dc0-b57a-96e791ef8b67) I left this aside for a while, since build times weren't that much of an issue. We were planning on upgrading the VMs to Windows Server 2022 later that year, so I postponed the `rust-lld` update. ### Again At some point I wanted to upgrade our C++ codebase to using `clang-cl` and `lld-link`. Things worked locally, although our unit tests failed on the preflight remotely on the VMs. They were failing on startup with `code exit status 0xc0000409`. At first I thought this was a crashe in our code caused by UB with `clang-cl` (this happened to me at Ubisoft during our first `clang-cl` integration). Strangely I wasn't getting this error locally. I got taken with other things, so I decided to look at this later on. We ended up upgrading our VMs to Windows Server 2022, and during my spare time between other things, I tried preflighting my `clang-cl` upgrade again. Same thing, same error. I was quite puzzled, and I was starting to rule out the ABI incompatibility I though I had with Rust at first. It was inconceivable that the same failure would be carried on several CRT updates and Windows OS upgrades. ### Yet, again. Months later (now) I got a new Windows 11 install. I wanted to test LLVM 18.0 and un-shelved again my `clang-cl` upgrade for our C++ codebase. After fixing a number of compilation errors, on the first boot our program complained about not finding `XInput1_3.dll`. Installing that in `C:\Windows\System` fixed that particular issue. Now it was now crashing on `XInput1_4.dll`. <img width="1905" alt="crash_xinput_1" src="https://github.com/llvm/llvm-project/assets/37383324/1dd29ca8-5c0b-4c8a-bbd2-48ed4d0314c8"> I did not immediately make the connection with the past issues I had with Rust compilation. There was something strange about this callstack but I couldn't put my finger on it. While looking for clues on the Internet, I stomped on [this thread](https://github.com/bevyengine/bevy/issues/2921). I immediately made the connection with the Rust build issue I had a year ago! However there wasn't really a resolution, [just a suggestion](https://github.com/bevyengine/bevy/issues/2921#issuecomment-1207163473) to replace `XInput1_4.dll` with `XInput1_3.dll`. What. Do what?? So yeah. Duplicating `XInput1_3.dll` and renaming it to `XInput1_4.dll` fixed the issue. :unamused: No more crashes. I just couldn't leave it to that, I decided to dig a bit more to understand the problem. Putting back `XInput1_4.dll` as it was, restarting and stepping through the asm code manually reveled that it had something to do with calls that XInput was doing to the [Windows Event Tracing API](https://learn.microsoft.com/en-us/windows-hardware/drivers/devtest/wpp-software-tracing). It failed because one of the HANDLEs in the structure used to initialize the `evntprov.h` API was already initialized _(see the highlighted value in the Memory window)_. <img width="1908" alt="crash_xinput_2" src="https://github.com/llvm/llvm-project/assets/37383324/6aa3e9d4-aa76-448d-9698-95c3425b5ed9"> How could that be? Was it a memory corruption? But the HANDLE looked legit, so that structure must have been already initialized somewhere else. Maybe the DLL was loaded elsewhere? Let's put a breakpoint in that function, see if it was called before. Indeed: <img width="1901" alt="crash_xinput_3" src="https://github.com/llvm/llvm-project/assets/37383324/9aed535a-85f6-415e-8db8-3b990de3370e"> Looking at the memory window, the location where the HANDLE is hasn't been written yet. Also the callstack shows a different codepath when entering `DllMain`. I was able to confirm that this was happening after the call to `LoadLibrary`, but before the entry point for my DLL was executed. Essentially, at this point the Windows OS loader was initializing the DLL dependency graph. Now if I hit Run again, the program will stop at the same location a second time. That's where the bug happens. Looking at the callstack again, something fishy appears to me: why is the entry point of my DLL calling `XInput1_4.DLL!DllMain()` ? Looking at the assembly: <img width="1898" alt="crash_xinput_4" src="https://github.com/llvm/llvm-project/assets/37383324/d75de63c-a168-4d02-9c5d-c20b8a48611b"> Looks legit. But then, stepping inside: <img width="560" alt="crash_xinput_5" src="https://github.com/llvm/llvm-project/assets/37383324/8a0d5a58-0ccb-4190-9813-06ac457f2f5a"> Wait a second. Why is that an import symbol?? That location is an IAT entry that points to `XInput1_4.DLL!DllMain()`. At that point I was pretty sure that I was experiencing the side-effect of a bug in LLD. I recompiled LLD from source, quickly stepped through it while it was linking my DLL, and then it beome clear what the problem was. ### The bug and a repro Essentially, `XInput1_4.dll` exports its `DllMain` symbol (why?). This seems to be a mistake on Microsoft's end. But luckily that triggers our bug in LLD :smile: `XInput.lib` appears first on the command-line, then `/DEFAULTLIB:msvcrt.lib` where the default (weak) `DllMain` function is. Due to the lazy nature of how the inputs are parsed in LLD, the `XInput1_4.dll!DllMain` symbol is placed first in the symbol table, while scanning all the inputs. Then comes the `msvcrt.lib` `DllMain` symbol which does not overwrite the symbol table entry, since it's already there. Later, when another function pulls on `DllMain`, LLD pulls on the OBJ inside `XInput.lib`, and replaces that symbol table entry with its (XInput import lib) import symbol. The repro is quite straightforward: ``` // b1.cpp extern "C" int f(); int b() { return f(); } // b2.cpp extern "C" int f() { return 0; } // a.cpp extern "C" __declspec(dllexport) int f() { return 1; } // main.cpp extern int b(); int main() { return b(); } rem --- build.bat --- cl /c /Z7 b1.cpp b2.cpp lib /out:b.lib b1.obj b2.obj cl /LD /Z7 a.cpp cl /c /Z7 main.cpp rem set LINKER=c:\git\llvm-project\stage1_msvc_debug\bin\lld-link set LINKER=link %LINKER% /debug /nodefaultlib /out:main.exe main.obj a.lib b.lib /entry:main rem --- end of build.bat --- // After building, with MSVC link.exe the output is: > main.exe > echo %ERRORLEVEL% 0 // With LLD the output is: > main.exe > echo %ERRORLEVEL% 1 ``` MSVC somehow seems to prefer a fully defined symbol rather than an imported symbol. Another possibility is that it tends to favor a symbol inside the same .LIB rather than one defined elsewhere. At this point, if someone from the MSVC team could comment, that would be very helpful. I have a tentative fix, but I'm not sure what the "good" behavior should be. We could categorize this as the "annoying" class of bugs. However given how annoying it was to find, it might be that other folks will see different inceptions of it, so ideally I'd prefer fixing it in LLD. @mstorsjo @rnk @tru @sylvain-audi @compnerd @MaskRay
aganea commented 9 months ago

At this point, if someone from the MSVC team could comment, that would be very helpful. I have a tentative fix, but I'm not sure what the "good" behavior should be.

@TartanLlama Could you possibly help here? Thanks!

TartanLlama commented 9 months ago

@aganea sure, I'll send an email to the linker team

aganea commented 9 months ago

@aganea sure, I'll send an email to the linker team

Thanks a lot!

TartanLlama commented 9 months ago

Pasting a response from Grant Richins on the linker team:

It is not a preference for import/export versus fully defined. It is all about order of discovery and order of inputs (and it is confusingly documented).

When the search begins, there is only one unresolved symbol: “b” referenced in main.obj. Then a.lib is searched (because it is first). Next b.lib is searched and a definition for “b” is found, which causes b.lib(b1.obj) to get loaded and now “f” is added as an unresolved symbol. The search continues with where it left off, and finds “f” in b.lib and then declares complete.

This is documented here: LINK Input Files | Microsoft Learn

Object files on the command line are processed in the order they appear on the command line. Libraries are searched in command line order as well, with the following caveat: Symbols that are unresolved when bringing in an object file from a library are searched for in that library first, and then the following libraries from the command line and /DEFAULTLIB (Specify default library) directives, and then to any libraries at the beginning of the command line.

So basically new unresolved externals do not restart searching from the beginning, they start with wherever the linker is currently searching.

To see this explicitly you can simply you can split b.lib into b1.lib, and b2.lib, and set the order to “main.obj b1.lib a.lib b2.lib” and then the behavior matches.

aganea commented 9 months ago

@TartanLlama Thank you to you both! It'd be really great if you could open-source the link.exe (unit)tests eventually, if that's ever possible!

aganea commented 8 months ago

Hello @mstorsjo! I've crafted a fix for this issue, but it breaks this test. You did introduce that test in: https://github.com/llvm/llvm-project/commit/28212dfce6a245c644bba799092e12f5c3b8a5ac

I have a feeling that your initial issue was caused by the symbol searching behavior described above which previously wasn't matching what MSVC was doing. I can fix that test, unless we should keep that previous behavior for MinGW mode?

mstorsjo commented 8 months ago

Hello @mstorsjo! I've crafted a fix for this issue, but it breaks this test. You did introduce that test in: 28212df

I have a feeling that your initial issue was caused by the symbol searching behavior described above which previously wasn't matching what MSVC was doing. I can fix that test, unless we should keep that previous behavior for MinGW mode?

No, I don't think there's a reason to try to keep/replicate the exact behaviour we had. LLD generally deviates more than this from ld.bfd (which would be the reference for MinGW mode behaviour) overall, so corner case behaviours like this shouldn't really matter, and the exact behaviour we had probably isn't the exact thing we should strive for anyway. I think the previous behaviour was just a product of what the LLD implementation looked like at the time.

If it's no longer possible to trigger this case, I guess it can be reasonable to remove the testcase.

Thanks for looping me in!

YanzheL commented 5 months ago

This issue was also observed when building Unreal Engine 5 game editor using clang-cl and lld-link. The compiled game editor program cannot start due to XInput1_4.dll crash. However, using clang-cl with link.exe is good.

Tested LLVM version: 17.0.6, 18.1.0, 18.1.7, 19.1.0