Open aganea opened 9 months ago
@llvm/issue-subscribers-lld-coff
Author: Alexandre Ganea (aganea)
At this point, if someone from the MSVC team could comment, that would be very helpful. I have a tentative fix, but I'm not sure what the "good" behavior should be.
@TartanLlama Could you possibly help here? Thanks!
@aganea sure, I'll send an email to the linker team
@aganea sure, I'll send an email to the linker team
Thanks a lot!
Pasting a response from Grant Richins on the linker team:
It is not a preference for import/export versus fully defined. It is all about order of discovery and order of inputs (and it is confusingly documented).
When the search begins, there is only one unresolved symbol: “b” referenced in main.obj. Then a.lib is searched (because it is first). Next b.lib is searched and a definition for “b” is found, which causes b.lib(b1.obj) to get loaded and now “f” is added as an unresolved symbol. The search continues with where it left off, and finds “f” in b.lib and then declares complete.
This is documented here: LINK Input Files | Microsoft Learn
Object files on the command line are processed in the order they appear on the command line. Libraries are searched in command line order as well, with the following caveat: Symbols that are unresolved when bringing in an object file from a library are searched for in that library first, and then the following libraries from the command line and /DEFAULTLIB (Specify default library) directives, and then to any libraries at the beginning of the command line.
So basically new unresolved externals do not restart searching from the beginning, they start with wherever the linker is currently searching.
To see this explicitly you can simply you can split b.lib into b1.lib, and b2.lib, and set the order to “main.obj b1.lib a.lib b2.lib” and then the behavior matches.
@TartanLlama Thank you to you both!
It'd be really great if you could open-source the link.exe
(unit)tests eventually, if that's ever possible!
Hello @mstorsjo! I've crafted a fix for this issue, but it breaks this test. You did introduce that test in: https://github.com/llvm/llvm-project/commit/28212dfce6a245c644bba799092e12f5c3b8a5ac
I have a feeling that your initial issue was caused by the symbol searching behavior described above which previously wasn't matching what MSVC was doing. I can fix that test, unless we should keep that previous behavior for MinGW mode?
Hello @mstorsjo! I've crafted a fix for this issue, but it breaks this test. You did introduce that test in: 28212df
I have a feeling that your initial issue was caused by the symbol searching behavior described above which previously wasn't matching what MSVC was doing. I can fix that test, unless we should keep that previous behavior for MinGW mode?
No, I don't think there's a reason to try to keep/replicate the exact behaviour we had. LLD generally deviates more than this from ld.bfd (which would be the reference for MinGW mode behaviour) overall, so corner case behaviours like this shouldn't really matter, and the exact behaviour we had probably isn't the exact thing we should strive for anyway. I think the previous behaviour was just a product of what the LLD implementation looked like at the time.
If it's no longer possible to trigger this case, I guess it can be reasonable to remove the testcase.
Thanks for looping me in!
This issue was also observed when building Unreal Engine 5 game editor using clang-cl and lld-link. The compiled game editor program cannot start due to XInput1_4.dll crash. However, using clang-cl with link.exe is good.
Tested LLVM version: 17.0.6, 18.1.0, 18.1.7, 19.1.0
I found an interesting issue in LLD that I was chasing down for more than a year. The problem revolves around
XInput1_4.dll
. I wanted to share how I got there, since others might see this problem before we land a fix.First signs
At first, we wanted to use
rust-lld.exe
along with our Rust monorepo. This worked locally......but not remotely. Our local machines were on Windows 10 while the remote build system was on Windows Server 2019 cloud VMs. We got super-weird errors, and only on certain Rust crates. For example this one:
Another time, it was on the
sqlx-macros
crate. Note its dependencies:Since our monorepo uses workspaces, feature unification would kick in. We also happened to be experimenting with Bevy, so the the
winapi
crate would be compiled with all the features used in the workspace, such as"xinput"
even thoughsqlx-macros
didn't really need it here.Strangely the build only failed when running headless, through a SSH session. When a user was connected on the VM, the build would succeed. We managed to capture minidumps of the failure, and inspection was showing an error deep down some core Microsoft libraries. At first we suspected an ABI incompatibility on Windows Server 2019. However it seemed that it had to do with
XINPUT1_4.dll
. Essentially the DLL load graph wasrust.exe
->sqlx_macros-XXX.dll
->XINPUT1_4.DLL
->inputhost.dll
->CoreMessaging.dll
then a call toRaiseFailFastException()
.I left this aside for a while, since build times weren't that much of an issue. We were planning on upgrading the VMs to Windows Server 2022 later that year, so I postponed the
rust-lld
update.Again
At some point I wanted to upgrade our C++ codebase to using
clang-cl
andlld-link
. Things worked locally, although our unit tests failed on the preflight remotely on the VMs. They were failing on startup withcode exit status 0xc0000409
. At first I thought this was a crash in our code caused by UB withclang-cl
(this happened to me at Ubisoft during our firstclang-cl
integration). Strangely I wasn't getting this error locally. I got taken with other things, so I decided to look at this later on.We ended up upgrading our VMs to Windows Server 2022, and during my spare time between other things, I tried preflighting my
clang-cl
upgrade again. Same thing, same error. I was quite puzzled, and I was starting to rule out the ABI incompatibility I though I had with Rust at first. It was inconceivable that the same failure would be carried on several CRT updates and Windows OS upgrades.Yet, again.
Months later (now) I got a new Windows 11 install. I wanted to test LLVM 18.0 and un-shelved again my
clang-cl
upgrade for our C++ codebase. After fixing a number of compilation errors, on the first boot our program complained about not findingXInput1_3.dll
. Installing that inC:\Windows\System
fixed that particular issue. Now it was crashing onXInput1_4.dll
:I did not immediately make the connection with the past issues I had with Rust compilation. There was something strange about this callstack but I couldn't put my finger on it. While looking for clues on the Internet, I stomped on this thread. I immediately made the connection with the Rust build issue I had a year ago! However there wasn't really a resolution, just a suggestion to replace
XInput1_4.dll
withXInput1_3.dll
.What.
Do what??
So yeah. Duplicating
XInput1_3.dll
and renaming it toXInput1_4.dll
fixed the issue. :unamused: No more crashes.I just couldn't leave it to that, I decided to dig a bit more to understand the problem. Putting back
XInput1_4.dll
as it was, restarting and stepping through the asm code manually reveled that it had something to do with calls that XInput was doing to the Windows Event Tracing API. It failed because one of the HANDLEs in the structure used to initialize theevntprov.h
API was already initialized (see the highlighted value in the Memory window).How could that be? Was it a memory corruption? But the HANDLE looked legit, so that structure must have been already initialized somewhere else. Maybe the DLL was loaded elsewhere? Let's put a breakpoint in that function, see if it was called before. Indeed:
Looking at the memory window, the location where the HANDLE is hasn't been written yet. Also the callstack shows a different codepath when entering
DllMain
. I was able to confirm that this was happening after the call toLoadLibrary
, but before the entry point of my DLL was executed. Essentially, at this point the Windows OS loader was initializing the DLL dependency graph.Now if I hit Run again, the program will stop at the same location a second time. That's where the bug happens. Looking at the callstack again, something fishy appears to me: why is the entry point of my DLL calling
XInput1_4.DLL!DllMain()
? Looking at the assembly:Looks legit. But then, stepping inside:
Wait a second. Why is that an import symbol??
That location is an IAT entry that points to
XInput1_4.DLL!DllMain()
. At that point I was pretty sure that I was experiencing the side-effect of a bug in LLD. I recompiled LLD from source, quickly stepped through it while it was linking my DLL, and then it become clear what the problem was.The bug and a repro
Essentially,
XInput1_4.dll
exports itsDllMain
symbol (why?). This seems to be a mistake on Microsoft's end. But luckily that triggers our bug in LLD :smile:XInput.lib
appears first on the command-line, then/DEFAULTLIB:msvcrt.lib
where the default (weak)DllMain
function is. Due to the lazy nature of how the inputs are parsed in LLD, theXInput1_4.dll!DllMain
symbol is placed first in the symbol table, while scanning all the inputs. Then comes themsvcrt.lib
DllMain
symbol which does not overwrite the symbol table entry, since it's already there. Later, when another function pulls onDllMain
, LLD pulls on the OBJ insideXInput.lib
, and replaces that symbol table entry with its (XInput import lib) import symbol.The repro is quite straightforward:
MSVC somehow seems to prefer a fully defined symbol rather than an imported symbol. Another possibility is that it tends to favor a symbol inside the same .LIB rather than one defined elsewhere. At this point, if someone from the MSVC team could comment, that would be very helpful. I have a tentative fix, but I'm not sure what the "good" behavior should be.
We could categorize this as the "annoying" class of bugs with possibly a workaround. However given how annoying it was to find, it might be that other folks will see different inceptions of it, so ideally I'd prefer fixing it in LLD.
@mstorsjo @rnk @tru @sylvain-audi @compnerd @MaskRay