llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
28.43k stars 11.75k forks source link

lld-link /delayload - first call of a function with bad floating point parameter on x64 #51941

Open aTom3333 opened 2 years ago

aTom3333 commented 2 years ago
Bugzilla Link 52599
Version 13.0
OS Windows NT
Attachments Source files to reproduce the bug

Extended Description

When linking a program with a DLL using the /delayload switch, the first call to a function defined in the DLL will get bad value for (at least one of) the floating point parameters.

Attached are 2 sources file my_lib.cpp and my_exe.cpp to reproduce the bug. They should be built as folow:

When running my_exe.exe, the output will be "1 0 3" instead of the expected "1 2 3".

The last step can be replaced with "C:\Program Files (x86)\Microsoft Visual Studio\2019\Professional\VC\Tools\MSVC\14.28.29910\bin\Hostx64\x64\link.exe" my_dll.lib Delayimp.lib /delayload:my_dll.dll my_exe.obj /OUT:my_exe.exe to use link.exe with the same options or with "C:\Program Files\LLVM\bin\lld-link.exe" my_dll.lib my_exe.obj /OUT:my_exe.exe to use lld with /delayload. In both of those cases the resulting executable will give the expected "1 2 3".

I believe the bug occurs because __delayLoadHelper2 (the function defined in delayimp.lib that actually loads the DLL and locate the function we want to call during the first usage) writes into the top of the stack space of its caller (I don't know why, is it a weird Windows caling convention?) but the thunk generated by lld doesn't that space.

Specifically, the thunk generated by lld (for x64) looks like this: push rcx
push rdx
push r8
push r9
sub rsp,48h
movdqa xmmword ptr [rsp],xmm0
movdqa xmmword ptr [rsp+10h],xmm1
movdqa xmmword ptr [rsp+20h],xmm2
movdqa xmmword ptr [rsp+30h],xmm3
mov rdx,rax
lea rcx,[__xt_z+28h (01401C9E88h)]
call __delayLoadHelper2 (01401A3464h)
movdqa xmm0,xmmword ptr [rsp]
movdqa xmm1,xmmword ptr [rsp+10h]
movdqa xmm2,xmmword ptr [rsp+20h]
movdqa xmm3,xmmword ptr [rsp+30h]
add rsp,48h
pop r9
pop r8
pop rdx
pop rcx
jmp rax

(it allocates space on the stack and uses it to save the register prior to calling __delayLoadHelper2 and restore them later)

Whereas the thunk generated by link.exe looked like that: mov qword ptr [rsp+8],rcx
mov qword ptr [rsp+10h],rdx
mov qword ptr [rsp+18h],r8
mov qword ptr [rsp+20h],r9
sub rsp,68h
movdqa xmmword ptr [rsp+20h],xmm0
movdqa xmmword ptr [rsp+30h],xmm1
movdqa xmmword ptr [rsp+40h],xmm2
movdqa xmmword ptr [rsp+50h],xmm3
mov rdx,rax
lea rcx,[DELAY_IMPORT_DESCRIPTOR_my_dll (0140435020h)]
call
delayLoadHelper2 (01400089C2h)
movdqa xmm0,xmmword ptr [rsp+20h]
movdqa xmm1,xmmword ptr [rsp+30h]
movdqa xmm2,xmmword ptr [rsp+40h]
movdqa xmm3,xmmword ptr [rsp+50h]
mov rcx,qword ptr [rsp+70h]
mov rdx,qword ptr [rsp+78h]
mov r8,qword ptr [rsp+80h]
mov r9,qword ptr [rsp+88h]
add rsp,68h
jmp __tailMerge_my_dll+77h (01402237B8h)
jmp rax

It looks very similar but, for some reason, it doesn't save the xmmX register on the top of the stack like lld, it leave 32 bytes that __delayLoadHelper2 is free to mess with.

Indeed, (at least on my machine), the first 2 instruction of __delayLoadHelper2 are: mov qword ptr [rsp+10h],rbx
mov qword ptr [rsp+18h],rsi

which, if I'm not mistaken are writting into the stack space where xmm0 and xmm1 were saved.

DeChambord commented 1 year ago

Hi, we are having the exact same problem with Visual Studio 2019, Windows SDK 10.0.19041.0, using lld-link as the linker. Any news on this one? I've noticed the assembly code that is used to wrap the delayload library is hard-coded in https://github.com/llvm/llvm-project/blob/main/lld/COFF/DLL.cpp#L198 . Is it assuming a different calling convention from __delayLoadHelper2, or is it a bug in the code?

Anyway, for our internal use, we are considering replacing it with something like follows, so it matches the link.exe behavior, and avoids the xmm overwrite. It seem to fix the corruption problem :

static const uint8_t tailMergeX64[] = {
    0x48, 0x89, 0x4C, 0x24, 0x08,               // mov         qword ptr [rsp+8],rcx  
    0x48, 0x89, 0x54, 0x24, 0x10,               // mov         qword ptr [rsp+10h],rdx  
    0x4C, 0x89, 0x44, 0x24, 0x18,               // mov         qword ptr [rsp+18h],r8  
    0x4C, 0x89, 0x4C, 0x24, 0x20,               // mov         qword ptr [rsp+20h],r9  
    0x48, 0x83, 0xEC, 0x68,                     // sub         rsp,68h  
    0x66, 0x0F, 0x7F, 0x44, 0x24, 0x20,         // movdqa      xmmword ptr [rsp+20h],xmm0
    0x66, 0x0F, 0x7F, 0x4C, 0x24, 0x30,         // movdqa      xmmword ptr [rsp+30h],xmm1
    0x66, 0x0F, 0x7F, 0x54, 0x24, 0x40,         // movdqa      xmmword ptr [rsp+40h],xmm2
    0x66, 0x0F, 0x7F, 0x5C, 0x24, 0x50,         // movdqa      xmmword ptr [rsp+50h],xmm3
    0x48, 0x8B, 0xD0,                           // mov          rdx, rax
    0x48, 0x8D, 0x0D, 0, 0, 0, 0,               // lea          rcx, [___DELAY_IMPORT_...]
    0xE8, 0, 0, 0, 0,                           // call         __delayLoadHelper2
    0x66, 0x0F, 0x6F, 0x44, 0x24, 0x20,         // movdqa      xmm0,xmmword ptr [rsp+20h]  
    0x66, 0x0F, 0x6F, 0x4C, 0x24, 0x30,         // movdqa      xmm1,xmmword ptr [rsp+30h]  
    0x66, 0x0F, 0x6F, 0x54, 0x24, 0x40,         // movdqa      xmm2,xmmword ptr [rsp+40h]  
    0x66, 0x0F, 0x6F, 0x5C, 0x24, 0x50,         // movdqa      xmm3,xmmword ptr [rsp+50h]  
    0x48, 0x8B, 0x4C, 0x24, 0x70,               // mov         rcx,qword ptr [rsp+70h]  
    0x48, 0x8B, 0x54, 0x24, 0x78,               // mov         rdx,qword ptr [rsp+78h]  
    0x4C, 0x8B, 0x84, 0x24, 0x80, 00, 00, 00,   // mov         r8,qword ptr [rsp+80h]  
    0x4C, 0x8B, 0x8C, 0x24, 0x88, 00, 00, 00,   // mov         r9,qword ptr [rsp+88h]  
    0x48, 0x83, 0xC4, 0x68,                     // add         rsp,68h
    0xFF, 0xE0,                                 // jmp         rax
};

as well as replacing the hardcoded offsets in TailMergeChunkX64::writeTo :

    write32le(buf + 54, desc->getRVA() - rva - 58);
    write32le(buf + 59, helper->getRVA() - rva - 63);

Edit : After a bit of research (I'm not very familiar with playing with ABI in the first place ^^" ), it seems that this 32 byte space corresponds to the "Shadow Space" of Microsoft x64 calling convention, and it seem the code I've sampled as a workaround is using it as well (as opposed to pushing or popping the arguments like the original code). So I'm not sure how generic it is, a more minimal fix would be to keep the pushing and popping of arguments...

santagada commented 1 year ago

Bumping this as it seems like a pretty high priority bug to fix as it seems like an ABI violation, so just want to bring it to the attention of the lld maintainers.

KevinEady commented 6 months ago

[Please see the above-mentioned issue for further discussions]

Chiming in on this one.

After several days of debugging an unexpectedly behaving program, @martenrichter and I encountered this exact same issue, as seen with this test source has a a failed GitHub Action. (Note the signature for this function: napi_status napi_create_double(napi_env env, double value, napi_value* result))

This issue affects Node native addons written in C++ using cmake-js and the ClangCL toolset, as cmake-js uses delay load imports to fix support with Electron (see PR for details, as I'm not personally familiar with this).

vladmikhalin commented 2 months ago

Can confirm, trying to build a native electron module with clang-cl results in unpredictable floating point values passed to node api. A fix would be very welcome.