Closed amircodota closed 5 months ago
@rustbot label -needs-triage +O-windows +A-linkage
I doubt this is a linkage issue as 0xc000001d
is STATUS_ILLEGAL_INSTRUCTION
. Meaning that an attempt was made to execute an illegal CPU instruction. See NTSTATUS values.
Hmmm.... could this be that the missing DLL causes this issue? Like trying to jump to address 0x0?
Here's what I know -
I am now build a new binary with rust 1.77.2. Will distribute it and report back if it fixed the issue or not
api-ms-win-core-synch-l1-2-0.dll
is not a real DLL, it's an "api set". Windows will automatically redirect it to the real DLL. I strongly suspect that is not the issue, not least because it's listed by the documentation (e.g. https://learn.microsoft.com/en-us/windows/win32/api/synchapi/nf-synchapi-wakebyaddressall) and used by Synchronization.lib
.
If you can use an affected machine directly then you could try using windbg to see what instruction is causing this.
@ChrisDenton I'm not that familiar with windbg and native code debugging, but if you can provide instructions, I'll see if my user agrees to try it :-)
Ah, that might be more complicated. Can you get a backtrace from them? And their CPU information, if possible?
If they did want to have a go at windbg, it can be downloaded from here and there's a tutorial. However, it should be relatively simple in this case. You just run windbg and, from the file menu, click "Launch Executable" then find the exe to run. After that you just click the green "Go" button and when the exe crashes it should highlight the offending instruction in yellow.
A quick update - I heard back from the user and it seems downgrading to rust 1.77.2 fixed the issue. Not sure if it is related to the missing DLL or not, but I'll try to keep digging.
Here's the output from windbg. Hope it helps narrow it down :-)
ud2
is an instruction that allows a program to intentionally crash. What is the function name of the function at TabNine_fa79b5ff476d1afe+0x95a5a8
?
can you guide me on how to figure this out?
This exe is compiled using x86_64-pc-windows-gnu with profile production
[profile.production]
inherits = "release"
lto = true
Regarding cpu info - this is the cpu on the machine where it crashed
13th Gen Intel(R) Core(TM) i9-13950HX 2.20 GHz
@bjorn3 @ChrisDenton I tried getting more info about where the process was failing.
With a debug build the problem did not reproduce. With a release build, but without lto and without strip, the issue still reproduced, but I do not get functions or source locations.
I'm not that familiar with windbg and native code debugging on windows.
If you guys can advise on how to understand which function is failing it would be great (like generating debug symbols during build, that windbg understands).
I'm compiling the exe file with x86_64-pc-windows-gnu, not sure if it matters or not.
Would setting strip = "off"
in the production profile and putting the .pdb file with debuginfo next to the exe before running the program work? While source debuginfo is disabled by default in release mode (you can enable it with debug = 1
or debug = 2
for full debuginfo, but it shouldn't be necessary for debugging this.), it should still contains symbols for your program and definitively will contain source debuginfo for the standard library.
Hmmm....
I'm cross compiling from linux, and not using msvc. I do not think a pdb file is created at all. Right?
Do I have to compile using msvc?
Are you using the x86_64-pc-windows-gnu target? If so you have to use gdb instead of windbg as debugger. MinGW and MSVC use different debuginfo formats with gdb only supporting the MinGW format and windbg and most other Windows debuggers only supporting the MSVC format.
OK, I'll try to compile with msvc. Hopefully the issue will still reproduce
As you have already discovered, 0xc000001d
is commonly used for assertions.
An additional DLL is very unlikely to be related. I would lean toward miscompilation from LLVM 18, which has been used since Rust 1.78, or a linker issue that was uncovered by this change.
Are you able to share the code or the binary with full debuginfo? FWIW recent Windbg versions should be able to read debuginfo from GNU toolchain.
Regarding PDB, I don’t know how well it works, but at least Git for Windows uses cv2pdb on binaries created with GNU tools, so it probably should be enough to load it into WinDbg or Visual Studio. IIRC, LLD can output it without additional tools by only adding a linker argument.
@mati865 I'm afraid I cannot share the full code.
Good to know I can translate gnu debug info to pdb.
Right now I went the msvc route. I'm waiting to hear from the user if the issue reproduced with an msvc build. If it will not reproduce, will definitely try cv2pdb.
Thanks
The issue reproduced with msvc compilation.
Here's the windbg output.
If I understand correctly - this is the failing function - https://github.com/tabnine/notify/blob/e63a681be7851e1a9506d533b17af1f374e4fdba/notify/src/windows.rs#L110
This is the stacktrace windbg reports
Any suggestions?
Great! That function is still on the large side though. Can you disassemble this function and post the output somewhere? You should be able to do this yourself based on the executable you gave your user. According to stackoverflow loading the executable in windbg and running uf TabNine!ZN6notify7windows26ReadDirectoryChangesServer3run17h0bae0cade33fd636E
may work. You don't actually need to hit the crash for this to work.
Here it is
** WARNING: Unable to verify checksum for TabNine.exe
Flow analysis was incomplete, some code may be missing
TabNine!ZN6notify7windows26ReadDirectoryChangesServer3run17h0bae0cade33fd636E:
00007ff7`d8485460 55 push rbp
00007ff7`d8485461 4157 push r15
00007ff7`d8485463 4156 push r14
00007ff7`d8485465 4155 push r13
00007ff7`d8485467 4154 push r12
00007ff7`d8485469 56 push rsi
00007ff7`d848546a 57 push rdi
00007ff7`d848546b 53 push rbx
00007ff7`d848546c 4881ec78020000 sub rsp,278h
00007ff7`d8485473 488dac2480000000 lea rbp,[rsp+80h]
00007ff7`d848547b 48c785f0010000feffffff mov qword ptr [rbp+1F0h],0FFFFFFFFFFFFFFFEh
00007ff7`d8485486 488d7120 lea rsi,[rcx+20h]
00007ff7`d848548a 488d4140 lea rax,[rcx+40h]
00007ff7`d848548e 488985f8000000 mov qword ptr [rbp+0F8h],rax
00007ff7`d8485495 488d4110 lea rax,[rcx+10h]
00007ff7`d8485499 48894538 mov qword ptr [rbp+38h],rax
00007ff7`d848549d 48898dc8010000 mov qword ptr [rbp+1C8h],rcx
00007ff7`d84854a4 488d4160 lea rax,[rcx+60h]
00007ff7`d84854a8 48894530 mov qword ptr [rbp+30h],rax
00007ff7`d84854ac 488d5dc0 lea rbx,[rbp-40h]
00007ff7`d84854b0 4c8b2db10cce00 mov r13,qword ptr [TabNine!_imp_WaitForSingleObjectEx (00007ff7`d9166168)]
00007ff7`d84854b7 4c8da570010000 lea r12,[rbp+170h]
00007ff7`d84854be 6690 nop
TabNine!ZN6notify7windows26ReadDirectoryChangesServer3run17h0bae0cade33fd636E+0x60:
00007ff7`d84854c0 4889d9 mov rcx,rbx
00007ff7`d84854c3 4889f2 mov rdx,rsi
00007ff7`d84854c6 e8b5e3ffff call TabNine!ZN17crossbeam_channel7channel17Receiver$LT$T$GT$8try_recv17h3d50c73a3221f865E (00007ff7`d8483880)
00007ff7`d84854cb 0fb64dc0 movzx ecx,byte ptr [rbp-40h]
00007ff7`d84854cf 83f904 cmp ecx,4
00007ff7`d84854d2 756e jne TabNine!ZN6notify7windows26ReadDirectoryChangesServer3run17h0bae0cade33fd636E+0xe2 (00007ff7`d8485542) Branch
TabNine!ZN6notify7windows26ReadDirectoryChangesServer3run17h0bae0cade33fd636E+0x74:
00007ff7`d84854d4 488b85c8010000 mov rax,qword ptr [rbp+1C8h]
00007ff7`d84854db 488b4870 mov rcx,qword ptr [rax+70h]
00007ff7`d84854df ba64000000 mov edx,64h
00007ff7`d84854e4 41b801000000 mov r8d,1
00007ff7`d84854ea 41ffd5 call r13
00007ff7`d84854ed 85c0 test eax,eax
00007ff7`d84854ef 75cf jne TabNine!ZN6notify7windows26ReadDirectoryChangesServer3run17h0bae0cade33fd636E+0x60 (00007ff7`d84854c0) Branch
TabNine!ZN6notify7windows26ReadDirectoryChangesServer3run17h0bae0cade33fd636E+0x91:
00007ff7`d84854f1 488b8dc8010000 mov rcx,qword ptr [rbp+1C8h]
00007ff7`d84854f8 488b01 mov rax,qword ptr [rcx]
00007ff7`d84854fb 488b4908 mov rcx,qword ptr [rcx+8]
00007ff7`d84854ff 4883f802 cmp rax,2
00007ff7`d8485503 7414 je TabNine!ZN6notify7windows26ReadDirectoryChangesServer3run17h0bae0cade33fd636E+0xb9 (00007ff7`d8485519) Branch
TabNine!ZN6notify7windows26ReadDirectoryChangesServer3run17h0bae0cade33fd636E+0xa5:
00007ff7`d8485505 83f801 cmp eax,1
00007ff7`d8485508 751e jne TabNine!ZN6notify7windows26ReadDirectoryChangesServer3run17h0bae0cade33fd636E+0xc8 (00007ff7`d8485528) Branch
TabNine!ZN6notify7windows26ReadDirectoryChangesServer3run17h0bae0cade33fd636E+0xaa:
00007ff7`d848550a b201 mov dl,1
00007ff7`d848550c 41b900ca9a3b mov r9d,3B9ACA00h
00007ff7`d8485512 e8d9680000 call TabNine!ZN17crossbeam_channel7flavors4list16Channel$LT$T$GT$4send17hec544859517ad0d7E (00007ff7`d848bdf0)
00007ff7`d8485517 eb1c jmp TabNine!ZN6notify7windows26ReadDirectoryChangesServer3run17h0bae0cade33fd636E+0xd5 (00007ff7`d8485535) Branch
TabNine!ZN6notify7windows26ReadDirectoryChangesServer3run17h0bae0cade33fd636E+0xb9:
00007ff7`d8485519 b201 mov dl,1
00007ff7`d848551b 41b900ca9a3b mov r9d,3B9ACA00h
00007ff7`d8485521 e8caaf0000 call TabNine!ZN17crossbeam_channel7flavors4zero16Channel$LT$T$GT$4send17h6c8e04f60051ca98E (00007ff7`d84904f0)
00007ff7`d8485526 eb0d jmp TabNine!ZN6notify7windows26ReadDirectoryChangesServer3run17h0bae0cade33fd636E+0xd5 (00007ff7`d8485535) Branch
TabNine!ZN6notify7windows26ReadDirectoryChangesServer3run17h0bae0cade33fd636E+0xc8:
00007ff7`d8485528 b201 mov dl,1
00007ff7`d848552a 41b900ca9a3b mov r9d,3B9ACA00h
00007ff7`d8485530 e80b920000 call TabNine!ZN17crossbeam_channel7flavors5array16Channel$LT$T$GT$4send17hc01c72f9d61d6d4bE (00007ff7`d848e740)
TabNine!ZN6notify7windows26ReadDirectoryChangesServer3run17h0bae0cade33fd636E+0xd5:
00007ff7`d8485535 3c02 cmp al,2
00007ff7`d8485537 7487 je TabNine!ZN6notify7windows26ReadDirectoryChangesServer3run17h0bae0cade33fd636E+0x60 (00007ff7`d84854c0) Branch
TabNine!ZN6notify7windows26ReadDirectoryChangesServer3run17h0bae0cade33fd636E+0xd9:
00007ff7`d8485539 2401 and al,1
00007ff7`d848553b 7583 jne TabNine!ZN6notify7windows26ReadDirectoryChangesServer3run17h0bae0cade33fd636E+0x60 (00007ff7`d84854c0) Branch
TabNine!ZN6notify7windows26ReadDirectoryChangesServer3run17h0bae0cade33fd636E+0xdd:
00007ff7`d848553d e9580e0000 jmp TabNine!ZN6notify7windows26ReadDirectoryChangesServer3run17h0bae0cade33fd636E+0xf3a (00007ff7`d848639a) Branch
TabNine!ZN6notify7windows26ReadDirectoryChangesServer3run17h0bae0cade33fd636E+0xe2:
00007ff7`d8485542 440fb67dc1 movzx r15d,byte ptr [rbp-3Fh]
00007ff7`d8485547 488d45c2 lea rax,[rbp-3Eh]
00007ff7`d848554b f30f6f00 movdqu xmm0,xmmword ptr [rax]
00007ff7`d848554f 0f10480e movups xmm1,xmmword ptr [rax+0Eh]
00007ff7`d8485553 660f7f45f0 movdqa xmmword ptr [rbp-10h],xmm0
00007ff7`d8485558 0f114dfe movups xmmword ptr [rbp-2],xmm1
00007ff7`d848555c 488b45e0 mov rax,qword ptr [rbp-20h]
00007ff7`d8485560 89c9 mov ecx,ecx
00007ff7`d8485562 488d158f843401 lea rdx,[TabNine!anon.00b2cd3685792a701dc9b37e05a736fa.15.llvm.6672299153682802553+0x538 (00007ff7`d97cd9f8)]
00007ff7`d8485569 48630c8a movsxd rcx,dword ptr [rdx+rcx*4]
00007ff7`d848556d 4801d1 add rcx,rdx
00007ff7`d8485570 ffe1 jmp rcx
TabNine!ZN6notify7windows26ReadDirectoryChangesServer3run17h0bae0cade33fd636E+0xf3a:
00007ff7`d848639a 488d0d876e3401 lea rcx,[TabNine!anon.00b2cd3685792a701dc9b37e05a736fa.0.llvm.6672299153682802553 (00007ff7`d97cd228)]
00007ff7`d84863a1 4c8d05186f3401 lea r8,[TabNine!anon.00b2cd3685792a701dc9b37e05a736fa.2.llvm.6672299153682802553 (00007ff7`d97cd2c0)]
00007ff7`d84863a8 ba28000000 mov edx,28h
00007ff7`d84863ad e8deedcd00 call TabNine!core::panicking::panic (00007ff7`d9165190)
That prints only part of the function and misses the part where it crashes by a big margin.
Is there a way for me to print out the entire function?
I did u TabNine!ZN6notify7windows26ReadDirectoryChangesServer3run17h0bae0cade33fd636E L1000
Here is the output. Let me know if that helps:
https://gist.github.com/amircodota/9b1c59c0c0770e2c2da1bf43071a9aaf
For reference, this is the same function in rust 1.77.2
https://gist.github.com/amircodota/17a5c098e571d29d290793b86ab752a8
The crashing instruction is 00007ff7`a1326d9e 0f0b ud2
which is jumped to from 8 different locations. 3 of which are after handle_alloc_error
(should be unreachable), 2 after raw_vec::capacity_overflow
(should be unreachable) and finally one after Arc::drop_slow
. This is a function which should return normally, so having ud2
after that is definitively a bad sign. Going a bit back a bit shows that the start of the basic block is at 00007ff7`a1326d50
, which is jumped to from 00007ff7`a1326d18 7436 je [...]
, almost immediately before that is the call to ReadDirectoryChangesW
, which is at https://github.com/tabnine/notify/blob/e63a681be7851e1a9506d533b17af1f374e4fdba/notify/src/windows.rs#L290-L299 in the original source. A bit after that is a transmute
which caught my eye. Upon closer inspection it turns out to transmute from isize
to Box<ReadDirectoryRequest>
. While this itself is technically allowed, the resulting pointer inside the Box<ReadDirectoryRequest>
has no provenance as integers do not carry provenance and transmutes preserve the provenance of the input. As the pointer has no provenance, any attempt to dereference the box will be UB. While previously LLVM wasn't told that this is not allowed, since rust 1.78 we do actually tell LLVM about this: https://github.com/rust-lang/rust/pull/121282
Try replacing https://github.com/tabnine/notify/blob/e63a681be7851e1a9506d533b17af1f374e4fdba/notify/src/windows.rs#L307 with let request: Box<ReadDirectoryRequest> = mem::transmute(request_p as *mut ReadDirectoryRequest);
which does a int2ptr cast to recover the provenance "stashed away" by the previous Box::into_raw(request) as isize
ptr2int cast. Or alternatively you can avoid the transmute using let request = Box::from_raw(request_p as *mut ReadDirectoryRequest);
. Ideally you did even avoid the as isize
right after calling Box::into_raw()
and do the cast only when assigning overlapped.hEvent
so you can avoid the int2ptr cast when calling Box::from_raw()
.
As for why you can't locally reproduce it, this code seems to run only when ReadDirectoryChangesW
returns an error. It may be the case that you never get an error locally.
tl;dr: There is UB in the notify crate. It may still be the case that fixing this UB doesn't fix the observed behavior though.
Thank you very much for the thorough explanation.
I'll try it out and let you know
A quick update - I could not get the original user to try it out, but I was able to reproduce locally with some effort, and I verified that the suggested fix indeed solves the "illegal instruction" crash.
If anyone's interested - this is the fix
Closing this issue.
Really appreciate all the help I got from everyone :-)
I'll add some noise to say: Thanks for debugging this mess, especially to bjorn3.
Since rust 1.78, windows produced binaries have a dependency on the
api-ms-win-core-synch-l1-2-0.dll
which seems to be missing on some installations of windows 10 and 11.We've had reports from users of our binary that our latest version fails with error
0xc000001d