rust-lang / rust

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org
Other
99.05k stars 12.8k forks source link

illegal instruction crash in rust 1.78 #126195

Closed amircodota closed 5 months ago

amircodota commented 5 months ago

Since rust 1.78, windows produced binaries have a dependency on the api-ms-win-core-synch-l1-2-0.dll which seems to be missing on some installations of windows 10 and 11.

We've had reports from users of our binary that our latest version fails with error 0xc000001d

veera-sivarajan commented 5 months ago

@rustbot label -needs-triage +O-windows +A-linkage

ChrisDenton commented 5 months ago

I doubt this is a linkage issue as 0xc000001d is STATUS_ILLEGAL_INSTRUCTION. Meaning that an attempt was made to execute an illegal CPU instruction. See NTSTATUS values.

amircodota commented 5 months ago

Hmmm.... could this be that the missing DLL causes this issue? Like trying to jump to address 0x0?

Here's what I know -

  1. Rust 1.78 added the dependency on this DLL. Before 1.78, this dependency did not exist.
  2. On the machines in question, Dependency Walker shows only this specific dll is missing.

I am now build a new binary with rust 1.77.2. Will distribute it and report back if it fixed the issue or not

ChrisDenton commented 5 months ago

api-ms-win-core-synch-l1-2-0.dll is not a real DLL, it's an "api set". Windows will automatically redirect it to the real DLL. I strongly suspect that is not the issue, not least because it's listed by the documentation (e.g. https://learn.microsoft.com/en-us/windows/win32/api/synchapi/nf-synchapi-wakebyaddressall) and used by Synchronization.lib.

If you can use an affected machine directly then you could try using windbg to see what instruction is causing this.

amircodota commented 5 months ago

@ChrisDenton I'm not that familiar with windbg and native code debugging, but if you can provide instructions, I'll see if my user agrees to try it :-)

ChrisDenton commented 5 months ago

Ah, that might be more complicated. Can you get a backtrace from them? And their CPU information, if possible?

If they did want to have a go at windbg, it can be downloaded from here and there's a tutorial. However, it should be relatively simple in this case. You just run windbg and, from the file menu, click "Launch Executable" then find the exe to run. After that you just click the green "Go" button and when the exe crashes it should highlight the offending instruction in yellow.

amircodota commented 5 months ago

A quick update - I heard back from the user and it seems downgrading to rust 1.77.2 fixed the issue. Not sure if it is related to the missing DLL or not, but I'll try to keep digging.

amircodota commented 5 months ago

Here's the output from windbg. Hope it helps narrow it down :-)

bjorn3 commented 5 months ago

ud2 is an instruction that allows a program to intentionally crash. What is the function name of the function at TabNine_fa79b5ff476d1afe+0x95a5a8?

amircodota commented 5 months ago

can you guide me on how to figure this out?

This exe is compiled using x86_64-pc-windows-gnu with profile production

[profile.production]
inherits = "release"
lto = true
amircodota commented 5 months ago

Regarding cpu info - this is the cpu on the machine where it crashed

13th Gen Intel(R) Core(TM) i9-13950HX 2.20 GHz

amircodota commented 5 months ago

@bjorn3 @ChrisDenton I tried getting more info about where the process was failing.

With a debug build the problem did not reproduce. With a release build, but without lto and without strip, the issue still reproduced, but I do not get functions or source locations.

I'm not that familiar with windbg and native code debugging on windows.

If you guys can advise on how to understand which function is failing it would be great (like generating debug symbols during build, that windbg understands).

I'm compiling the exe file with x86_64-pc-windows-gnu, not sure if it matters or not.

bjorn3 commented 5 months ago

Would setting strip = "off" in the production profile and putting the .pdb file with debuginfo next to the exe before running the program work? While source debuginfo is disabled by default in release mode (you can enable it with debug = 1 or debug = 2 for full debuginfo, but it shouldn't be necessary for debugging this.), it should still contains symbols for your program and definitively will contain source debuginfo for the standard library.

amircodota commented 5 months ago

Hmmm....

I'm cross compiling from linux, and not using msvc. I do not think a pdb file is created at all. Right?

Do I have to compile using msvc?

bjorn3 commented 5 months ago

Are you using the x86_64-pc-windows-gnu target? If so you have to use gdb instead of windbg as debugger. MinGW and MSVC use different debuginfo formats with gdb only supporting the MinGW format and windbg and most other Windows debuggers only supporting the MSVC format.

amircodota commented 5 months ago

OK, I'll try to compile with msvc. Hopefully the issue will still reproduce

mati865 commented 5 months ago

As you have already discovered, 0xc000001d is commonly used for assertions. An additional DLL is very unlikely to be related. I would lean toward miscompilation from LLVM 18, which has been used since Rust 1.78, or a linker issue that was uncovered by this change.

Are you able to share the code or the binary with full debuginfo? FWIW recent Windbg versions should be able to read debuginfo from GNU toolchain.

Regarding PDB, I don’t know how well it works, but at least Git for Windows uses cv2pdb on binaries created with GNU tools, so it probably should be enough to load it into WinDbg or Visual Studio. IIRC, LLD can output it without additional tools by only adding a linker argument.

amircodota commented 5 months ago

@mati865 I'm afraid I cannot share the full code.

Good to know I can translate gnu debug info to pdb.

Right now I went the msvc route. I'm waiting to hear from the user if the issue reproduced with an msvc build. If it will not reproduce, will definitely try cv2pdb.

Thanks

amircodota commented 5 months ago

The issue reproduced with msvc compilation.

Here's the windbg output.

If I understand correctly - this is the failing function - https://github.com/tabnine/notify/blob/e63a681be7851e1a9506d533b17af1f374e4fdba/notify/src/windows.rs#L110

This is the stacktrace windbg reports

image

Any suggestions?

bjorn3 commented 5 months ago

Great! That function is still on the large side though. Can you disassemble this function and post the output somewhere? You should be able to do this yourself based on the executable you gave your user. According to stackoverflow loading the executable in windbg and running uf TabNine!ZN6notify7windows26ReadDirectoryChangesServer3run17h0bae0cade33fd636E may work. You don't actually need to hit the crash for this to work.

amircodota commented 5 months ago

Here it is

** WARNING: Unable to verify checksum for TabNine.exe
Flow analysis was incomplete, some code may be missing
TabNine!ZN6notify7windows26ReadDirectoryChangesServer3run17h0bae0cade33fd636E:
00007ff7`d8485460 55              push    rbp
00007ff7`d8485461 4157            push    r15
00007ff7`d8485463 4156            push    r14
00007ff7`d8485465 4155            push    r13
00007ff7`d8485467 4154            push    r12
00007ff7`d8485469 56              push    rsi
00007ff7`d848546a 57              push    rdi
00007ff7`d848546b 53              push    rbx
00007ff7`d848546c 4881ec78020000  sub     rsp,278h
00007ff7`d8485473 488dac2480000000 lea     rbp,[rsp+80h]
00007ff7`d848547b 48c785f0010000feffffff mov qword ptr [rbp+1F0h],0FFFFFFFFFFFFFFFEh
00007ff7`d8485486 488d7120        lea     rsi,[rcx+20h]
00007ff7`d848548a 488d4140        lea     rax,[rcx+40h]
00007ff7`d848548e 488985f8000000  mov     qword ptr [rbp+0F8h],rax
00007ff7`d8485495 488d4110        lea     rax,[rcx+10h]
00007ff7`d8485499 48894538        mov     qword ptr [rbp+38h],rax
00007ff7`d848549d 48898dc8010000  mov     qword ptr [rbp+1C8h],rcx
00007ff7`d84854a4 488d4160        lea     rax,[rcx+60h]
00007ff7`d84854a8 48894530        mov     qword ptr [rbp+30h],rax
00007ff7`d84854ac 488d5dc0        lea     rbx,[rbp-40h]
00007ff7`d84854b0 4c8b2db10cce00  mov     r13,qword ptr [TabNine!_imp_WaitForSingleObjectEx (00007ff7`d9166168)]
00007ff7`d84854b7 4c8da570010000  lea     r12,[rbp+170h]
00007ff7`d84854be 6690            nop

TabNine!ZN6notify7windows26ReadDirectoryChangesServer3run17h0bae0cade33fd636E+0x60:
00007ff7`d84854c0 4889d9          mov     rcx,rbx
00007ff7`d84854c3 4889f2          mov     rdx,rsi
00007ff7`d84854c6 e8b5e3ffff      call    TabNine!ZN17crossbeam_channel7channel17Receiver$LT$T$GT$8try_recv17h3d50c73a3221f865E (00007ff7`d8483880)
00007ff7`d84854cb 0fb64dc0        movzx   ecx,byte ptr [rbp-40h]
00007ff7`d84854cf 83f904          cmp     ecx,4
00007ff7`d84854d2 756e            jne     TabNine!ZN6notify7windows26ReadDirectoryChangesServer3run17h0bae0cade33fd636E+0xe2 (00007ff7`d8485542)  Branch

TabNine!ZN6notify7windows26ReadDirectoryChangesServer3run17h0bae0cade33fd636E+0x74:
00007ff7`d84854d4 488b85c8010000  mov     rax,qword ptr [rbp+1C8h]
00007ff7`d84854db 488b4870        mov     rcx,qword ptr [rax+70h]
00007ff7`d84854df ba64000000      mov     edx,64h
00007ff7`d84854e4 41b801000000    mov     r8d,1
00007ff7`d84854ea 41ffd5          call    r13
00007ff7`d84854ed 85c0            test    eax,eax
00007ff7`d84854ef 75cf            jne     TabNine!ZN6notify7windows26ReadDirectoryChangesServer3run17h0bae0cade33fd636E+0x60 (00007ff7`d84854c0)  Branch

TabNine!ZN6notify7windows26ReadDirectoryChangesServer3run17h0bae0cade33fd636E+0x91:
00007ff7`d84854f1 488b8dc8010000  mov     rcx,qword ptr [rbp+1C8h]
00007ff7`d84854f8 488b01          mov     rax,qword ptr [rcx]
00007ff7`d84854fb 488b4908        mov     rcx,qword ptr [rcx+8]
00007ff7`d84854ff 4883f802        cmp     rax,2
00007ff7`d8485503 7414            je      TabNine!ZN6notify7windows26ReadDirectoryChangesServer3run17h0bae0cade33fd636E+0xb9 (00007ff7`d8485519)  Branch

TabNine!ZN6notify7windows26ReadDirectoryChangesServer3run17h0bae0cade33fd636E+0xa5:
00007ff7`d8485505 83f801          cmp     eax,1
00007ff7`d8485508 751e            jne     TabNine!ZN6notify7windows26ReadDirectoryChangesServer3run17h0bae0cade33fd636E+0xc8 (00007ff7`d8485528)  Branch

TabNine!ZN6notify7windows26ReadDirectoryChangesServer3run17h0bae0cade33fd636E+0xaa:
00007ff7`d848550a b201            mov     dl,1
00007ff7`d848550c 41b900ca9a3b    mov     r9d,3B9ACA00h
00007ff7`d8485512 e8d9680000      call    TabNine!ZN17crossbeam_channel7flavors4list16Channel$LT$T$GT$4send17hec544859517ad0d7E (00007ff7`d848bdf0)
00007ff7`d8485517 eb1c            jmp     TabNine!ZN6notify7windows26ReadDirectoryChangesServer3run17h0bae0cade33fd636E+0xd5 (00007ff7`d8485535)  Branch

TabNine!ZN6notify7windows26ReadDirectoryChangesServer3run17h0bae0cade33fd636E+0xb9:
00007ff7`d8485519 b201            mov     dl,1
00007ff7`d848551b 41b900ca9a3b    mov     r9d,3B9ACA00h
00007ff7`d8485521 e8caaf0000      call    TabNine!ZN17crossbeam_channel7flavors4zero16Channel$LT$T$GT$4send17h6c8e04f60051ca98E (00007ff7`d84904f0)
00007ff7`d8485526 eb0d            jmp     TabNine!ZN6notify7windows26ReadDirectoryChangesServer3run17h0bae0cade33fd636E+0xd5 (00007ff7`d8485535)  Branch

TabNine!ZN6notify7windows26ReadDirectoryChangesServer3run17h0bae0cade33fd636E+0xc8:
00007ff7`d8485528 b201            mov     dl,1
00007ff7`d848552a 41b900ca9a3b    mov     r9d,3B9ACA00h
00007ff7`d8485530 e80b920000      call    TabNine!ZN17crossbeam_channel7flavors5array16Channel$LT$T$GT$4send17hc01c72f9d61d6d4bE (00007ff7`d848e740)

TabNine!ZN6notify7windows26ReadDirectoryChangesServer3run17h0bae0cade33fd636E+0xd5:
00007ff7`d8485535 3c02            cmp     al,2
00007ff7`d8485537 7487            je      TabNine!ZN6notify7windows26ReadDirectoryChangesServer3run17h0bae0cade33fd636E+0x60 (00007ff7`d84854c0)  Branch

TabNine!ZN6notify7windows26ReadDirectoryChangesServer3run17h0bae0cade33fd636E+0xd9:
00007ff7`d8485539 2401            and     al,1
00007ff7`d848553b 7583            jne     TabNine!ZN6notify7windows26ReadDirectoryChangesServer3run17h0bae0cade33fd636E+0x60 (00007ff7`d84854c0)  Branch

TabNine!ZN6notify7windows26ReadDirectoryChangesServer3run17h0bae0cade33fd636E+0xdd:
00007ff7`d848553d e9580e0000      jmp     TabNine!ZN6notify7windows26ReadDirectoryChangesServer3run17h0bae0cade33fd636E+0xf3a (00007ff7`d848639a)  Branch

TabNine!ZN6notify7windows26ReadDirectoryChangesServer3run17h0bae0cade33fd636E+0xe2:
00007ff7`d8485542 440fb67dc1      movzx   r15d,byte ptr [rbp-3Fh]
00007ff7`d8485547 488d45c2        lea     rax,[rbp-3Eh]
00007ff7`d848554b f30f6f00        movdqu  xmm0,xmmword ptr [rax]
00007ff7`d848554f 0f10480e        movups  xmm1,xmmword ptr [rax+0Eh]
00007ff7`d8485553 660f7f45f0      movdqa  xmmword ptr [rbp-10h],xmm0
00007ff7`d8485558 0f114dfe        movups  xmmword ptr [rbp-2],xmm1
00007ff7`d848555c 488b45e0        mov     rax,qword ptr [rbp-20h]
00007ff7`d8485560 89c9            mov     ecx,ecx
00007ff7`d8485562 488d158f843401  lea     rdx,[TabNine!anon.00b2cd3685792a701dc9b37e05a736fa.15.llvm.6672299153682802553+0x538 (00007ff7`d97cd9f8)]
00007ff7`d8485569 48630c8a        movsxd  rcx,dword ptr [rdx+rcx*4]
00007ff7`d848556d 4801d1          add     rcx,rdx
00007ff7`d8485570 ffe1            jmp     rcx

TabNine!ZN6notify7windows26ReadDirectoryChangesServer3run17h0bae0cade33fd636E+0xf3a:
00007ff7`d848639a 488d0d876e3401  lea     rcx,[TabNine!anon.00b2cd3685792a701dc9b37e05a736fa.0.llvm.6672299153682802553 (00007ff7`d97cd228)]
00007ff7`d84863a1 4c8d05186f3401  lea     r8,[TabNine!anon.00b2cd3685792a701dc9b37e05a736fa.2.llvm.6672299153682802553 (00007ff7`d97cd2c0)]
00007ff7`d84863a8 ba28000000      mov     edx,28h
00007ff7`d84863ad e8deedcd00      call    TabNine!core::panicking::panic (00007ff7`d9165190)
bjorn3 commented 5 months ago

That prints only part of the function and misses the part where it crashes by a big margin.

amircodota commented 5 months ago

Is there a way for me to print out the entire function?

amircodota commented 5 months ago

I did u TabNine!ZN6notify7windows26ReadDirectoryChangesServer3run17h0bae0cade33fd636E L1000

Here is the output. Let me know if that helps:

https://gist.github.com/amircodota/9b1c59c0c0770e2c2da1bf43071a9aaf

amircodota commented 5 months ago

For reference, this is the same function in rust 1.77.2

https://gist.github.com/amircodota/17a5c098e571d29d290793b86ab752a8

bjorn3 commented 5 months ago

The crashing instruction is 00007ff7`a1326d9e 0f0b ud2 which is jumped to from 8 different locations. 3 of which are after handle_alloc_error (should be unreachable), 2 after raw_vec::capacity_overflow (should be unreachable) and finally one after Arc::drop_slow. This is a function which should return normally, so having ud2 after that is definitively a bad sign. Going a bit back a bit shows that the start of the basic block is at 00007ff7`a1326d50, which is jumped to from 00007ff7`a1326d18 7436 je [...], almost immediately before that is the call to ReadDirectoryChangesW, which is at https://github.com/tabnine/notify/blob/e63a681be7851e1a9506d533b17af1f374e4fdba/notify/src/windows.rs#L290-L299 in the original source. A bit after that is a transmute which caught my eye. Upon closer inspection it turns out to transmute from isize to Box<ReadDirectoryRequest>. While this itself is technically allowed, the resulting pointer inside the Box<ReadDirectoryRequest> has no provenance as integers do not carry provenance and transmutes preserve the provenance of the input. As the pointer has no provenance, any attempt to dereference the box will be UB. While previously LLVM wasn't told that this is not allowed, since rust 1.78 we do actually tell LLVM about this: https://github.com/rust-lang/rust/pull/121282

Try replacing https://github.com/tabnine/notify/blob/e63a681be7851e1a9506d533b17af1f374e4fdba/notify/src/windows.rs#L307 with let request: Box<ReadDirectoryRequest> = mem::transmute(request_p as *mut ReadDirectoryRequest); which does a int2ptr cast to recover the provenance "stashed away" by the previous Box::into_raw(request) as isize ptr2int cast. Or alternatively you can avoid the transmute using let request = Box::from_raw(request_p as *mut ReadDirectoryRequest);. Ideally you did even avoid the as isize right after calling Box::into_raw() and do the cast only when assigning overlapped.hEvent so you can avoid the int2ptr cast when calling Box::from_raw().

As for why you can't locally reproduce it, this code seems to run only when ReadDirectoryChangesW returns an error. It may be the case that you never get an error locally.

tl;dr: There is UB in the notify crate. It may still be the case that fixing this UB doesn't fix the observed behavior though.

amircodota commented 5 months ago

Thank you very much for the thorough explanation.

I'll try it out and let you know

amircodota commented 5 months ago

A quick update - I could not get the original user to try it out, but I was able to reproduce locally with some effort, and I verified that the suggested fix indeed solves the "illegal instruction" crash.

If anyone's interested - this is the fix

Closing this issue.

Really appreciate all the help I got from everyone :-)

0xpr03 commented 5 months ago

I'll add some noise to say: Thanks for debugging this mess, especially to bjorn3.