rr-debugger / rr

Record and Replay Framework
http://rr-project.org/

JIT code seems to interfere with reverse execution. #3461

Open mibu138 opened 1 year ago

mibu138 commented 1 year ago

First of all I just want to thank you all very much for creating this incredible tool. It has completely changed the way I debug and I'm very much in the "can't ever go back" camp. Big big thanks. Now onto the question...

I need to debug an application that uses both OpenGL and Vulkan. Since rr does not appear to play well with GPU graphics/compute, I run the application using Mesa's llvmpipe as a software-based driver. This works well enough in that it makes rr usable, but often when I am running a reverse-continue to the next breakpoint I end up stepping over some driver code and things suddenly get very slow. This slowness is accompanied by printouts like this:

Unable to find JITed code entry at address: 0x7f56abfffba0

By slow I mean the following. I recorded an application running for about one and a half minutes, which ended in a crash. I run rr replay -e to get to the end, and then I set a breakpoint at a method that I know to be in the call stack that caused the crash. This call would have occurred within seconds of the crash. I run reverse-continue. I start seeing printouts like the one above. I get about 50 of those before I see the message

warning: Temporarily disabling breakpoints for unloaded shared library <path to shared library that my breakpoint was in>

Maybe a few minutes later I see this message

Traceback (most recent call last):
  File "<string>", line 550, in render
  File "<string>", line 1178, in lines
gdb.error: Dwarf Error: Can't read DWARF data in section .eh_frame [in module <in-memory>]

And I'm back at the gdb prompt. The whole process took about 40 minutes. I never got to the breakpoint.

So, I'm just really not sure what is going on here, other than that the JIT code being run by the software driver seems to be messing things up. I have been able to successfully debug issues with rr using this software backend at times, but it's hit or miss. I'm not sure, but I recall sometimes being able to step over the JIT stuff and come out the other side, while many times it does seem to take things off the rails.

Any idea what might be going on here or tips for working around this? Most of the time I'm not interested in looking at any of the driver code, and would be happy if that could just be ignored by rr.

System info: rr 5.6.0, gdb 12.1, Linux kernel 6.2.1, CPU Intel i9-9900K (x86_64), Arch Linux, up to date as of about 2 weeks ago.

rocallahan commented 1 year ago

My guess is that llvmpipe is using gdb's "JIT interface" to support debugging of the JITted code and gdb's JITted code support does not work with rr. https://llvm.org/docs/DebuggingJITedCode.html
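For readers unfamiliar with that interface, here is a minimal sketch in C of the JIT-side half, following the declarations given in the GDB manual's "JIT Compilation Interface" chapter. The helper names example_register and example_unregister are hypothetical and exist only for illustration; this is not llvmpipe's or LLVM's actual code, and it makes no claim about what exactly goes wrong under reverse execution. The attribute and inline asm assume GCC or Clang.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Declarations as specified by the GDB manual. */
typedef enum {
  JIT_NOACTION = 0,
  JIT_REGISTER_FN,
  JIT_UNREGISTER_FN
} jit_actions_t;

struct jit_code_entry {
  struct jit_code_entry *next_entry;
  struct jit_code_entry *prev_entry;
  const char *symfile_addr; /* in-memory object file with debug info */
  uint64_t symfile_size;
};

struct jit_descriptor {
  uint32_t version;
  uint32_t action_flag; /* one of jit_actions_t */
  struct jit_code_entry *relevant_entry;
  struct jit_code_entry *first_entry;
};

/* GDB places a breakpoint in this function, so it must not be inlined.
   The empty asm only keeps the compiler from discarding the call. */
void __attribute__((noinline)) __jit_debug_register_code(void) {
  __asm__ volatile("");
}

/* The debugger looks this symbol up by name. The version must be 1. */
struct jit_descriptor __jit_debug_descriptor = {1, JIT_NOACTION, NULL, NULL};

/* Hypothetical helper: announce a freshly emitted object file. */
struct jit_code_entry *example_register(const char *symfile, uint64_t size) {
  struct jit_code_entry *entry = calloc(1, sizeof *entry);
  assert(entry != NULL);
  entry->symfile_addr = symfile;
  entry->symfile_size = size;

  /* Push the entry onto the head of the doubly linked list. */
  entry->next_entry = __jit_debug_descriptor.first_entry;
  if (entry->next_entry != NULL)
    entry->next_entry->prev_entry = entry;
  __jit_debug_descriptor.first_entry = entry;

  /* Point the descriptor at the entry, then trip the breakpoint. */
  __jit_debug_descriptor.relevant_entry = entry;
  __jit_debug_descriptor.action_flag = JIT_REGISTER_FN;
  __jit_debug_register_code();
  return entry;
}

/* Hypothetical helper: announce that JITted code is being discarded. */
void example_unregister(struct jit_code_entry *entry) {
  assert(entry != NULL);

  /* Unlink the entry from the list. */
  if (entry->prev_entry != NULL)
    entry->prev_entry->next_entry = entry->next_entry;
  else
    __jit_debug_descriptor.first_entry = entry->next_entry;
  if (entry->next_entry != NULL)
    entry->next_entry->prev_entry = entry->prev_entry;

  /* The debugger reads relevant_entry when the breakpoint fires,
     so the entry is freed only after the call returns. */
  __jit_debug_descriptor.relevant_entry = entry;
  __jit_debug_descriptor.action_flag = JIT_UNREGISTER_FN;
  __jit_debug_register_code();
  free(entry);
}
```

A real JIT would pass the address and size of an in-memory object file to the register helper. The point of the sketch is that a debugger using this interface has to mirror the entry list as the breakpoint fires, which is the bookkeeping the "JITed code entry" messages refer to.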

You could try the Pernosco fork of gdb, which patches gdb to disable that JIT support: https://github.com/Pernosco/binutils-gdb/commits/pernosco-gdb

vchuravy commented 1 year ago

Interesting. From the Julia side, I often use rr with GDB's JIT interface, and so far I haven't run into issues.

(One of my issues with Pernosco has been that it discards the JIT information.)

mibu138 commented 1 year ago

I very much appreciate the fast response. I did an A/B comparison running the same replay that I described above and using the pernosco-gdb version does seem to fix the issue. So that is awesome.

Are there plans to work with gdb's JITted code support eventually? Or is this more of an issue with GDB or LLVM?

@vchuravy that is odd, since it sounds like it should not work based on @rocallahan's response. I'm guessing Julia is also using LLVM for the JIT code generation?

vchuravy commented 1 year ago

I'm guessing Julia is also using LLVM for the JIT code generation?

Yes. I often record Julia sessions with ENABLE_GDBLISTENER=1 rr record julia, since that allows for symbolization of backtraces through JITed code. I have also used rr to debug the actual JIT (set a watchpoint on a code page and reverse-execute until the JIT emits that code page).

Do you know which LLVM JIT llvmpipe uses: MCJIT, OrcV1, or OrcV2 (with RTDYLD or JITLink)? Julia currently uses OrcV2 with RTDYLD, but we are moving to JITLink.

mibu138 commented 1 year ago

It appears they are using MCJIT, based on some grepping of their source tree.