rr-debugger / rr

Record and Replay Framework
http://rr-project.org/
Other
9.1k stars 580 forks source link

Recording Firefox with Gecko Profiler causes crash #1930

Open hotsphink opened 7 years ago

hotsphink commented 7 years ago

For https://bugzilla.mozilla.org/show_bug.cgi?id=1322559 I was trying to record a --disable-profiling build of Firefox with the (new) Gecko Profiler enabled ( https://raw.githubusercontent.com/mstange/Gecko-Profiler-Addon/master/gecko_profiler.xpi ). I was seeing a crash in GeckoSampler::doNativeBacktrace, which is actually what I wanted to see and debug, but it appears that it is behaving differently when rr is recording so it isn't the crash I was looking for. (To be clear, this is not a problem of divergence between record and replay; this is the recording affecting the initial run.)

What appears to be happening is that when a SIGPROF signal handler gets invoked, the stack pointer stored in its context argument is the stack pointer for rr's syscall hooking code, which is in a completely different stack from the actual executing program:

(rr) bt 14
#0  0x00007f9ea85830f5 in GeckoSampler::doNativeBacktrace(ThreadProfile&, TickSample*) (this=this@entry=0x7f9e6fca3ca0, aProfile=..., aSample=aSample@entry=0x681ff6d8) at /home/sfink/src/mozilla/tools/profiler/core/GeckoSampler.cpp:1139
#1  0x00007f9ea8583694 in GeckoSampler::InplaceTick(TickSample*) (this=0x7f9e6fca3ca0, sample=0x681ff6d8)
    at /home/sfink/src/mozilla/tools/profiler/core/GeckoSampler.cpp:1219
#2  0x00007f9ea857f209 in (anonymous namespace)::ProfilerSignalHandler(int, siginfo_t*, void*) (signal=<optimized out>, info=<optimized out>, context=0x681ff740) at /home/sfink/src/mozilla/tools/profiler/core/platform-linux.cc:252
#3  0x00007f9eb6c09c30 in <signal handler called> () at /lib64/libpthread.so.0
#4  0x000000007000000e in  ()
#5  0x00007f9eb700e233 in _raw_syscall () at /home/sfink/src/rr/src/preload/raw_syscall.S:120
#6  0x00007f9eb700b082 in untraced_syscall_base (syscallno=syscallno@entry=7, a0=a0@entry=140319650562601, a1=a1@entry=1, a2=a2@entry=-1, a3=a3@entry=0, a4=a4@entry=0, a5=0, syscall_instruction=0x7000000c) at /home/sfink/src/rr/src/preload/preload.c:334
#7  0x00007f9eb700c362 in syscall_hook (call=0x681fffa0) at /home/sfink/src/rr/src/preload/preload.c:1753
#8  0x00007f9eb700c362 in syscall_hook (call=0x681fffa0) at /home/sfink/src/rr/src/preload/preload.c:2402
#9  0x00007f9eb700c362 in syscall_hook (call=0x681fffa0) at /home/sfink/src/rr/src/preload/preload.c:2456
#10 0x00007f9eb700e26a in _syscall_hook_trampoline () at /home/sfink/src/rr/src/preload/syscall_hook.S:216
#11 0x00007f9eb700e293 in __morestack () at /home/sfink/src/rr/src/preload/syscall_hook.S:348
#12 0x00007f9eb700e2cd in _syscall_hook_trampoline_48_8b_3c_24 () at /home/sfink/src/rr/src/preload/syscall_hook.S:364
#13 0x00007f9eb5e82571 in poll () at /lib64/libc.so.6
#14 0x00007f9eaff93f80 in _xcb_conn_wait () at /lib64/libxcb.so.1

(rr) f 4
#4  0x000000007000000e in ?? ()
(rr) p/x $rsp
$14 = 0x681ffdf0
(rr) f 12
#12 0x00007f9eb700e2cd in _syscall_hook_trampoline_48_8b_3c_24 () at /home/sfink/src/rr/src/preload/syscall_hook.S:364
(rr) p/x $rsp
$17 = 0x681ffff0
(rr) f 13
#13 0x00007f9eb5e82571 in poll () from /lib64/libc.so.6
(rr) p/x $rsp
$18 = 0x7ffe68e53a90

doNativeBacktrace grabs a chunk of the stack to memcpy, and ends up biting off more than it can chew -- I mean, access. I guess what I'd like it to do is give the signal handler the register state as of the "call" to _syscall_hook_trampoline?

Keno commented 7 years ago

Try running with syscallbuf disabled rr record -n? Futzing with register state for the signal handler may be possible, but I'm not entirely convinced it's a good idea.

rocallahan commented 7 years ago

For sanity we'd have to completely unwind all register values. We'd also want to handle changes to the register values. I think it's doable. @Keno, what worries you about it?

rocallahan commented 7 years ago

I guess there would be some weirdness where according to the signal handler's stack we're doing a plain syscall but in the syscallbuf code we might be doing a different syscall or not in the kernel at all. But that wouldn't break anything that isn't already broken.

rocallahan commented 7 years ago

I suppose that if a signal handler tries to manipulate registers to alter the syscall restart, it's going to be disappointed and angry, whereas today it just might work.

rocallahan commented 7 years ago

It might be possible to exit from the syscallbuf code to a trampoline which calls the signal handler with the right register state, and after sigreturn if it needs to restart the syscall, reenters the syscallbuf again. We'd need to have rr redirect the sigreturn back to the trampoline since we wouldn't want to return to the RIP in the signal frame.

rocallahan commented 7 years ago

It would be a pain to implement though since I think it would have to complicate replay.

Keno commented 7 years ago

Yes, I'm mostly worried about signal handlers that try to modify register values. There is also a concern about what happens if the signal handler longjmps out of there. Though I guess that doesn't work today either.

rocallahan commented 7 years ago

If we did what I suggested above, longjmp-out would actually work.

rocallahan commented 7 years ago

It could be kinda nice since it would mean we never have to worry about reentering the syscallbuf. No nested descheds, no worries about running user code on the syscallbuf alt-stack.

However, it would be a large change to a fragile part of the system.

hotsphink commented 7 years ago

Now that I know what's going on, I can't say this is a huge deal for me. It is rare that I'd want rr and the profiler running at the same time, and disabling the syscallbuf seems like a fine workaround. The main problem is that rr is just too good (too seamless) these days, so a discrepancy like this is unexpected and is therefore harder to diagnose and understand. I just wish it could be more obvious that something is up.

At least for this case, relying on replay to detect this would be fine. If replay knew that it was feeding bogus values to a replayed signal handler, it could... uh, do something. Transmit a warning through the gdb connection or replace them with 0xdeadbeef and then produce a different error message if divergence was detected or... ok, maybe I just jumped the shark.

hotsphink commented 7 years ago

Heh. I happen to be working on something that requires running with the profiler on, and of course, I had forgotten about this bug. Running rr record -n is fine, if but it would be nice if it could somehow tell me that they're incompatible.

hotsphink commented 7 months ago

I've run into this problem again, but this time -n wasn't good enough. I was trying to run firefox to debug a crash in JS stackwalking, but was getting the wrong crash when running without -n. With -n, it was very slow (expected), and was not able to load pages (unexpected), which was necessary for the reproduction steps. I tried running with -n -h in hopes that chaos mode would schedule things in a way that would allow progress, but no luck.