Open hotsphink opened 7 years ago
Try running with syscallbuf disabled (rr record -n)? Futzing with register state for the signal handler may be possible, but I'm not entirely convinced it's a good idea.
For sanity we'd have to completely unwind all register values. We'd also want to handle changes to the register values. I think it's doable. @Keno, what worries you about it?
I guess there would be some weirdness where according to the signal handler's stack we're doing a plain syscall but in the syscallbuf code we might be doing a different syscall or not in the kernel at all. But that wouldn't break anything that isn't already broken.
I suppose that if a signal handler tries to manipulate registers to alter the syscall restart, it's going to be disappointed and angry, whereas today it just might work.
It might be possible to exit from the syscallbuf code to a trampoline which calls the signal handler with the right register state and, after sigreturn, if it needs to restart the syscall, reenters the syscallbuf again. We'd need to have rr redirect the sigreturn back to the trampoline since we wouldn't want to return to the RIP in the signal frame.
It would be a pain to implement though since I think it would have to complicate replay.
Yes, I'm mostly worried about signal handlers that try to modify register values. There is also a concern about what happens if the signal handler longjmps out of there. Though I guess that doesn't work today either.
If we did what I suggested above, longjmp-out would actually work.
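For readers unfamiliar with the longjmp-out pattern being discussed, here is a minimal sketch of a handler that leaves via siglongjmp instead of returning through sigreturn. It is plain C with no rr involved; the names escape_point, on_usr1 and handler_escaped_via_longjmp are made up for illustration, and SIGUSR1 is an arbitrary choice of signal.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <string.h>

/* Hypothetical names, for illustration only. */
static sigjmp_buf escape_point;

/* The handler never returns normally: it jumps back to the sigsetjmp
 * call site, so the kernel's signal frame is abandoned and sigreturn
 * is never executed for this delivery. */
static void on_usr1(int sig) {
    (void)sig;
    siglongjmp(escape_point, 1);
}

/* Returns 1 if control came back through siglongjmp, 0 if the handler
 * returned the ordinary way. */
static int handler_escaped_via_longjmp(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_usr1;
    sigemptyset(&sa.sa_mask);
    int rc = sigaction(SIGUSR1, &sa, NULL);
    assert(rc == 0);

    /* Second argument 1: save the signal mask so siglongjmp restores it. */
    if (sigsetjmp(escape_point, 1) != 0) {
        return 1; /* arrived here from the handler */
    }
    raise(SIGUSR1); /* the handler runs while we are still inside raise() */
    return 0;       /* not reached when the handler jumps out */
}
```

Calling handler_escaped_via_longjmp() from main should return 1 in an ordinary run. The point for this thread is that whatever code was interrupted (here, libc's raise) never gets to finish, which is the situation the syscallbuf code would be left in if it were the interrupted code.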
It could be kinda nice since it would mean we never have to worry about reentering the syscallbuf. No nested descheds, no worries about running user code on the syscallbuf alt-stack.
However, it would be a large change to a fragile part of the system.
Now that I know what's going on, I can't say this is a huge deal for me. It is rare that I'd want rr and the profiler running at the same time, and disabling the syscallbuf seems like a fine workaround. The main problem is that rr is just too good (too seamless) these days, so a discrepancy like this is unexpected and is therefore harder to diagnose and understand. I just wish it could be more obvious that something is up.
At least for this case, relying on replay to detect this would be fine. If replay knew that it was feeding bogus values to a replayed signal handler, it could... uh, do something. Transmit a warning through the gdb connection or replace them with 0xdeadbeef and then produce a different error message if divergence was detected or... ok, maybe I just jumped the shark.
Heh. I happen to be working on something that requires running with the profiler on, and of course, I had forgotten about this bug. Running rr record -n is fine, but it would be nice if it could somehow tell me that they're incompatible.
I've run into this problem again, but this time -n wasn't good enough. I was trying to run firefox to debug a crash in JS stackwalking, but was getting the wrong crash when running without -n. With -n, it was very slow (expected), and was not able to load pages (unexpected), which was necessary for the reproduction steps. I tried running with -n -h in hopes that chaos mode would schedule things in a way that would allow progress, but no luck.
For https://bugzilla.mozilla.org/show_bug.cgi?id=1322559 I was trying to record a --disable-profiling build of Firefox with the (new) Gecko Profiler enabled ( https://raw.githubusercontent.com/mstange/Gecko-Profiler-Addon/master/gecko_profiler.xpi ). I was seeing a crash in GeckoSampler::doNativeBacktrace, which is actually what I wanted to see and debug, but it appears that it is behaving differently when rr is recording so it isn't the crash I was looking for. (To be clear, this is not a problem of divergence between record and replay; this is the recording affecting the initial run.)
What appears to be happening is that when a SIGPROF signal handler gets invoked, the stack pointer stored in its context argument is the stack pointer for rr's syscall hooking code, which is in a completely different stack from the actual executing program:
doNativeBacktrace grabs a chunk of the stack to memcpy, and ends up biting off more than it can chew -- I mean, access. I guess what I'd like it to do is give the signal handler the register state as of the "call" to _syscall_hook_trampoline?
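To make the report concrete, here is a minimal sketch of how a SIGPROF handler can read the interrupted stack pointer out of its context argument. This is not the Gecko profiler's code; the names context_sp, on_sigprof and sample_sp_is_near_caller_stack are hypothetical, and the sketch assumes Linux with glibc on x86-64 or AArch64.

```c
#define _GNU_SOURCE /* exposes REG_RSP in <sys/ucontext.h> on glibc */
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <ucontext.h>

/* Hypothetical names, for illustration only. */
static volatile uintptr_t sampled_sp;
static volatile sig_atomic_t got_sample;

/* Pull the interrupted stack pointer out of the saved machine context. */
static uintptr_t context_sp(const ucontext_t *uc) {
#if defined(__x86_64__)
    return (uintptr_t)uc->uc_mcontext.gregs[REG_RSP];
#elif defined(__aarch64__)
    return (uintptr_t)uc->uc_mcontext.sp;
#else
#error "this sketch only covers x86-64 and AArch64 Linux"
#endif
}

/* With SA_SIGINFO, the third argument points at a ucontext_t describing
 * the code that was interrupted. */
static void on_sigprof(int sig, siginfo_t *info, void *ctx) {
    (void)sig;
    (void)info;
    sampled_sp = context_sp((const ucontext_t *)ctx);
    got_sample = 1;
}

/* Deliver one SIGPROF to ourselves and report whether the sampled stack
 * pointer lies within `window` bytes of a local variable in this frame.
 * Returns 1 if so, 0 if not, -1 on setup failure. */
static int sample_sp_is_near_caller_stack(size_t window) {
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_sigprof;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGPROF, &sa, NULL) != 0) {
        return -1;
    }

    got_sample = 0;
    raise(SIGPROF); /* handler runs before raise() returns */
    if (!got_sample) {
        return -1;
    }

    volatile char marker; /* its address approximates our own stack position */
    uintptr_t here = (uintptr_t)&marker;
    uintptr_t diff = sampled_sp > here ? sampled_sp - here : here - sampled_sp;
    return diff <= window ? 1 : 0;
}
```

In an ordinary run the sampled value should land close to the caller's own stack, so sample_sp_is_near_caller_stack(64 * 1024) would be expected to return 1. The behavior described above is the case where, while recording with syscall buffering enabled, the handler instead sees a stack pointer belonging to rr's syscall hooking code, so a stack walker that copies memory starting from that address reads the wrong stack.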