rr-debugger / rr

Record and Replay Framework
http://rr-project.org/
Other
9.18k stars 585 forks source link

crash when using reverse-cont in Firefox replay session #1727

Open mjfroman opened 8 years ago

mjfroman commented 8 years ago

Debugging Firefox, I can repeatedly get the following assert/crash: [FATAL /home/mfroman/rr/rr/src/Task.cc:2021:write_bytes_helper() errno: EIO](task 19627 %28rec:19195%29 at time 820829) -> Assertion `nwritten == buf_size' failed to hold. Should have written 614400 bytes to 0x7f30c1e68000, but only wrote -1

Linux 4.2.0-36-generic #42~14.04.1-Ubuntu SMP Fri May 13 17:27:22 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux I'm on the lastest git commit for rr (3710623f9fbec963a8f8096092de1240f82a0091). Debugging on mozilla-central changeset: d98f20c25fee ./mach run --debugger=rr https://web.ciscospark.com I start a call on spark (currently debugging a video freeze), and then quit after I see the freeze. rr replay b rtcp_utility.cc:1516 b rtcp_receiver.cc:1107 c c reverse-cont boom!

Let me know if you need additional info, or if I can clarify anything.

rocallahan commented 8 years ago

I don't have anyone to call via Spark so I can't easily reproduce this.

Attaching gdb to the rr process and getting a stack trace for the assertion would be useful.

There should be an event number in the rr output; you can run rr dump -b -m -p and paste the contents around the given event number.

A log of the rr session with RR_LOG=ReplayTimeline,ReplaySession might also be useful.

sidkshatriya commented 8 years ago

I'm getting this error too at src/Task.cc:1911:write_bytes_helper() with errno: EIO

Assertion `nwritten == buf_size' failed to hold.
 Should have written 1 bytes to 0x7f31885a9ad7, but only wrote -1

This problem is occurring at 4.3.0 and some more recent versions like 3db52cd59

Like the poster of this bug, this only occurs when running rr in reverse mode.

I'm using rr to replay the php 7.0 executable. When I build php myself (and run rr in reverse mode) this error does not occur. When I use the php 7.0 cli packaged with ubuntu 16.04 however, I get a ton of these messages whenever I reverse...

rocallahan commented 8 years ago

This problem is occurring at 4.3.0 and some more recent versions like 3db52cd

Please try it on master. Event that revision is over a month old.

Can you describe simple steps to reproduce on Ubuntu?

sidkshatriya commented 8 years ago

FWIW I tried this on master (0b7360ab6) right now and I get the same error. Unfortunately I'm not able give you a simple test case you can reproduce.

In the spirit of https://github.com/mozilla/rr/issues/1803#issuecomment-247237824 are there some tests I could try running in reverse and check if I run across this same error? Of course these are precisely the kind of tests that might trigger this assert when run in reverse... you would have a better idea of which ones I could try...

rocallahan commented 8 years ago

Try running the entire testsuite. It includes many reverse-execution tests.

sidkshatriya commented 8 years ago

On master (0b7360a) and tag 4.3.0 I have 100% test suite pass (Ubuntu 16.04). So, it seems that the test suite may not be capturing this issue.

This executable is compiled with the following flags:

CFLAGS=-g -O2 -fPIE -fstack-protector-strong -Wformat -Werror=format-security -O2 -Wall -pedantic -fsigned-char -fno-strict-aliasing -g

Shot in the dark: Might any of these flags (e.g. PIE) be the source of problems during reverse execution?

rocallahan commented 8 years ago

Wouldn't think so. If anything, -fstack-protector-strong is the most likely to be the problem.

When you get this error, what does /proc//maps show for the tid and address reported by rr?

sidkshatriya commented 8 years ago

I compiled the executable myself and -fstack-protector-strong does not seem to be causing any problems during reverse execution.

The PHP 7.0 executable packaged with ubuntu 16.04 does continue to give me these errors.

I'm not able to see /proc/maps as this is a fatal error and the proc is gone by the time I inspect /proc/maps. Any way to get around this?

However, here is the map, before I do reverse execution (which is certain to crash the executable) I don't know if this will be useful.

-> Assertion `nwritten == buf_size' failed to hold. Should have written 1 bytes to 0x7f023d1c0040, but only wrote -1
[FATAL rr/rr2/src/Task.cc:1913:write_bytes_helper() errno: EIO] 
 (task 70956 (rec:69472) at time 994)
68000000-68200000 rwxp 00000000 00:00 0 
70000000-70001000 r-xp 00000000 00:2e 2518673                            rr/4.3.0/bin/rr_page_64_replay
558061111000-5580614bc000 r-xp 00000000 08:01 270052                     /usr/bin/php7.0
5580616bc000-558061745000 rw-p 003ab000 08:01 270052                     /usr/bin/php7.0
558061745000-558061761000 rw-p 00000000 00:00 0 
7f0241d43000-7f0241d69000 r-xp 00000000 08:01 683144                     /lib/x86_64-linux-gnu/ld-2.23.so
7f0241f68000-7f0241f6a000 rw-p 00025000 08:01 683144                     /lib/x86_64-linux-gnu/ld-2.23.so
7f0241f6a000-7f0241f6b000 rw-p 00000000 00:00 0 
7ffdebc76000-7ffdebc99000 rw-p 00000000 00:00 0 
7ffdebd5c000-7ffdebd5e000 r-xp 00000000 00:00 0 
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

P.S. You will notice that I'm using 4.3.0 during reverse execution. This is deliberate. I run into #1803 if I do reverse execution on master (so I didn't try it).

rocallahan commented 8 years ago

I'm not able to see /proc/maps as this is a fatal error and the proc is gone by the time I inspect /proc/maps. Any way to get around this?

If your stdout is going to the console, rr should stop and print instructions for emergency debugging of the tracee. At that point the tracee process is still alive and you should be able to dump its maps. If your stdout cannot go to the console for some reason, you can use rr -F to force rr to treat the session as interactive and get the emergency debugging prompt.

Attaching the emergency debugger as instructed and getting its stack trace might be helpful. Also, when rr is stopped there, please attach to rr itself (not the tracee) with gdb and get a stack trace of rr.

rocallahan commented 8 years ago

Also you should switch back to master now that #1803 is fixed.

sidkshatriya commented 8 years ago

After spending a fair amount of time on this I was able to find how to make these errors go away. Perhaps it will help you home onto the problem in rr?

I'm using the distro packaged binaries of PHP 7.0 on 16.04. PHP allows various "zend extensions" to be loaded. I'm loading one such extension called Xdebug. But there is a twist: I need to patch Xdebug with some of my modifications. When I compile Xdebug myself and ask PHP to load the zend extension in php configuration:

zend_extension=/full/path/to/where-I-built/patched/xdebug.so

Then rr works properly in forward mode. The moment I switch to reverse mode I get the fatal error we have been discussing on this ticket.

Now ordinarily, the PHP on 16.04 expects the zend_extensions / modules to be located in /usr/lib/php/.... So, if simply copy over my patched xdebug to /usr/lib/php/... and change the PHP configuration to:

zend_extension=xdebug.so

(Notice that there is no path now. PHP loads xdebug from where it expects to find the extensions/modules). Then things work properly in forward and reverse mode.

So somehow when rr runs in reverse mode it seems to have problems with the explicit path to xdebug.so being given in the PHP configuration. The funny thing is that Xdebug loads perfectly properly in forward mode (with the explicit specified absolute path). Its only in reverse mode when I see this problem.

Might this be related to what the ticket creator might be facing? There might be some shared objects on firefox that are not being identified correctly in reverse mode?

Regarding the emergency debugging session details on rr, I am not able to see them (maybe there are so many of these assert errors that my terminal does show it but it gets scrolled out -- I'll have to check that out again). BTW the error is reported as a segmentation fault and "remote connection is closed".

I thought I'd mention these findings to you before I try to locate the emergency debugging session again you talk about in your last response (have not been successful so far).

rocallahan commented 8 years ago

Try using rr -F to force the emergency debugger.

Maybe rr is pulling in the wrong xdebug.so. During replay, if you check /proc//maps for the process, what lines does it have for xdebug.so?

sidkshatriya commented 8 years ago

Checking /proc/maps to see xdebug.so was a great suggestion. It turns out that the wrong version of xdebug was being loaded by php due to an additional command line option. The version differences were only manifesting themselves during reverse mode (which is a very sensitive operation I can imagine).

TL;DR this is not a problem with rr but was manifesting itself only in reverse mode.

Thank you for spending your precious neural cycles on this :-) !

rocallahan commented 8 years ago

It turns out that the wrong version of xdebug was being loaded by php due to an additional command line option.

That sounds like an rr bug. rr should ensure that replay of PHP sees the same command line options and loads the same libraries. More details would be useful.