rr-debugger / rr

Record and Replay Framework
http://rr-project.org/

Assertion `nwritten == buf_size' assertion failed #2163

Open mgaudet opened 6 years ago

mgaudet commented 6 years ago

Using a VMware guest (configured as suggested) on an OS X host, I have been seeing this assertion fail pretty regularly, using 3a9e68ce2d4c688fe4c22c7a73ea9368fe09fcd7

[FATAL /home/mgaudet/rr/src/Task.cc:2236:write_bytes_helper() errno: EIO] 
 (task 95064 (rec:94946) at time 832)
 -> Assertion `nwritten == buf_size' failed to hold. Should have written 1 bytes to 0x3ad9a8252b80, but only wrote -1
[FATAL /home/mgaudet/rr/src/Task.cc:2236:write_bytes_helper() errno: EIO] 
 (task 95064 (rec:94946) at time 832)
 -> Assertion `nwritten == buf_size' failed to hold. Should have written 1 bytes to 0x3ad9a8252b80, but only wrote -1
[FATAL /home/mgaudet/rr/src/Task.cc:2236:write_bytes_helper() errno: EIO] 
 (task 95064 (rec:94946) at time 832)
 -> Assertion `nwritten == buf_size' failed to hold. Should have written 1 bytes to 0x3ad9a8252b80, but only wrote -1

(repeats a couple thousand times unfortunately)

Unfortunately, I don't have a reliable set of steps to reproduce on a generic program, though I have found that when I do encounter this issue, I encounter it again when repeating the same steps.

One thing I can point out that seems to be pretty common: When I see this it's often because I have set a breakpoint on JIT compiled code; in the above message, 0x3ad9a8252b80 is a code pointer where I set a breakpoint just prior.

Unlike in #2161 I don't get an RR backtrace, and the child seems to be dead immediately, so I've been unable to follow similar debugging steps.

I have run rr pack and archived the directory, and can pass it on in case that's desired.

(Honestly, the biggest bother so far about this bug has been the incredibly large spew of the same assertion failure)

mgaudet commented 6 years ago

One other thing I should mention: Running the test suite, all the tests passed, with the exception of the HLE tests.

khuey commented 6 years ago

Hmm. Are you attempting to execute forwards or backwards after setting the breakpoint? Can you rr dump -p <trace-path> 1-832 and post that output somewhere?

mgaudet commented 6 years ago

This would be executing forwards

events dumped as requested

It's possible that I hit the end of the recording here, though if so, I don't think that's the common case.

rocallahan commented 6 years ago

> When I see this it's often because I have set a breakpoint on JIT compiled code

My guess is that we're trying to set a software breakpoint in memory that is unmapped, probably because you're reverse-executing and we execute through a region of time before the memory was mapped.

If you use hardware breakpoints (hbreak) instead of regular breakpoints does the problem go away?

mgaudet commented 6 years ago

hbreak does seem to do it! I was experiencing this when executing forwards as well, but I am guessing it was crashing when the code pages were unmapped as the process shut down (looking at the recording, I can now see I never hit this breakpoint, so execution just ran through to the end of the recording).

Thanks!

rocallahan commented 6 years ago

Let's leave this open because rr should probably handle this in some reasonable way.

glandium commented 5 years ago

Something similar happened to me, but I can't work around it with hbreak because it happens with the implicit breakpoints set by reverse-* commands.

rocallahan commented 5 years ago

I don't want to fix this myself because I need to focus on work that I might eventually get paid for. But if someone else fixes it (with a test!), I'll gladly merge their PR.

I think it would be pretty easy to fix this in AddressSpace: silently ignore failure to place a software breakpoint, and whenever a new mapping is created, reapply all software breakpoints (or better still, just the ones that overlap the new mapping).
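
For anyone picking this up, the idea above can be sketched as a toy model. This is not rr's actual AddressSpace code: `ToyAddressSpace` and all of its members are hypothetical names invented for illustration, and memory is simulated with in-process byte vectors instead of writes to a tracee. It only shows the bookkeeping: a breakpoint that cannot be written is remembered as pending instead of asserting, and each new mapping re-applies only the pending breakpoints that fall inside it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <set>
#include <vector>

// Toy model only: none of these names are rr's real AddressSpace API.
class ToyAddressSpace {
public:
  static constexpr uint8_t kBreakpointInsn = 0xCC; // x86 int3

  // Create a mapping [start, start + size) and re-apply the pending
  // breakpoints that overlap it.
  void map(uintptr_t start, size_t size) {
    unmap(start); // replacing a mapping first retires its breakpoints
    mappings_[start] = std::vector<uint8_t>(size, 0x90); // fill with nops
    for (auto it = pending_.begin(); it != pending_.end();) {
      if (*it >= start && *it - start < size && place(*it)) {
        it = pending_.erase(it);
      } else {
        ++it;
      }
    }
  }

  // Remove the mapping that begins at `start`. Breakpoints inside it are
  // not forgotten; they go back to the pending set.
  void unmap(uintptr_t start) {
    auto m = mappings_.find(start);
    if (m == mappings_.end()) {
      return;
    }
    const size_t size = m->second.size();
    for (auto it = saved_.begin(); it != saved_.end();) {
      if (it->first >= start && it->first - start < size) {
        pending_.insert(it->first);
        it = saved_.erase(it);
      } else {
        ++it;
      }
    }
    mappings_.erase(m);
  }

  // Request a software breakpoint. Failure to write is silently tolerated:
  // the address is recorded as pending instead of triggering an assertion.
  void add_breakpoint(uintptr_t addr) {
    if (!place(addr)) {
      pending_.insert(addr);
    }
  }

  bool is_installed(uintptr_t addr) const { return saved_.count(addr) != 0; }
  bool is_pending(uintptr_t addr) const { return pending_.count(addr) != 0; }

  // Read one simulated byte, or -1 if the address is unmapped.
  int read(uintptr_t addr) {
    uint8_t* p = byte_at(addr);
    return p ? *p : -1;
  }

private:
  // Find the simulated byte backing `addr`, or nullptr if unmapped.
  uint8_t* byte_at(uintptr_t addr) {
    auto it = mappings_.upper_bound(addr);
    if (it == mappings_.begin()) {
      return nullptr;
    }
    --it;
    if (addr - it->first >= it->second.size()) {
      return nullptr;
    }
    return &it->second[addr - it->first];
  }

  // Try to write the breakpoint instruction, saving the original byte.
  bool place(uintptr_t addr) {
    uint8_t* p = byte_at(addr);
    if (!p) {
      return false; // unmapped: the caller decides what to do
    }
    if (saved_.count(addr) == 0) {
      saved_[addr] = *p;
      *p = kBreakpointInsn;
    }
    return true;
  }

  std::map<uintptr_t, std::vector<uint8_t>> mappings_; // start -> bytes
  std::map<uintptr_t, uint8_t> saved_;                 // installed bps
  std::set<uintptr_t> pending_;                        // not yet writable
};
```

In this model, calling `add_breakpoint(0x1000)` before `map(0x1000, 0x100)` leaves the breakpoint pending, and the later `map` call installs it. A real fix would also have to cover partial unmaps, remaps, and breakpoint removal, which the sketch leaves out.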