joneschrisg commented 11 years ago

Somewhat like #416, except that

Only reproduces with syscall buffering enabled
Only reproduces 1/10 or 1/20 runs or so
Reproduces at what looks like fairly arbitrary points in execution, as opposed to relatively early in startup
Overshoot is by a widely variable amount; I've seen from ~50 to ~2400 rcbs

Smells to me like a memory divergence, but I don't have any specific guesses yet. Reproduces with

rr -v record -b "../ff-prof/dist/bin/firefox" -no-remote -P garbage -reftest file:///home/cjones/rr/mozilla-central/layout/reftests/transform/reftest.list

but doesn't seem to reproduce with the (quick) network reftests used for #416.

joneschrisg commented 11 years ago

Only reproduces 1/10 or 1/20 runs or so

This is with a timeslice of 2500. With the default timeslice, the overshoots reproduce a bit more easily, for whatever reason.

joneschrisg commented 11 years ago

I don't see any failures, VM or bare metal, on netwerk/test/reftest/reftest.list, with any time-slice I've tested.

On layout/reftests/transform/reftest.list, I see two clusters of symptoms. These might be distinct bugs or two different presentations of the same one.

With -c2500, replay diverges maybe 1/10 runs in VM and on bare metal. rr hits the target $ip at the right rcb, but the $ecx register mismatches. It records as 0x0, but replays with some random-ish non-null value. So we resume execution and the tracee overshoots by a widely variable number of rcbs. Interestingly, within the same trace, successive runs overshoot variably by a handful of rcbs. So it's not entirely deterministic.

The next symptom is with the default timeslice, reproduces maybe 1/5 runs, but only in my VM. rr hits the target $ip, but 2-4 rcbs short of the target. (I don't know yet if the registers match.) So rr resumes the tracee and it overshoots by a widely variable amount. Unlike the first symptom, the overshoot here seems deterministic.

joneschrisg commented 11 years ago

rr hits the target $ip at the right rcb, but the $ecx register mismatches. It records as 0x0, but replays with some random-ish non-null value.

I should add, the target $ip has so far always been either in JS or DOM-binding code. Maybe a coincidence maybe not.

rr hits the target $ip, but 2-4 rcbs short of the target

The target so far has always been either nsRuleNode::WalkRuleTree or nsRuleNode::PropagateDependentBit.

joneschrisg commented 11 years ago

rr hits the target $ip, but 2-4 rcbs short of the target. (I don't know yet if the registers match.)

Registers match (2/2 runs anyway; getting harder to repro). So the rbc seems to be confused.

joneschrisg commented 11 years ago

Running out of steam, but learnings this evening

the first and second cluster of symptoms seem to just about always happen when the HTML5 parser/builder is doing something. It may be that it's pretty much always doing something, but the coincidence would be awful strong for the second cluster.
the second cluster only reproduces when the system is under load. I accidentally found this out when my officemate forced to stop playing music, and I suddenly couldn't reproduce the divergence anymore ;). That explains why it wasn't reproducing on my otherwise idle bare-metal machine: with artificially induced load, it reproduces on there too.
it's not too hard to reproduce an overshoot on a real FF browsing session. When that happens, the second cluster of symptoms is seen (not surprisingly).
both clusters seem to happen near futex operations, where's there's a futex cmd an event or two before the divergence. Hand-wavy though, and this being gecko, it might be that pretty much all points in time are pretty close to a futex op.
the second cluster seems to always manifest as reaching the target registers at an rbc that's exactly one of { --2, -4, ,+2 (that's #536) } rcbs away the target. There' may be some two-branch operation that either, respecitvely, happens in recording but not replay, happens twice in recording not in replay, or happens in replay but not recording.

Still not a lot to work with , and don't have a quick-failing testcase yet (well, haven't looked ;) ).

However, I tried reproducing the first cluster of symptoms with checksumming enabled, and caught a very suspicious looking failure. A checksum is failing just after a thread issued a pthread_cond_broadcast. The segment is an anonymous private mmap, so could be pretty much anything. It looks like a tracee scratch region though. There's more to follow up on with that trace.

Futexes keep popping up, and the kernel poking at memory behind rr's back would make some of theses symptoms plausible. So that's made me, naturally, think twice about whether rr implements futexes correctly. I don't believe that it does, but I need to double-check the futex spec tomorrow.

joneschrisg commented 11 years ago

My next step here is to do some long-belated FF-unit-test QA. I still don't have a quick and reliable way to trigger this bug, and we'll probably catch some other (hopefully easier! ;) ) bugs on the tests. Maybe some new leads will turn up too. In parallel, I'll fiddle with the scheduling params on the test that does trigger this, and start looking into #532.

joneschrisg commented 11 years ago

The dom/tests/mochitest/ajax/jquery/ seem to reproduce this reliably, in about 30 seconds of recording time.

Interestingly, the mochitests that trigger this bug seem to be very content-JS-heavy, implying perhaps more JIT activity. There are relatively long-running test suites that aren't content-JS heavy that don't seem to trigger this. That's a potential lead to follow up on.

joneschrisg commented 11 years ago

It came out in #555 that gecko maps read-only shmem shared with pulseaudio. I don't know exactly what it's for yet, but presumably gecko expects pulseaudio to change the contents. That's a potential source of divergence (from #320 being NYI), except that checksumming should catch that so I expect those mappings aren't causing the symptoms here.

joneschrisg commented 11 years ago

Only reproduces with syscall buffering enabled

No longer true; just reproduced without the syscallbuf too.

joneschrisg commented 11 years ago

I've been running through several hypotheses without updating this issue (and have had other distractions). But here's where we're at

The overshoot reproduces unacceptably often with the syscallbuf enabled.
Without the syscallbuf, it takes significant effort to reproduce, so I think it's unlikely to significantly affect users.

That leaves us with a (somewhat sad) escape hatch for 0.1, shipping without the syscallbuf.

The lead in which checksums were failing at a futex for pthread_cond_broadcast has fizzled out. The lead looks good but the checksum error is too infrequent and takes too long to reproduce for me to get memory dumps to compare.

Another idea I checked was whether the SM JITs affected the divergences. With all the goodies turned off (JITs, background compilation, PJS) overshoots still reproduce. But luckily, they're much more consistent when they do occur. So mixed bag.

I noticed that overshoot frequency tends to vary with CPU load. That made me suspect frequency scaling. So I tried some experiments, but they were inconclusive because of one unfortunate problem: when I set the right bits to pin CPU frequency, it still varied. Some docs I found implied it's not possible to fully pin the frequency of Sandy Bridge / Ivy Bridge chips. But in either case, the lead is cold.

The last hand-wavy idea I have is that the x86 instructions may be atomic wrt counter interrupts, I wonder if they're not deterministically atomic. For example, during the recording execution, a CAS might succeed, but during the replay execution, it might fail, due to who knows what inside the black box. If an issue like that existed, I think it would still be fair to call the implementation "precise", but obviously that would be bad for rr. (Though maybe not fatal.) But, I don't have a good idea how to test this yet, short of calling on the folks in Santa Clara ;).

So the two last low-cost things I'm going to try are

Record a trace with a small number of rcbs before divergence, and dump all the insns executed before divergence. (I've done this near the divergence, but not all the way.) Maybe something will pop up.
Recheck if a bad trace still diverges w/o breakpoint-on-target-insn. We checked this in the context of #416, but it turns out that a few different bugs were probably involved there, so maybe this symptom is different.

Then the high-cost #532.

rocallahan commented 11 years ago

Thanks for the update!

532 might not take that long if you start with the Ball-Larus path profiler plugin and modify it to append to a TLS log (memory-mapped) instead of incrementing counters.

joneschrisg commented 11 years ago

Sounds like a good idea, will take a look.

joneschrisg commented 10 years ago

Record a trace with a small number of rcbs before divergence, and dump all the insns executed before divergence. (I've done this near the divergence, but not all the way.) Maybe something will pop up.

This is yielding fruit: grabbed a few traces I can work with. The first I'm looking at it elatively easy (though time consuming) to track down, but it looks like either some cairo-xlib flags are diverging (which would be interesting but unlikely), or some newly-malloc'd memory has a divergent value (more likely). Should know tomorrow morning.

joneschrisg commented 10 years ago

or some newly-malloc'd memory has a divergent value (more likely)

It was this, and unfortunately hit a dead end. We diverge because $al is supposed to be 0x01 at a time-slice interrupt, but instead it's 0xc1. The memory cell is within a cairo_xlib_surface_t that's newly-malloc'd, uninitialized heap memory. It's a byte-sized section of bitfield for three bits. The code that sets the bitfields loads the byte memory cell, then masks off the bits not in the bit being set and OR's in the new bit (approximately). So the code doesn't mutate the memory bits that aren't part of the bitfield.

All this means that the two high bits of (byte*)0x68226d68 are set during replay but weren't set during recording, at this time-slice event (time 126705). Tracing backwards, the bits are first set by this statement during time 72527

Line 2354 of "/home/cjones/rr/mozilla-central/js/src/jsinfer.cpp"
bool
TypeObject::addProperty(ExclusiveContext *cx, jsid id, Property **pprop)
2354:
    *pprop = base;

That's a "long time" before the cairo-xlib code runs, and obviously has nothing to do with cairo-xlib. I'm not familiar with those code, but it's setting an outparam memory address that's derived from a LIFO JS allocator, so it seems unlikely the allocated address diverged. Possible though.

What seems more likely is that the diverging memory cell was freed by the JS pool, then reallocated at some other point(s) before the cairo-xlib code runs. The other mutating site(s) were supposed to clear the bits, but didn't for some reaon. To figure that out, we'd need to implement write-watchpoints and then look at the backwards slice from the writes to that cell between events 72527 and 126705. Not too bad, but I'm going to put this aside and see if I can make progress on the other traces I have.

(I double-checked and rr does actually execute munmap calls during replay, which would be another possibility here.)

joneschrisg commented 10 years ago

Another thing I observed working on this trace is that, quite infrequently, the divergence happens at a different time point. So the divergence isn't entirely deterministic :/. There are a few old friends that are suspicious for something like that (CLONE_CLEAR_TID and SKID_REGION hacks come to mind).

joneschrisg commented 10 years ago

Interestingly, the symptom on the second trace is pretty much the same: a byte has the bits x11xxxxx set in replay, but they weren't set in recording. Coincidentally (?), this divergence also occurred near some cairo-xlib code. The bits were last set by ConvertUTF8toUTF16::write(), when converting mozilla-central/dom/locales/en-US/chrome/dom/dom.properties. So we're "stuck" at the same place we were with the first trace.

joneschrisg commented 10 years ago

Third trace is another mystery bit, xx1xxxxx this time, apparently set by js::NewDenseCopiedArray. Don't see anything suspicious in there.

joneschrisg commented 10 years ago

Still making forward (albeit very slow) progress on this lead. The memory divergences I've been seeing made me think of using checksumming to see if there's some "source" for the divergent values. (My kingdom for valgrind-on-rr!) Using a new blend of heuristics, I was able to get checksumming working again on FF runs. The result was very interesting: there are pages being allocated by SM (js::gc::Chunk::allocate) that seem to have persistent and and increasing number of divergent values in bookkeeping metadata (Chunk::init(JSRuntime *rt)). This is a bit hard to believe, so the tools are suspect, but at any rate I put it aside.

I captured a trace in which there's a checksum divergence and an overshoot to see if the two were related. They don't seem to be, but the overshoot proved very interesting. It's again two "mystery bits", 0bxx11xxxxx, diverging in a bitfield byte. The bits are originally set by MOZ_Z_inflate_fast() on behalf of js::FunctionToString(), which is reminiscent of the ConvertUTF8toUTF16 divergence above. However, the member I need to find the script source location was optimized out, and I can't make the function calls I need to reconstruct it :/. This again doesn't mean the bits weren't overwritten.

So more leads to follow.

joneschrisg commented 10 years ago

This again doesn't mean the bits weren't overwritten.

Or not overwritten when they should have been.

joneschrisg commented 10 years ago

The bits are originally set by MOZ_Z_inflate_fast() on behalf of js::FunctionToString() [...] However, the member I need to find the script source location was optimized out, and I can't make the function calls I need to reconstruct it :/. This again doesn't mean the bits weren't overwritten.

This particular symptom is moderately easy to reproduce. I've got it tracked now, in one trace at least, to writing a ' ' char to the heap, in trim: function( text ), as part of decompressing the source of dom/tests/mochitest/ajax/jquery/dist/jquery.js. There's no nondeterminism around here, so it's looking pretty strongly like bits not being cleared that were cleared during recording. My jury-rigged instrumentation can't be extended to catch those kinds of bugs (nontrivially), so I'm ready to throw in the towel. (I have the nagging sensation that if I were cleverer, the answer may still be reachable with what I've got so far, but things are what they are ;) .)

To finish off, I re-checked that a couple "dumb" things weren't the problem (i) direct mmaps; no, copying all mapped files still diverges; (ii) CLEAR_TID-race-sleep timeout; no, setting it to ridiculous 1s still diverges.

joneschrisg commented 10 years ago

I spent a little time this morning to characterize the symptoms we're still seeing here, and approximately how frequent they are. (Got distracted by another bug before posting, sorry.) There are 5 main symptoms I'm seeing (on a mochitest sub-suite I'm focusing on), which may or may have shared causes

overshooting syscalls (mostly write) by 1 or 2 rcbs. That is, at entry to syscall X, the rbc is X - 1, but it should have been X. This might be a HW issue, but it's not clear yet.
under/overshoot timeslice target by +/- 2 rcbs. That is, we're trying to reach rcb X, but registers become equal at X +/- 2, and aren't equal at X. Another possible HW issue, also not clear.
what I call "mystery bits" in $edx. Recording saves the value, e.g., 0x43caf102, but replay sees 0x43caf122. Most likely a SW issue.
like (2), but the registers match "far away" from the rcb target. Around +/-10 off target. Smells like a big divergences, likely SW.
and exactly once I caught a divergence in a tracee-save-data deflate_state struct, for the zipwriter. This is being investigated in #532.

With "upstream" rr, I see symptom (1) about half the time, and symptom (2) a bit less than half. The others reproduce infrequently. On an idle system, (1) predominates. On a loaded system, (2) predominates. (I didn't crunch these numbers but can if someone wants them.)

I tried adding a hack to expand the check in (1) to be target - 1 <= rcb_now <= target + 1, on the theory that it might be a HW bug to work around. The new frequencies depend on whether the system is idle or busy. (These numbers are based on a relatively small sample, to take with a grain of salt.)

busy system: (2) produces about half the time. The rest is evenly divided among (3), (4) and the replay successfully completing.
idle system: (3) is about 80%, (2) 10%, success 10%

All this suggests that

(2) correlates with system load
(1) may "hide" (3), which would further suggest they may be symptoms of the same cause

(This is mostly relevant to external folks, because the debugging approach we would take at the rr level is the same for all these symptoms :). I have glibc just about set up to make an instrumented build.)

joneschrisg commented 10 years ago

Great lead on symptom (1) above: on an uninstrumented build, it seems to almost always reproduce under nsGlobalWindow::Dump(). If I add a syscall(SYS_gettid) just before Dump calls into fputs, then the overshoot still reproduces ... but with a very small number of rcbs (~100). That's pretty telling evidence that something in fputs is diverging. Now that I have a local libc build, I can dig further.

joneschrisg commented 10 years ago

I have the problem bookended around a strlen in libc, like so

  syscall(SYS_gettid);  // (1)
  strlen(...);  // (2)
  syscall(SYS_getpid);  // (3)

Everything is fine at (1). When we get to (3), then $edx and $ecx have diverged from replay. (I'd like to hope this is the "mystery bits" symptom, too.) So strlen is highly suspect, especially because it uses some fairly exotic features.

What's most odd about this is that it doesn't diverge without the syscallbuf. That makes me suspect a kernel bug, in which case we have some workarounds we can bring to bear. But will get this nailed down as far as I can asap.

rocallahan commented 10 years ago

Exciting :-)

joneschrisg commented 10 years ago

I've been testing whether we can switch over to using the insns-retired counter ("irc") instead of rbc. That would solve a lot of problems, among them (most likely) resolving this bug. The results are mixed but somewhat promising.

The irc is way off, "out of the box". Adjusting counts per this paper [1], things don't change much. But subtracting out the stepi's and breakpoints set during replay puts counts in the right ballpark.

First issue is that, on a test that generates only timeslice events, irc doesn't 100% agree with rbc; irc is off by +/-1 occasionally, compared to recording at the target rbc. There's an outside possibility they're both right (e.g. if the test is running insns that don't affect the GPRs, or retired branches are counted differently for irc), but I doubt it. However, it's not really a fair test because we're programming rbc interrupts during replay instead of insns-retired, so the PMU may be behaving slightly differently. That difference is investigate-able.

Second issue is that on a heavily syscallbuf test, the irc disagrees with replay by 10-20 insns. I haven't looked into that deeply yet but it smells like an rr bug, not accounting for some source of overcount (or capturing irc at the wrong place in recording).

Third issue is that the fact that stepi's and breakpoints affect the irc means that hitting execution targets when replay is driven by gdb is going to get harder. But that's "just work".

Switching to irc interrupts solves enough problems that I'm going to keep poking at it until it gets us hard-stuck.

[1] http://web.eece.maine.edu/~vweaver/projects/deterministic/ispass2013_deterministic.pdf

joneschrisg commented 10 years ago

Have been spending some time investigating "the strlen() lead" so that we can be more confident that switching to programming the irc will fix it. What I've found so far

I saved the xfpregs along with the regular register set and compared across record/replay. The xfpregs mostly match; they mismatch at clone and exec, for reasons I don't understand. But most importantly, they match at the strlen() divergence. Feh.
there are two divergence symptoms: (i) ecx/edx differ across record/replay; (ii) rbc overshoot by 1. I saw (i) on its own, and also (i) along with (ii), but never (ii) without (i). So I think we can say that whatever causes (ii) is attributable to (i).
when ecx/edx diverge: when rbc agrees then the insns-retired-counter (irc) also agrees, and when rbc diverges irc diverges. So it seems that whatever leads to symptom (ii) above affects both rbc and irc.
kind of a basic sanity check, but both the address of the string passed to strlen() and its contents agree across record/replay, even when ecx/edx diverge
(my biggest hope so far :/) desched signals (no-op) do not fire around when replay diverges. That makes the fact that divergences ~always occur with syscallbuf but ~never with no-syscallbuf more puzzling.
given a diverging trace X: X diverges because of differing ecx/edx values. Replaying X, except replaying the divergent event by stepi'ing, shows the same ecx/edx divergence. (But also with a new eflags one, filed elsewhere.) So replay ~agrees with stepi-replay, and both disagree similarly from recording.

I also took a closer look at the strlen implementation that my custom-built glibc is running (the one that I've been catching the recent divergences on), and turns out it's not using any exotic features: my build ends up running some i586 "fallback", for whatever reason. That impl doesn't do anything exotic; instead it implements a cute some-zero-byte-in-word detection algorithm, along with some manual loop unrolling, along with some ILP scheduling "guesses". No extension registers or insns used. But it is hand-written asm intended to keep the superscalar pipelines full. No guesses to be made yet, but whatever the problem is looks quite nontrivial.

Next order of business is to see if the strlen() code being context-switched in recording may correlate with replay divergences.

joneschrisg commented 10 years ago

Next order of business is to see if the strlen() code being context-switched in recording may correlate with replay divergences.

After a failed digression to try writing a small strlen test to repro more quickly, and a mini comedy of errors with the perf context-switch event, I finally got some results: during recording, the event reports 2 switches. During replay, at the point we diverge, only 1 switch is reported. But, that happens at all the nondiverging previous events I looked at too. This doesn't count cases where the task traps to the kernel for whatever reason, and another task isn't scheduled in its place. (The desched machinery relies on that!)

So it's still possible that a non-descheduling interrupt fires during the recording. The docs for the HW_INTERRUPTS event we monitor are pretty spotty, but seem like they should serve this purpose. Indeed, if I check USR_SCHED events, they typically record 1-3 "interrupt" events, and 1 or 2 context switches. (Not sure whether the rbc interrupt itself is counted there; I do see a USR_SCHED with 0 interrupts, but that might be due to some other effect It also shows 294 context switches.).

So the interrupt trail is going a bit cold. There's more to do on it, but I think #597 (eflags divergence) is the next easiest and most potentially informative direction.

joneschrisg commented 10 years ago

I went ahead and ruled out a few things that were easier to test than digging through #597

PTRACE_GETREGS race condition: fetching the registers 10 times sequentially changes the relative frequency of divergence symptoms, but the regs are always the same and the strlen divergence still pops up. Similar symptoms for usleep(1000); fetch;, and fetch; usleep(1000); fetch. So pretty clearly the timing, or context switching, affects the divergence, but the tracer->tracee request isn't the problem. (In theory the kernel could be botching the register-save, but that seems about 0% likely.) So this is strongly ruled out.
Exotic CPU pipeline-dependent bug: for syscall(gettid); strlen(); syscall(getpid), we see the register divergence at syscall(getpid). If I add 32 nop's before that call, then the symptom is the same. Not ruled out, but seems more unlikely.
libc syscall helper / vdso: my getpid above is made through a glibc INLINE_SYSCALL() helper that calls into the vdso through tcb info. Replacing that with int $0x80 didn't change anything.

rocallahan commented 10 years ago

Recording what we discussed on IRC for posterity: The code for strlen is viewable here: http://repo.or.cz/w/glibc.git/blob/0d204b0a522b9fef2168e1808a0566c9209d0893:/sysdeps/i386/i586/strlen.S An interesting possibility is that the parity bit gets corrupted somehow just before the "jp" instruction. That would cause a potential rbc mismatch, and the strlen code would still work and give the right answer, but with ecx and edx diverged.

We could test this, and other hypotheses, by capturing the code coverage of each strlen call with some crude instrumentation. One approach would be to create a single-byte global flag for each basic block (I count 20), clear them all to zero before the call to strlen, and in each basic block of strlen use a single instruction to store 1 to its flag. And of course after the strlen call, capture the values to a log. This would not affect registers or EFLAGS and I'm hopeful we could still reproduce the bug. (To be even less invasive, use the free registers (bl, bh, esi, edi, ebp) instead of global flags and instrument 5 basic blocks per experiment, although that gets more complicated to analyze.) Knowing the basic blocks that were executed in both the record and replay, plus the values of edx and ecx, and the address and contents of the string, I think we could deduce almost everything about the execution of strlen in record and replay.

rocallahan commented 10 years ago

Actually I've got a slightly better idea for the instrumentation. Number the basic blocks starting after each label, from one, use one variable per label, and record how many blocks we executed through after jumping to the label. There are 5 labels, and conveniently we have 5 usable registers.

rocallahan commented 10 years ago

I guess we have less registers than that depending on which ones the caller restores in the bookending code. But using registers has the advantage of being thread-local.

joneschrisg commented 10 years ago

Another interesting idea was that the seccomp-bpf code is corrupting eflags. I recorded without the syscallbuf, but with two different filter progs, but didn't see a divergence. That should have been a pretty representative test, because the syscalls that bookend the diverging strlen are made directly through syscall() and so are traced. So that hypothesis looks like a "probably not it".

joneschrisg commented 10 years ago

It's taken me a bit of time to get some asm instrumentation put together, but after working through the annoyances (and with a final bugfix tip from @andreasgal) it's working. I started out with the barest-bones instrumentation, just setting a bit in $edi for each labeled branch target. IIUC that should have been able to rule out a corrupted PF, since L(0) should be hit record XOR replay (and that corresponds to a bit that should have diverged). But unfortunately, I recorded executions where ecx diverged (I need to use edx for instrumentation, so its likely-diverging value is stomped) and the rbc values mismatched ... BUT the edi flags were the same. So I think we can rule out a PF issue.

Next step is to collect block counters.

joneschrisg commented 10 years ago

(FTR, instrumented strlen.S and iofputs.c.)

joneschrisg commented 10 years ago

Next step is to collect block counters.

The last commit counts the number of times the labeled blocks (L(0), L(1), L(2), L(3)) are executed. The same $ecx (and presumably $edx, if it weren't being stomped) divergence is seen. rbc also differs. But, the block counts are the same (!!).

The last suspect seems to be the manually unrolled loop in L(1). If that's indeed where we're diverging, then it would seem that our rbc interrupt isn't actually precise after all.

joneschrisg commented 10 years ago

The last suspect seems to be the manually unrolled loop in L(1).

Nope. On to BBs in L(3).

joneschrisg commented 10 years ago

Nope. On to BBs in L(3).

Nope. scratches head

joneschrisg commented 10 years ago

The zero-testing instrumentation allowed me to easily deduce what $ecx was supposed to be for this execution. The results very strongly indicate that recording is catching CPU state from speculative execution. Yikes! Here's my line of reasoning

The divergence is

 -> ecx 0x7720000a != 0xa (replaying vs. recorded)
[EMERGENCY] (/home/cjones/rr/rr/src/share/util.cc:608:compare_register_files: errno: None) (task 20100 (rec:19900) at trace line 128623)
 -> Assertion `!bail_error || !err' failed to hold: Fatal register mismatch
    (rbc/rec:83/82; irc/rec:388/385)

IO_fputs is calling strlen(str=0x5fa42f20 "947 INFO TEST-PASS | /tests/dom/tests/mochitest/ajax/jquery/test_jQuery.html | fx module: CSS Auto to show - Showing, display should block: block: block\n"). So the strlen is 153, and the NULL byte is at address 0x5fa42fb9. The bytes around the end of string are

          0  1  2  3  4  5  6  7
5fa42fb8 0a 00 20 77 0a 00 00 00

The value of $ecx in replay is 0x7720000a (see above), and in recording it was 0x0a. I.e. the replayed value is

$ecx = 0x 77 20 00 0a

This replayed value is "correct" per the strlen algorithm (executed on a hypothetical in-order non-speculating CPU). The recorded value is "wrong" (on the same hypothetical CPU). Comparing to the bytes around the end of the string, it's further instructive to see how it's wrong.

First, the word that was fetched at address 0x5fa42fb8 is 152 bytes past the start of the string, 0x5fa42f20. That means it's the 38th word fetched, which further means it was fetched in the third "iteration" of the unrolled loop at L(2).

Next, assume that the CPU speculatively executed the fourth iteration of the unrolled loop. (This is reasonable to assume because the jnc L(3) in the third unrolled iteration would be > 99% predicted to not be taken.) What values would be loaded into $ecx? If just the movl (and possibly the following addl, doesn't matter) were speculated, then 0x00 00 00 0a would be loaded into $ecx. The lower two bytes are indistinguishable from the "correct" value of 0x772000a, and the upper two bytes match the "incorrect" value 0x0000000a.

Maybe because the zero-testing code is only reading $cl and $ch, and the "correct" values of those happen to be the same as the "incorrect" speculated value, the CPU somehow isn't determining that it should roll back the speculation? I have no clue.

I'll follow this deduction chain for another divergence.

joneschrisg commented 10 years ago

I should add that I don't have a hypothesis to explain the divergent rbc/irc numbers.

joneschrisg commented 10 years ago

I'll follow this deduction chain for another divergence.

The results are basically the same, except (i) no rbc/irc divergence; (ii) the possibly-speculated value doesn't match in %cl / %ch. I won't go through the long-winded chain of reasoning again, but here's the raw data

 -> ecx 0x90000 != 0x0 (replaying vs. recorded)
[EMERGENCY] (/home/cjones/rr/rr/src/share/util.cc:608:compare_register_files: errno: None) (task 20894 (rec:20842) at trace line 157932)
 -> Assertion `!bail_error || !err' failed to hold: Fatal register mismatch
    (rbc/rec:97/97; irc/rec:453/453)
Recorded regs:
  eax:0xffffffda ebx:0x1 ecx:0x0 edx:0x1 esi:0x0 edi:0x0 ebp:0x0
str=0x6960b880 "1154 INFO TEST-PASS | /tests/dom/tests/mochitest/ajax/jquery/test_jQuery.html | fx module: Chain fadeIn fadeOut - Make sure that overflow is reset (Old: visible Cur: visible): visible\n"
strlen(str) = 184
0x6960b938 - 0x6960b880 = 0xb8 = 184
184 / 16 = 11 rem 2
 == third unrolled loop
Replayed
$ecx = 0x 00 09 00 00
vs. recorded 
$ecx = 0x 00 00 00 00
          0  1  2  3  4  5  6  7
6960b938 00 00 09 00 00 60 00 00

So a speculated load of the next word that wasn't rolled back would explain the difference in $ecx values.

joneschrisg commented 10 years ago

I caught another one of these that's much different. Raw data

 -> ecx 0x4500 != 0x0 (replaying vs. recorded)
[EMERGENCY] (/home/cjones/rr/rr/src/share/util.cc:608:compare_register_files: errno: None) (task 22418 (rec:22363) at trace line 129089)
 -> Assertion `!bail_error || !err' failed to hold: Fatal register mismatch
    (rbc/rec:79/79; irc/rec:372/372)
  eax:0xffffffda ebx:0x1 ecx:0x0 edx:0x1 esi:0x0 edi:0x0 ebp:0x0
str=0x60455e80 "943 INFO TEST-PASS | /tests/dom/tests/mochitest/ajax/jquery/test_jQuery.html | fx module: CSS Auto to 100 - Final width should be 100: 100px: 100px\n"
strlen(str) = 148
0x60455f14 - 0x60455e80 = 0x94 = 148
148 / 16 = 9 rem 1
 == second unrolled loop
Replayed
$ecx = 0x 00 00 45 00
vs. recorded 
$ecx = 0x 00 00 00 00
          0  1  2  3  4  5  6  7
60455f14 00 69 00 00 00 35 81 84

I'm not sure why $ch is 0x45 instead of 0x69, but I haven't worked out what the bytes should be for both any-byte-in-word-is-zero tests in the unrolled loop iterations.

A not-rolled-back-speculative-load doesn't explain this divergence. However, another hypothesis emerges: perhaps there's a speculative load to $ecx, but the register bytes past the last read of $cl/$ch are rolled back to zero. Looking over the three executions

Replayed
$ecx = 0x 77 20 00 0a
vs. recorded
$ecx = 0x 00 00 00 0a
(NULL byte is in $ch)
Replayed
$ecx = 0x 00 09 00 00
vs. recorded 
$ecx = 0x 00 00 00 00
(NULL byte is in $cl)
Replayed
$ecx = 0x 00 00 45 00
vs. recorded 
$ecx = 0x 00 00 00 00
(NULL byte is in $cl)

So the bytes up to the last read of cl/ch match, but the ones following it are all 0x00. That's a pretty weak hypothesis. Another one is that some kind of interrupt-handler (??) code clears $ecx but doesn't properly save/restore in recording and not in replay.

joneschrisg commented 10 years ago

$ch is 0x45 instead of 0x69

Oops, that's because I dumped something in decimal that supposed to be hex. 69 == 0x45.

joneschrisg commented 10 years ago

Another run

 -> ecx 0x5bcdd600 != 0x0 (replaying vs. recorded)
[EMERGENCY] (/home/cjones/rr/rr/src/share/util.cc:608:compare_register_files: errno: None) (task 23262 (rec:23210) at trace line 131626)
 -> Assertion `!bail_error || !err' failed to hold: Fatal register mismatch
    (rbc/rec:80/79; irc/rec:375/372)
  eax:0xffffffda ebx:0x1 ecx:0x0 edx:0x1 esi:0x0 edi:0x0 ebp:0x0
str=0x5fde5c00 "972 INFO TEST-PASS | /tests/dom/tests/mochitest/ajax/jquery/test_jQuery.html | fx module: JS Auto to 100 - Final height should be 100: 100px: 100px\n"
strlen(str) = 148
148 / 16 = 9 rem 1
 == second unrolled loop
Replayed
$ecx = 0x 5b cd d6 00
vs. recorded 
$ecx = 0x 00 00 00 00
          0  1  2  3  4  5  6  7
5fde5c00 00 d6 cd 5b 01 00 00 00

So nothing new added.

joneschrisg commented 10 years ago

Another possibility is that the word just past the end of the string is being written racily somehow.

joneschrisg commented 10 years ago

Another possibility is that the word just past the end of the string is being written racily somehow.

I should add, this looks more likely because I'm now logging $edx, and the divergent ecx/edx value pairs seem to be consistent with taking different L(1) loop exits across record/replay. (That's also a much more plausible hypothesis than a CPU bug!) Testing by instrumenting nsGlobalWindow::dump().

joneschrisg commented 10 years ago

Another possibility is that the word just past the end of the string is being written racily somehow.

It's this. The symptom is similar to other leads that have gone cold in the past: some memory cell is cleared in recording but isn't cleared during replay. This is pretty hard to track down because we don't know a priori which cell should be cleared during recording. I've got some instrumentation that's throwing off leads, but nothing conclusive yet.

joneschrisg commented 10 years ago

It strikes me that if rr could record rr record, then we could sort this bug pretty easily. That's been a tongue-in-cheek goal but maybe we should take it a little more seriously.

joneschrisg commented 10 years ago

It strikes me that if rr could record rr record, then we could sort this bug pretty easily.

Actually, on second thought, I don't have any reason to expect that the same divergence behavior wouldn't also happen in the "outer" recording. So nm.

rocallahan commented 10 years ago

The memory contents diverge before the call to strlen, right?

Are there any patterns governing the alignment of the diverging memory range? How about its length? Has anything been allocated that overlaps the diverging memory range? (I assume that in these dump() cases the diverging memory range is in the JS heap?)

joneschrisg commented 10 years ago

The memory contents diverge before the call to strlen, right?

Right. My instrumentation is

    // Record the string bytes rounded up to a multiple of 16.
    char* cEnd = cstr;
    while (*cEnd) ++cEnd;
    syscall(4/*SYS_write*/, -42, cstr, ((cEnd - cstr) + 15) & ~15);

inserted just before the call to fputs() here.

Are there any patterns governing the alignment of the diverging memory range? How about its length?

The key to the length is that the \0 byte must be at index i where i % 4 != 3. That means the last word-load will read bytes past the end of the string. Beyond that, the lengths have been around 140-200 bytes. (See below about alignment.)

Has anything been allocated that overlaps the diverging memory range? (I assume that in these dump() cases the diverging memory range is in the JS heap?)

The string is allocated by ToNewUTF8String(), which (through a torturous route) ends up allocating from jemalloc. IIRC jemalloc hands out 16-byte-aligned segments (or maybe it was 8). Either way, the "garbage" memory that strlen is touching should be within the jemalloc block.

rr-debugger / rr

Memory contents diverging #544

532 might not take that long if you start with the Ball-Larus path profiler plugin and modify it to append to a TLS log (memory-mapped) instead of incrementing counters.