> - Only reproduces 1/10 or 1/20 runs or so
This is with a timeslice of 2500. With the default timeslice, the overshoots reproduce a bit more easily, for whatever reason.
I don't see any failures, VM or bare metal, on netwerk/test/reftest/reftest.list, with any time-slice I've tested.
On layout/reftests/transform/reftest.list, I see two clusters of symptoms. These might be distinct bugs or two different presentations of the same one.
With -c2500, replay diverges maybe 1/10 runs in VM and on bare metal. rr hits the target $ip at the right rcb, but the $ecx register mismatches. It records as 0x0, but replays with some random-ish non-null value. So we resume execution and the tracee overshoots by a widely variable number of rcbs. Interestingly, within the same trace, successive runs overshoot variably by a handful of rcbs. So it's not entirely deterministic.
The next symptom is with the default timeslice and reproduces maybe 1/5 runs, but only in my VM. rr hits the target $ip, but 2-4 rcbs short of the target. (I don't know yet if the registers match.) So rr resumes the tracee and it overshoots by a widely variable amount. Unlike the first symptom, the overshoot here seems deterministic.
> rr hits the target $ip at the right rcb, but the $ecx register mismatches. It records as 0x0, but replays with some random-ish non-null value.
I should add that the target $ip has so far always been in either JS or DOM-binding code. Maybe a coincidence, maybe not.
> rr hits the target $ip, but 2-4 rcbs short of the target
The target so far has always been either nsRuleNode::WalkRuleTree or nsRuleNode::PropagateDependentBit.
> rr hits the target $ip, but 2-4 rcbs short of the target. (I don't know yet if the registers match.)
Registers match (2/2 runs anyway; getting harder to repro). So the rbc seems to be confused.
Running out of steam, but some learnings from this evening:
Still not a lot to work with, and I don't have a quick-failing testcase yet (well, I haven't looked ;) ).
However, I tried reproducing the first cluster of symptoms with checksumming enabled, and caught a very suspicious-looking failure. A checksum is failing just after a thread issued a pthread_cond_broadcast. The segment is an anonymous private mmap, so it could be pretty much anything. It looks like a tracee scratch region though. There's more to follow up on with that trace.
Futexes keep popping up, and the kernel poking at memory behind rr's back would make some of these symptoms plausible. So that's made me, naturally, think twice about whether rr implements futexes correctly. I don't believe that it does, but I need to double-check the futex spec tomorrow.
My next step here is to do some long-belated FF-unit-test QA. I still don't have a quick and reliable way to trigger this bug, and we'll probably catch some other (hopefully easier! ;) ) bugs on the tests. Maybe some new leads will turn up too. In parallel, I'll fiddle with the scheduling params on the test that does trigger this, and start looking into #532.
The dom/tests/mochitest/ajax/jquery/ tests seem to reproduce this reliably, in about 30 seconds of recording time.
Interestingly, the mochitests that trigger this bug seem to be very content-JS-heavy, implying perhaps more JIT activity. There are relatively long-running test suites that aren't content-JS heavy that don't seem to trigger this. That's a potential lead to follow up on.
It came out in #555 that gecko maps read-only shmem shared with pulseaudio. I don't know exactly what it's for yet, but presumably gecko expects pulseaudio to change the contents. That's a potential source of divergence (from #320 being NYI), except that checksumming should catch that so I expect those mappings aren't causing the symptoms here.
> - Only reproduces with syscall buffering enabled
No longer true; just reproduced without the syscallbuf too.
I've been running through several hypotheses without updating this issue (and have had other distractions). But here's where we're at:
That leaves us with a (somewhat sad) escape hatch for 0.1: shipping without the syscallbuf.
The lead in which checksums were failing at a futex for pthread_cond_broadcast has fizzled out. The lead still looks good, but the checksum error is too infrequent and takes too long to reproduce for me to get memory dumps to compare.
Another idea I checked was whether the SM JITs affect the divergences. With all the goodies turned off (JITs, background compilation, PJS), overshoots still reproduce. But luckily, they're much more consistent when they do occur. So it's a mixed bag.
I noticed that overshoot frequency tends to vary with CPU load. That made me suspect frequency scaling. So I tried some experiments, but they were inconclusive because of one unfortunate problem: when I set the right bits to pin CPU frequency, it still varied. Some docs I found implied it's not possible to fully pin the frequency of Sandy Bridge / Ivy Bridge chips. But in either case, the lead is cold.
The last hand-wavy idea I have is that, while the x86 instructions may be atomic wrt counter interrupts, they may not be deterministically atomic. For example, during the recording execution a CAS might succeed, but during the replay execution it might fail, due to who knows what inside the black box. If an issue like that existed, I think it would still be fair to call the implementation "precise", but obviously that would be bad for rr. (Though maybe not fatal.) But, I don't have a good idea how to test this yet, short of calling on the folks in Santa Clara ;).
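To make the worry concrete, here's a minimal sketch (illustrative only, not rr code) of the kind of CAS retry loop where this would matter: if the compare-exchange succeeded during recording but failed during replay, the loop's backward conditional branch would retire a different number of times, and the rbc would no longer line up.

```c
#include <stdatomic.h>

/* Illustrative CAS retry loop. Each failed compare-exchange takes the
 * backward branch one more time, so a record/replay difference in CAS
 * outcomes would show up directly as an rbc difference. */
static void atomic_increment(atomic_int *p) {
    int old = atomic_load(p);
    while (!atomic_compare_exchange_weak(p, &old, old + 1)) {
        /* 'old' was refreshed with the current value; retry. */
    }
}
```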
So the last two low-cost things I'm going to try are:
Then the high-cost #532.
Thanks for the update!
Sounds like a good idea, will take a look.
Record a trace with a small number of rcbs before divergence, and dump all the insns executed before divergence. (I've done this near the divergence, but not all the way.) Maybe something will pop up.
This is yielding fruit: I grabbed a few traces I can work with. The first one I'm looking at is relatively easy (though time-consuming) to track down, but it looks like either some cairo-xlib flags are diverging (which would be interesting but unlikely), or some newly-malloc'd memory has a divergent value (more likely). Should know tomorrow morning.
> or some newly-malloc'd memory has a divergent value (more likely)
It was this, and unfortunately it hit a dead end. We diverge because $al is supposed to be 0x01 at a time-slice interrupt, but instead it's 0xc1. The memory cell is within a cairo_xlib_surface_t that's newly-malloc'd, uninitialized heap memory. It's a byte-sized section of bitfield holding three bits. The code that sets the bitfields loads the byte memory cell, masks off just the bits of the field being set, and ORs in the new value (approximately). So the code doesn't mutate the memory bits that aren't part of the bitfield.
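For reference, here's a minimal sketch of that kind of read-modify-write, assuming a three-bit field in the low bits of the byte (the mask and names are illustrative, not cairo's actual layout):

```c
#include <stdint.h>

#define FIELD_MASK 0x07u  /* assume the bitfield occupies the low three bits */

static void set_field(uint8_t *cell, uint8_t value) {
    uint8_t byte = *cell;              /* load the whole byte */
    byte &= (uint8_t)~FIELD_MASK;      /* clear only the field's bits */
    byte |= (value & FIELD_MASK);      /* OR in the new value */
    *cell = byte;                      /* bits outside the field are untouched */
}
```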
All this means that the two high bits of (byte*)0x68226d68 are set during replay but weren't set during recording, at this time-slice event (time 126705). Tracing backwards, the bits are first set by this statement during time 72527:

```
Line 2354 of "/home/cjones/rr/mozilla-central/js/src/jsinfer.cpp"
   bool TypeObject::addProperty(ExclusiveContext *cx, jsid id, Property **pprop)
2354:       *pprop = base;
```
That's a "long time" before the cairo-xlib code runs, and obviously has nothing to do with cairo-xlib. I'm not familiar with that code, but it's setting an outparam memory address that's derived from a LIFO JS allocator, so it seems unlikely the allocated address diverged. Possible though.
What seems more likely is that the diverging memory cell was freed by the JS pool, then reallocated at some other point(s) before the cairo-xlib code runs. The other mutating site(s) were supposed to clear the bits, but didn't for some reason. To figure that out, we'd need to implement write-watchpoints and then look at the backwards slice from the writes to that cell between events 72527 and 126705. Not too bad, but I'm going to put this aside and see if I can make progress on the other traces I have.
(I double-checked and rr does actually execute munmap calls during replay, which would be another possibility here.)
Another thing I observed working on this trace is that, quite infrequently, the divergence happens at a different time point. So the divergence isn't entirely deterministic :/. There are a few old friends that are suspicious for something like that (CLONE_CLEAR_TID and SKID_REGION hacks come to mind).
Interestingly, the symptom on the second trace is pretty much the same: a byte has the bits x11xxxxx set in replay, but they weren't set in recording. Coincidentally (?), this divergence also occurred near some cairo-xlib code. The bits were last set by ConvertUTF8toUTF16::write(), when converting mozilla-central/dom/locales/en-US/chrome/dom/dom.properties. So we're "stuck" at the same place we were with the first trace.
The third trace is another mystery bit, xx1xxxxx this time, apparently set by js::NewDenseCopiedArray. I don't see anything suspicious in there.
Still making forward (albeit very slow) progress on this lead. The memory divergences I've been seeing made me think of using checksumming to see if there's some "source" for the divergent values. (My kingdom for valgrind-on-rr!) Using a new blend of heuristics, I was able to get checksumming working again on FF runs. The result was very interesting: there are pages being allocated by SM (js::gc::Chunk::allocate) that seem to have a persistent and increasing number of divergent values in bookkeeping metadata (Chunk::init(JSRuntime *rt)). This is a bit hard to believe, so the tools are suspect, but at any rate I put it aside.
I captured a trace in which there's a checksum divergence and an overshoot, to see if the two were related. They don't seem to be, but the overshoot proved very interesting. It's again two "mystery bits", 0bxx11xxxxx, diverging in a bitfield byte. The bits are originally set by MOZ_Z_inflate_fast() on behalf of js::FunctionToString(), which is reminiscent of the ConvertUTF8toUTF16 divergence above. However, the member I need to find the script source location was optimized out, and I can't make the function calls I need to reconstruct it :/. This again doesn't mean the bits weren't overwritten.
So more leads to follow.
> This again doesn't mean the bits weren't overwritten.
Or not overwritten when they should have been.
> The bits are originally set by MOZ_Z_inflate_fast() on behalf of js::FunctionToString() [...] However, the member I need to find the script source location was optimized out, and I can't make the function calls I need to reconstruct it :/. This again doesn't mean the bits weren't overwritten.
This particular symptom is moderately easy to reproduce. I've got it tracked now, in one trace at least, to writing a ' ' char to the heap, in trim: function( text ), as part of decompressing the source of dom/tests/mochitest/ajax/jquery/dist/jquery.js. There's no nondeterminism around here, so it's looking pretty strongly like bits not being cleared in replay that were cleared during recording. My jury-rigged instrumentation can't be extended (at least not trivially) to catch those kinds of bugs, so I'm ready to throw in the towel. (I have the nagging sensation that if I were cleverer, the answer might still be reachable with what I've got so far, but things are what they are ;) .)
To finish off, I re-checked that a couple of "dumb" things weren't the problem: (i) direct mmaps: no, copying all mapped files still diverges; (ii) the CLEAR_TID-race-sleep timeout: no, setting it to a ridiculous 1s still diverges.
I spent a little time this morning characterizing the symptoms we're still seeing here, and approximately how frequent they are. (Got distracted by another bug before posting, sorry.) There are 5 main symptoms I'm seeing (on a mochitest sub-suite I'm focusing on), which may or may not have shared causes. One of them is the rbc being off at a syscall (e.g. write) by 1 or 2 rcbs. That is, at entry to syscall X, the rbc is X - 1, but it should have been X. This might be a HW issue, but it's not clear yet.

With "upstream" rr, I see symptom (1) about half the time, and symptom (2) a bit less than half. The others reproduce infrequently. On an idle system, (1) predominates. On a loaded system, (2) predominates. (I didn't crunch these numbers but can if someone wants them.)
I tried adding a hack to expand the check in (1) to be target - 1 <= rcb_now <= target + 1, on the theory that it might be a HW bug to work around. The new frequencies depend on whether the system is idle or busy. (These numbers are based on a relatively small sample, so take them with a grain of salt.)
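The hack amounts to nothing more than relaxing the equality test to a +/-1 window, roughly like this (a sketch; the names aren't rr's actual internals):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical relaxed version of the "did we reach the target rcb?" check. */
static bool reached_target(int64_t rcb_now, int64_t target) {
    return target - 1 <= rcb_now && rcb_now <= target + 1;
}
```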
All this suggests that:
(This is mostly relevant to external folks, because the debugging approach we would take at the rr level is the same for all these symptoms :). I have glibc just about set up to make an instrumented build.)
Great lead on symptom (1) above: on an uninstrumented build, it seems to almost always reproduce under nsGlobalWindow::Dump(). If I add a syscall(SYS_gettid) just before Dump calls into fputs, then the overshoot still reproduces ... but with a very small number of rcbs (~100). That's pretty telling evidence that something in fputs is diverging. Now that I have a local libc build, I can dig further.
I have the problem bookended around a strlen in libc, like so:

```c
syscall(SYS_gettid);  // (1)
strlen(...);          // (2)
syscall(SYS_getpid);  // (3)
```
Everything is fine at (1). When we get to (3), $edx and $ecx have diverged in replay. (I'd like to hope this is the "mystery bits" symptom, too.) So strlen is highly suspect, especially because it uses some fairly exotic features.
What's most odd about this is that it doesn't diverge without the syscallbuf. That makes me suspect a kernel bug, in which case we have some workarounds we can bring to bear. But I'll get this nailed down as far as I can ASAP.
Exciting :-)
I've been testing whether we can switch over to using the insns-retired counter ("irc") instead of rbc. That would solve a lot of problems, among them (most likely) resolving this bug. The results are mixed but somewhat promising.
The irc is way off "out of the box". Adjusting counts per this paper [1] doesn't change things much. But subtracting out the stepi's and breakpoints set during replay puts the counts in the right ballpark.

The first issue is that, on a test that generates only timeslice events, irc doesn't 100% agree with rbc; irc is off by +/-1 occasionally, compared to recording at the target rbc. There's an outside possibility they're both right (e.g. if the test is running insns that don't affect the GPRs, or retired branches are counted differently for irc), but I doubt it. However, it's not really a fair test, because we're programming rbc interrupts during replay instead of insns-retired, so the PMU may be behaving slightly differently. That difference is investigable.

The second issue is that on a heavily syscallbuf'd test, the irc disagrees with replay by 10-20 insns. I haven't looked into that deeply yet, but it smells like an rr bug: not accounting for some source of overcount (or capturing the irc at the wrong place in recording).

The third issue is that, because stepi's and breakpoints affect the irc, hitting execution targets when replay is driven by gdb is going to get harder. But that's "just work".
Switching to irc interrupts solves enough problems that I'm going to keep poking at it until it gets us hard-stuck.
[1] http://web.eece.maine.edu/~vweaver/projects/deterministic/ispass2013_deterministic.pdf
I have been spending some time investigating "the strlen() lead" so that we can be more confident that switching to programming the irc will fix it. What I've found so far: things mismatch around clone and exec, for reasons I don't understand. But most importantly, they match at the strlen() divergence. Feh.

I also took a closer look at the strlen implementation that my custom-built glibc is running (the one I've been catching the recent divergences on), and it turns out it's not using any exotic features: my build ends up running some i586 "fallback", for whatever reason. That impl doesn't do anything exotic; instead it implements a cute some-zero-byte-in-word detection algorithm, along with some manual loop unrolling and some ILP scheduling "guesses". No extension registers or insns are used. But it is hand-written asm intended to keep the superscalar pipelines full. No guesses to be made yet, but whatever the problem is, it looks quite nontrivial.
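For the curious, the some-zero-byte-in-word trick is the well-known bit hack below (this is the generic folklore form, not necessarily the exact sequence the i586 strlen.S uses):

```c
#include <stdint.h>

/* Nonzero iff some byte of w is 0x00: subtracting 1 from every byte borrows
 * through any zero byte and sets its high bit, and '& ~w' discards bytes
 * whose high bit was already set. */
static int word_has_zero_byte(uint32_t w) {
    return ((w - 0x01010101u) & ~w & 0x80808080u) != 0;
}
```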
Next order of business is to see if the strlen() code being context-switched in recording may correlate with replay divergences.
> Next order of business is to see if the strlen() code being context-switched in recording may correlate with replay divergences.
After a failed digression to try writing a small strlen test to repro more quickly, and a mini comedy of errors with the perf context-switch event, I finally got some results: during recording, the event reports 2 switches. During replay, at the point we diverge, only 1 switch is reported. But that happens at all the nondiverging previous events I looked at too. This doesn't count cases where the task traps to the kernel for whatever reason and another task isn't scheduled in its place. (The desched machinery relies on that!)
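For context, counting context switches this way boils down to opening a PERF_COUNT_SW_CONTEXT_SWITCHES software event on the task and reading it; a minimal standalone sketch (not rr's code) looks roughly like:

```c
#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Open a per-task context-switch counter (any CPU, no group, no flags). */
static int open_ctx_switch_counter(pid_t tid) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_SOFTWARE;
    attr.config = PERF_COUNT_SW_CONTEXT_SWITCHES;
    return syscall(SYS_perf_event_open, &attr, tid, -1, -1, 0);
}

int main(void) {
    int fd = open_ctx_switch_counter(getpid());
    if (fd < 0) { perror("perf_event_open"); return 1; }
    usleep(10 * 1000);                   /* stand-in for "the interval of interest" */
    uint64_t count = 0;
    if (read(fd, &count, sizeof(count)) != sizeof(count)) return 1;
    printf("context switches: %llu\n", (unsigned long long)count);
    return 0;
}
```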
So it's still possible that a non-descheduling interrupt fires during the recording. The docs for the HW_INTERRUPTS event we monitor are pretty spotty, but seem like they should serve this purpose. Indeed, if I check USR_SCHED events, they typically record 1-3 "interrupt" events, and 1 or 2 context switches. (Not sure whether the rbc interrupt itself is counted there; I do see a USR_SCHED with 0 interrupts, but that might be due to some other effect It also shows 294 context switches.).
So the interrupt trail is going a bit cold. There's more to do on it, but I think #597 (eflags divergence) is the next easiest and most potentially informative direction.
I went ahead and ruled out a few things that were easier to test than digging through #597:

- The strlen divergence still pops up. Similar symptoms for usleep(1000); fetch; and fetch; usleep(1000); fetch. So pretty clearly the timing, or context switching, affects the divergence, but the tracer->tracee request isn't the problem. (In theory the kernel could be botching the register-save, but that seems about 0% likely.) So this is strongly ruled out.
- With syscall(gettid); strlen(); syscall(getpid), we see the register divergence at syscall(getpid). If I add 32 nop's before that call, then the symptom is the same. Not ruled out, but it seems more unlikely.
- The getpid above is made through a glibc INLINE_SYSCALL() helper that calls into the vdso through tcb info. Replacing that with int $0x80 didn't change anything.

Recording what we discussed on IRC for posterity: the code for strlen is viewable here: http://repo.or.cz/w/glibc.git/blob/0d204b0a522b9fef2168e1808a0566c9209d0893:/sysdeps/i386/i586/strlen.S

An interesting possibility is that the parity bit gets corrupted somehow just before the "jp" instruction. That would cause a potential rbc mismatch, and the strlen code would still work and give the right answer, but with ecx and edx diverged.
We could test this, and other hypotheses, by capturing the code coverage of each strlen call with some crude instrumentation. One approach would be to create a single-byte global flag for each basic block (I count 20), clear them all to zero before the call to strlen, and in each basic block of strlen use a single instruction to store 1 to its flag. And of course after the strlen call, capture the values to a log. This would not affect registers or EFLAGS and I'm hopeful we could still reproduce the bug. (To be even less invasive, use the free registers (bl, bh, esi, edi, ebp) instead of global flags and instrument 5 basic blocks per experiment, although that gets more complicated to analyze.) Knowing the basic blocks that were executed in both the record and replay, plus the values of edx and ecx, and the address and contents of the string, I think we could deduce almost everything about the execution of strlen in record and replay.
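A rough sketch of the global-flag variant, to make the shape of the instrumentation concrete (names and the magic log fd are illustrative; the actual flag stores would be movb-immediate instructions patched into strlen.S, which touch neither the GPRs nor EFLAGS):

```c
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#define NUM_BLOCKS 20
/* One byte per basic block; the patched strlen.S would do
 * "movb $1, strlen_bb_flags+N" at the top of block N. */
volatile unsigned char strlen_bb_flags[NUM_BLOCKS];

static size_t logged_strlen(const char *s) {
    memset((void *)strlen_bb_flags, 0, sizeof(strlen_bb_flags)); /* clear before the call */
    size_t len = strlen(s);                                      /* runs the patched strlen */
    /* Dump the coverage flags to the log (same magic-fd trick used elsewhere here). */
    syscall(SYS_write, -42, strlen_bb_flags, sizeof(strlen_bb_flags));
    return len;
}
```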
Actually I've got a slightly better idea for the instrumentation. Number the basic blocks starting after each label, from one, use one variable per label, and record how many blocks we executed through after jumping to the label. There are 5 labels, and conveniently we have 5 usable registers.
I guess we have fewer registers than that, depending on which ones the caller restores in the bookending code. But using registers has the advantage of being thread-local.
Another interesting idea was that the seccomp-bpf code is corrupting eflags. I recorded without the syscallbuf, but with two different filter progs, and didn't see a divergence. That should have been a pretty representative test, because the syscalls that bookend the diverging strlen are made directly through syscall() and so are traced. So that hypothesis looks like a "probably not it".
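For reference, a "filter prog" here is just a classic-BPF program installed with prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, ...); a minimal allow-everything example (purely illustrative, not the filter rr actually installs) looks like:

```c
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/prctl.h>

static int install_allow_all_filter(void) {
    struct sock_filter insns[] = {
        /* Load the syscall number (unused here, but the typical first insn). */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = sizeof(insns) / sizeof(insns[0]),
        .filter = insns,
    };
    /* Required so an unprivileged process may install a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```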
It's taken me a bit of time to get some asm instrumentation put together, but after working through the annoyances (and with a final bugfix tip from @andreasgal) it's working. I started out with the barest-bones instrumentation, just setting a bit in $edi for each labeled branch target. IIUC, that should have been able to rule out a corrupted PF, since L(0) should be hit in record XOR replay (and that corresponds to a bit that should have diverged). But unfortunately, I recorded executions where ecx diverged (I need to use edx for instrumentation, so its likely-diverging value is stomped) and the rbc values mismatched ... BUT the edi flags were the same. So I think we can rule out a PF issue.
Next step is to collect block counters.
> Next step is to collect block counters.
The last commit counts the number of times the labeled blocks (L(0), L(1), L(2), L(3)) are executed. The same $ecx (and presumably $edx, if it weren't being stomped) divergence is seen. rbc also differs. But, the block counts are the same (!!).
The last suspect seems to be the manually unrolled loop in L(1). If that's indeed where we're diverging, then it would seem that our rbc interrupt isn't actually precise after all.
> The last suspect seems to be the manually unrolled loop in L(1).
Nope. On to BBs in L(3).
> Nope. On to BBs in L(3).
Nope. *scratches head*
The zero-testing instrumentation allowed me to easily deduce what $ecx was supposed to be for this execution. The results very strongly indicate that recording is catching CPU state from speculative execution. Yikes! Here's my line of reasoning:
The divergence is:

```
-> ecx 0x7720000a != 0xa (replaying vs. recorded)
[EMERGENCY] (/home/cjones/rr/rr/src/share/util.cc:608:compare_register_files: errno: None)
(task 20100 (rec:19900) at trace line 128623)
 -> Assertion `!bail_error || !err' failed to hold: Fatal register mismatch (rbc/rec:83/82; irc/rec:388/385)
```
IO_fputs is calling strlen(str=0x5fa42f20 "947 INFO TEST-PASS | /tests/dom/tests/mochitest/ajax/jquery/test_jQuery.html | fx module: CSS Auto to show - Showing, display should block: block: block\n"). So the strlen is 153, and the NULL byte is at address 0x5fa42fb9. The bytes around the end of the string are:

```
          0  1  2  3  4  5  6  7
5fa42fb8 0a 00 20 77 0a 00 00 00
```
The value of $ecx in replay is 0x7720000a (see above), and in recording it was 0x0a. I.e. the replayed value is
$ecx = 0x 77 20 00 0a
This replayed value is "correct" per the strlen algorithm (executed on a hypothetical in-order non-speculating CPU). The recorded value is "wrong" (on the same hypothetical CPU). Comparing to the bytes around the end of the string, it's further instructive to see how it's wrong.
First, the word that was fetched at address 0x5fa42fb8 is 152 bytes past the start of the string, 0x5fa42f20. That means it's the 38th word fetched, which further means it was fetched in the third "iteration" of the unrolled loop at L(2).
Next, assume that the CPU speculatively executed the fourth iteration of the unrolled loop. (This is reasonable to assume because the jnc L(3) in the third unrolled iteration would be >99% predicted to not be taken.) What values would be loaded into $ecx? If just the movl (and possibly the following addl; it doesn't matter) were speculated, then 0x00 00 00 0a would be loaded into $ecx. The lower two bytes are indistinguishable from the "correct" value of 0x7720000a, and the upper two bytes match the "incorrect" value 0x0000000a.
Maybe because the zero-testing code is only reading $cl and $ch, and the "correct" values of those happen to be the same as the "incorrect" speculated value, the CPU somehow isn't determining that it should roll back the speculation? I have no clue.
I'll follow this deduction chain for another divergence.
I should add that I don't have a hypothesis to explain the divergent rbc/irc numbers.
> I'll follow this deduction chain for another divergence.
The results are basically the same, except (i) no rbc/irc divergence; (ii) the possibly-speculated value doesn't match in %cl / %ch. I won't go through the long-winded chain of reasoning again, but here's the raw data:
```
-> ecx 0x90000 != 0x0 (replaying vs. recorded)
[EMERGENCY] (/home/cjones/rr/rr/src/share/util.cc:608:compare_register_files: errno: None)
(task 20894 (rec:20842) at trace line 157932)
 -> Assertion `!bail_error || !err' failed to hold: Fatal register mismatch (rbc/rec:97/97; irc/rec:453/453)
Recorded regs: eax:0xffffffda ebx:0x1 ecx:0x0 edx:0x1 esi:0x0 edi:0x0 ebp:0x0

str=0x6960b880 "1154 INFO TEST-PASS | /tests/dom/tests/mochitest/ajax/jquery/test_jQuery.html | fx module: Chain fadeIn fadeOut - Make sure that overflow is reset (Old: visible Cur: visible): visible\n"
strlen(str) = 184
0x6960b938 - 0x6960b880 = 0xb8 = 184
184 / 16 = 11 rem 2 == third unrolled loop

Replayed $ecx = 0x 00 09 00 00  vs.  recorded $ecx = 0x 00 00 00 00

          0  1  2  3  4  5  6  7
6960b938 00 00 09 00 00 60 00 00
```
So a speculated load of the next word that wasn't rolled back would explain the difference in $ecx values.
I caught another one of these that's much different. Raw data:

```
-> ecx 0x4500 != 0x0 (replaying vs. recorded)
[EMERGENCY] (/home/cjones/rr/rr/src/share/util.cc:608:compare_register_files: errno: None)
(task 22418 (rec:22363) at trace line 129089)
 -> Assertion `!bail_error || !err' failed to hold: Fatal register mismatch (rbc/rec:79/79; irc/rec:372/372)
eax:0xffffffda ebx:0x1 ecx:0x0 edx:0x1 esi:0x0 edi:0x0 ebp:0x0

str=0x60455e80 "943 INFO TEST-PASS | /tests/dom/tests/mochitest/ajax/jquery/test_jQuery.html | fx module: CSS Auto to 100 - Final width should be 100: 100px: 100px\n"
strlen(str) = 148
0x60455f14 - 0x60455e80 = 0x94 = 148
148 / 16 = 9 rem 1 == second unrolled loop

Replayed $ecx = 0x 00 00 45 00  vs.  recorded $ecx = 0x 00 00 00 00

          0  1  2  3  4  5  6  7
60455f14 00 69 00 00 00 35 81 84
```
I'm not sure why $ch is 0x45 instead of 0x69, but I haven't worked out what the bytes should be for both any-byte-in-word-is-zero tests in the unrolled loop iterations.
A not-rolled-back-speculative-load doesn't explain this divergence. However, another hypothesis emerges: perhaps there's a speculative load to $ecx, but the register bytes past the last read of $cl/$ch are rolled back to zero. Looking over the three executions:

```
Replayed $ecx = 0x 77 20 00 0a  vs.  recorded $ecx = 0x 00 00 00 0a  (NULL byte is in $ch)
Replayed $ecx = 0x 00 09 00 00  vs.  recorded $ecx = 0x 00 00 00 00  (NULL byte is in $cl)
Replayed $ecx = 0x 00 00 45 00  vs.  recorded $ecx = 0x 00 00 00 00  (NULL byte is in $cl)
```
So the bytes up to the last read of cl/ch match, but the ones following it are all 0x00. That's a pretty weak hypothesis. Another one is that some kind of interrupt-handler (??) code clears $ecx without properly saving/restoring it, and does so in recording but not in replay.
> $ch is 0x45 instead of 0x69
Oops, that's because I dumped something in decimal that was supposed to be hex. 69 == 0x45.
Another run:

```
-> ecx 0x5bcdd600 != 0x0 (replaying vs. recorded)
[EMERGENCY] (/home/cjones/rr/rr/src/share/util.cc:608:compare_register_files: errno: None)
(task 23262 (rec:23210) at trace line 131626)
 -> Assertion `!bail_error || !err' failed to hold: Fatal register mismatch (rbc/rec:80/79; irc/rec:375/372)
eax:0xffffffda ebx:0x1 ecx:0x0 edx:0x1 esi:0x0 edi:0x0 ebp:0x0

str=0x5fde5c00 "972 INFO TEST-PASS | /tests/dom/tests/mochitest/ajax/jquery/test_jQuery.html | fx module: JS Auto to 100 - Final height should be 100: 100px: 100px\n"
strlen(str) = 148
148 / 16 = 9 rem 1 == second unrolled loop

Replayed $ecx = 0x 5b cd d6 00  vs.  recorded $ecx = 0x 00 00 00 00

          0  1  2  3  4  5  6  7
5fde5c00 00 d6 cd 5b 01 00 00 00
```
So nothing new added.
Another possibility is that the word just past the end of the string is being written racily somehow.
> Another possibility is that the word just past the end of the string is being written racily somehow.
I should add, this looks more likely because I'm now logging $edx, and the divergent ecx/edx value pairs seem to be consistent with taking different L(1) loop exits across record/replay. (That's also a much more plausible hypothesis than a CPU bug!) I'm testing by instrumenting nsGlobalWindow::dump().
> Another possibility is that the word just past the end of the string is being written racily somehow.
It's this. The symptom is similar to other leads that have gone cold in the past: some memory cell is cleared in recording but isn't cleared during replay. This is pretty hard to track down because we don't know a priori which cell should be cleared during recording. I've got some instrumentation that's throwing off leads, but nothing conclusive yet.
It strikes me that if rr could record rr record, then we could sort this bug pretty easily. That's been a tongue-in-cheek goal, but maybe we should take it a little more seriously.
> It strikes me that if rr could record rr record, then we could sort this bug pretty easily.
Actually, on second thought, I don't have any reason to expect that the same divergence behavior wouldn't also happen in the "outer" recording. So nm.
The memory contents diverge before the call to strlen, right?
Are there any patterns governing the alignment of the diverging memory range? How about its length? Has anything been allocated that overlaps the diverging memory range? (I assume that in these dump() cases the diverging memory range is in the JS heap?)
> The memory contents diverge before the call to strlen, right?
Right. My instrumentation is:

```c
// Record the string bytes rounded up to a multiple of 16.
char* cEnd = cstr;
while (*cEnd) ++cEnd;
syscall(4/*SYS_write*/, -42, cstr, ((cEnd - cstr) + 15) & ~15);
```

inserted just before the call to fputs() here.
> Are there any patterns governing the alignment of the diverging memory range? How about its length?
The key to the length is that the \0 byte must be at index i where i % 4 != 3. That means the last word-load will read bytes past the end of the string. Beyond that, the lengths have been around 140-200 bytes. (See below about alignment.)
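As a worked example using the first divergence above: that string's strlen is 153, so the \0 is at index 153 and 153 % 4 == 1; the final aligned word load therefore covers indices 152-155 and reads two uninitialized bytes past the terminator.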
> Has anything been allocated that overlaps the diverging memory range? (I assume that in these dump() cases the diverging memory range is in the JS heap?)
The string is allocated by ToNewUTF8String(), which (through a torturous route) ends up allocating from jemalloc. IIRC jemalloc hands out 16-byte-aligned segments (or maybe it was 8). Either way, the "garbage" memory that strlen is touching should be within the jemalloc block.
Somewhat like #416, except that:

- Only reproduces 1/10 or 1/20 runs or so
- Only reproduces with syscall buffering enabled

Smells to me like a memory divergence, but I don't have any specific guesses yet. Reproduces with layout/reftests/transform/reftest.list, but doesn't seem to reproduce with the (quick) network reftests used for #416.