rr-debugger / rr

Record and Replay Framework
http://rr-project.org/

CPUID-triggered rcb undercounts are not reliably handled #1262

Open joneschrisg opened 10 years ago

joneschrisg commented 10 years ago

... and then replay diverges :/. I've never seen this warning printed on my Ivy Bridge machine before, and IIRC this was a Haswell bug.

joneschrisg commented 10 years ago

What seems to cause the problem is when I restart with a new event target (e.g. r NNN).

joneschrisg commented 10 years ago

I can reproduce a variety of bad behaviors with explicitly-targeted restarts, not always this one.

joneschrisg commented 10 years ago

This turns out to be a pretty dumb bug, but I need to think for a bit about how to write a test ...

joneschrisg commented 10 years ago

So the dumb bug is that ReplaySession doesn't reset the EnvironmentBugDetector state when a session is fully restarted (i.e. rewound to beginning-of-trace). That's easy to fix, but I think the better solution is to get rid of ReplaySession::restart().
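
Roughly the shape of the easy fix, as a sketch (these are made-up stand-ins, not the real ReplaySession/EnvironmentBugDetector code):

```cpp
#include <cstdint>
#include <memory>

// Hypothetical sketch, not the real rr classes: the point is just that a full
// restart has to throw away the per-session bug-detection state along with
// everything else, otherwise a detection from the previous run leaks into the
// new one.
struct EnvironmentBugDetector {
  // Stands in for rr's heuristic that flips this flag when replay's rcb counts
  // look like the CPUID undercount bug; the real detection logic is elided.
  bool cpuid_bug_detected = false;
  void notify_suspected_undercount() { cpuid_bug_detected = true; }
};

class ReplaySession {
public:
  // Rewind to the beginning of the trace.  Re-creating the detector (rather
  // than keeping the old instance) is the easy fix; dropping restart() and
  // building a fresh session is the better one.
  void restart_from_beginning() {
    bug_detector_ = std::make_unique<EnvironmentBugDetector>();
    // ... also reset trace position, tasks, etc. ...
  }

private:
  std::unique_ptr<EnvironmentBugDetector> bug_detector_ =
      std::make_unique<EnvironmentBugDetector>();
};
```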

However, there's a much worse bug, I think. I can easily make the spurious bug-detection trigger with a simple test, but when it does, there's no overshoot. Really, there's no reason accidentally turning on the workaround should break anything, since it's just supposed to add extra slack.

But I think there's a problem here:

https://github.com/mozilla/rr/blob/2b3ede9d8cfc01708151d042c5d15e9410959fcc/src/replayer.cc#L1015

If in the interval [target_rcb - 6, target_rcb) we end up seeing the same register set as at the target, then is_same_execution_point() will spuriously return true. Then at the next target or syscall, we'll see an rbc overshot by 6 or so. (I'm seeing an overshoot by 8, which I think I understand.)
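
To make the hazard concrete, here's a sketch of the shape of that check; the Registers struct and the signature are illustrative, not the actual replayer code:

```cpp
#include <cstdint>

// Illustrative only: a register snapshot compared field-by-field.
struct Registers {
  uint64_t rip, rsp, rax, rbx, rcx, rdx;
  bool operator==(const Registers& o) const {
    return rip == o.rip && rsp == o.rsp && rax == o.rax &&
           rbx == o.rbx && rcx == o.rcx && rdx == o.rdx;
  }
};

constexpr int64_t RCB_SLACK = 6;  // extra counts tolerated for the CPUID bug

// Sketch of the spurious-match hazard: within [target_rcb - RCB_SLACK, target_rcb]
// the *first* matching register set wins, even if the true target is still
// several conditional branches ahead.
bool is_same_execution_point(int64_t current_rcb, const Registers& current_regs,
                             int64_t target_rcb, const Registers& target_regs) {
  if (current_rcb < target_rcb - RCB_SLACK) return false;  // not close enough yet
  return current_regs == target_regs;                      // may match too early
}
```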

I don't have any ideas for fixing this :/. (Nothing that wouldn't totally destroy perf, anyway.) @rocallahan how about you?

joneschrisg commented 10 years ago

Assume the following happens

Am I missing a fixup step or otherwise thinking about this wrongly?

If this is a bug, I think I do have a potential workaround, but urgh it's complex!

rocallahan commented 10 years ago

If in the interval [target_rcb - 6, target_rcb) we end up seeing the same register set as at the target, then is_same_execution_point() will spuriously return true.

The assumption behind rcb slack is that this never happens. We're already assuming that across all instruction executions whose rcb is equal to target_rcb, GP registers are unique; rcb slack just extends that window to a wider range of rcbs. As you pointed out on IRC, the consequences of a spurious stop are worse with rcb slack though.

If you found a Firefox divergence with rcb slack enabled, then we probably should fix it; I found that on my Ivy Bridge laptop, I can get the CPUID bug in VMWare.

I look forward to hearing what you have in mind :-). One obvious but complex approach would be, when we see our target register set in the slack window, clone the current state and execute forward in the clone to see if there is also a matching register set with an exact match for the rcb value, and if there is, disable slack for this stop.
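
Roughly, as a sketch; the CloneView scaffolding is entirely made up, and the real clone would be a fork-based checkpoint of the replay session:

```cpp
#include <cstdint>
#include <functional>

// Hypothetical scaffolding, not rr's API.  The idea: when a register match
// shows up in the slack window, run a disposable clone forward and see whether
// an exact-rcb match also exists; if it does, the early stop was spurious and
// slack should be disabled for this target.
struct CloneView {
  std::function<int64_t()> rcb;             // current retired-conditional-branch count
  std::function<bool()> regs_match_target;  // registers equal the recorded target set?
  std::function<bool()> step;               // advance one branch; false at trace end
};

bool exact_match_exists_later(const CloneView& clone, int64_t target_rcb) {
  while (clone.rcb() <= target_rcb) {
    if (clone.rcb() == target_rcb && clone.regs_match_target()) return true;
    if (!clone.step()) break;
  }
  return false;
}
```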

joneschrisg commented 10 years ago

One obvious but complex approach would be, when we see our target register set in the slack window, clone the current state and execute forward in the clone to see if there is also a matching register set with an exact match for the rcb value, and if there is, disable slack for this stop.

That was close to what I had in mind, but I'm scared of potentially checkpointing a lot. I think we could get pretty much the same effect by saving the undershot rcbs and regset on Task, and then at the next continue request, gingerly step up until the end of the undershot region. If we hit the target regset at the end of the undershot region, then continue on normally; otherwise count those few rcbs as part of the next event's execution. (Or if we hit the next event just continue normally.) But yeah, that's really really complex ...

joneschrisg commented 10 years ago

I had suggested making the assertions fuzzier, but I realized that won't work because tracees can genuinely diverge under the right unlucky conditions.

joneschrisg commented 10 years ago

I usually say this semi-facetiously, but of course another potential solution is switching to irc (the instructions-retired counter).

rocallahan commented 10 years ago

I'm scared of potentially checkpointing a lot.

I'm not. I was able to record and replay some Firefox runs with the current code with rcb slack enabled, which suggests we don't hit these spurious stops very often.

If you think instructions-retired can work, feel free to put some time into it :-). It would be great.

joneschrisg commented 10 years ago

I'm not. I was able to record and replay some Firefox runs with the current code with rcb slack enabled, which suggests we don't hit these spurious stops very often.

Well there are two issues. We spuriously stop in the slack region fairly frequently; if I lower the slack to 2 rcbs, we spuriously stop there ~30 times in a ~200,000 event trace in the recording I'm debugging. We'd have to checkpoint at each of those stops. Running JS code seems to be most vulnerable to this.

The other question is how often we hit the unlucky combo that trips the divergence assertion. Dunno about that, obviously rarer but didn't take me too long to hit.

rocallahan commented 10 years ago

200K events is a lot of execution. 30 clones in that interval might not be too bad.

And what if we lower it to 1 rcb worth of slack? I suspect that's probably going to be enough.

I think we could get pretty much the same effect by saving the undershot rcbs and regset on Task, and then at the next continue request, gingerly step up until the end of the undershot region. If we hit target regset at the end of undershot region, then continue on normally, otherwise count those few rcbs as part of the next event's execution. (Or if we hit the next event just continue normally.) But yeah, that's really really complex ...

This might not be too complex if we hide it completely inside Task and only tackle common cases. Basically, record the undershoot amount in Task. If we get a continue request with no stepi or anything else tricky (large-ish rcb threshold would be OK), first try advancing to the target register state after 'undershoot' more rcbs. If we don't hit it, just continue normally. If we do hit it, subtract the undershoot amount from Task's rcb count (to undo the correction we applied in advance_to) and continue normally. If we hit a user breakpoint or anything tricky like that, give up on trying to check/recover from the slack.
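
Something like this, as a sketch with made-up names and the stepping machinery stubbed out (not the real Task):

```cpp
#include <cstdint>

// Hypothetical sketch of hiding slack recovery inside Task.  On an undershot
// stop we remember how far short we were; the next plain continue first tries
// to consume exactly that many extra conditional branches and re-check the
// recorded register target before falling back to a normal continue.
class Task {
public:
  void note_undershoot(int64_t rcbs_short) { undershoot_ = rcbs_short; }

  void continue_execution(bool has_breakpoints_or_stepi) {
    if (undershoot_ > 0 && !has_breakpoints_or_stepi) {
      bool hit = advance_rcbs_and_check_target(undershoot_);
      if (hit) {
        // We really were short: undo the correction advance_to applied so the
        // running rcb total stays consistent, then continue normally.
        rcb_count_ -= undershoot_;
      }
      // If we didn't hit the target, those branches just belong to the next
      // event's execution; either way, continue normally from here.
      undershoot_ = 0;
    }
    resume_normally();
  }

private:
  // Stubs standing in for the real single-stepping machinery.
  bool advance_rcbs_and_check_target(int64_t /*rcbs*/) { return false; }
  void resume_normally() {}
  int64_t undershoot_ = 0;
  int64_t rcb_count_ = 0;
};
```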

rocallahan commented 10 years ago

But maybe it's better to investigate retired-instructions than to sink more time into more workarounds.

joneschrisg commented 10 years ago

The complexity I fear is when the next event is a syscallbuf reset or flush.

We can go at this from both directions. I can revive the irc branch, and you or I can try the workaround in parallel.

joneschrisg commented 10 years ago

200K events is a lot of execution. 30 clones in that interval might not be too bad.

It's 47s. 30 clones will probably take around 1.5s based on my earlier measurements, so it's not the end of the world but potentially noticeable. Also, the spurious stops aren't uniformly randomly distributed; they seem to cluster, possibly near the JS VM.

And what if we lower it to 1 rcb worth of slack? I suspect that's probably going to be enough.

That reduces the spurious stops from 30 to 1 compared to a slack of 2 rcbs. It would be nice to do some more QA on a Haswell machine with that slack, though ...

joneschrisg commented 10 years ago

And what if we lower it to 1 rcb worth of slack? I suspect that's probably going to be enough.

That reduces the spurious stops from 30 to 1 compared to a slack of 2 rcbs. It would be nice to do some more QA on a Haswell machine with that slack, though ...

Regardless of whether core rr folks are able to QA it, I think it's worth pushing this patch out now. I suspect a slack that's too small to work around the CPUID bug will manifest much more easily than the unfortunate-series-of-SCHED problems we see with a too-large slack.

@rocallahan does that match your intuition?

rocallahan commented 10 years ago

It would be nice to do some more QA on a Haswell machine with that slack, though ...

As I said on IRC, I can reproduce the bug on my Ivy Bridge laptop. So I can cut it to 1 and see if that works.

However, we would still need the clone-and-check. Do you think that's easy?

joneschrisg commented 10 years ago

As I said on IRC, I can reproduce the bug on my Ivy Bridge laptop. So I can cut it to 1 and see if that works.

I remember you mentioning that, but for some reason I interpreted it as an existence-proof type of statement. Sure, that sounds good. I have the trivial patch with a medium-length comment, but it would be better for you to write that.

However, we would still need the clone-and-check. Do you think that's easy?

Dunno. I fear it'll get complicated. I think it's worth spending a small amount of time to know where we're at with irc first, to make a good decision.

joneschrisg commented 10 years ago

That was close to what I had in mind, but I'm scared of potentially checkpointing a lot. I think we could get pretty much the same effect by saving the undershot rcbs and regset on Task, and then at the next continue request, gingerly step up until the end of the undershot region.

Ignore this; the problem isn't performance per se, but if the stop actually was spurious and we don't advance the tracee to its event target, then replay can diverge. (Same problem as with indeterminate targets with slack=0 rcb, but expanded out to slack=k rcb.)

rocallahan commented 10 years ago

I discovered a few Ubuntu-specific regressions and pushed fixes for them to master. With those fixed, tests pass. However, recording and replaying Firefox does not work. We diverge, presumably due to the issue discussed here.

Furthermore, reducing the rcb slack from 6 to 2 causes tests to start failing :-(.

joneschrisg commented 10 years ago

Did some quick testing with the "more precise" rcb_cntr interrupts in #610, and results aren't helpful. We currently use the "arbitrary skid region" event (precise=0). With the "constant skid region" event (precise=1), we overshoot on the dead-simple chew_cpu test in a way reminiscent of the CPUID bug, but the bug isn't detected. If I turn on the workaround for the CPUID bug, then we still overshoot, albeit differently, but just like in this bug.

Interestingly, if I also turn off EV_SCHED coalescing, then replay succeeds pretty reliably. I don't have a guess as to why that might be.

But no-SCHED-coalescing and precise=1 recording might be a workaround incantation for CPUID-bug systems.

rocallahan commented 10 years ago

As an experiment I tried using rcb slack when checking rcb at syscalls and other events, but not using it in advance_to. This approach passes rr tests (I think because CPUIDs are quickly followed by system calls in those tests). It does not work with Firefox; we diverge during JPEG decoding, probably because jpeg_simd_cpu_support triggers the CPUID bug and there are no system calls anywhere near it.

rocallahan commented 10 years ago

But no-SCHED-coalescing and precise=1 recording might be a workaround incantation for CPUID-bug systems.

I tried changing to BR_INST_RETIRED:COND:u:precise=1 but I get FATAL /home/roc/rr/rr/src/hpc.cc:181:start_counter() errno: 95 'Operation not supported'.

rocallahan commented 10 years ago

I've prototyped the simple approach of making a clone and running it forwards to see if we have a better stopping point in the future. It does seem to work, at least in this one Firefox run that I can't replay otherwise. It is, of course, rather slow: 213 clones with 6-rcb slack, in this 223605-event trace (starting Firefox, loading nzherald.co.nz and scrolling around a bit).

rocallahan commented 10 years ago

Reducing rcb slack to 2 still lets replay work and requires only 68 clones. Slack 1 requires 21 clones. (I'm surprised it doesn't diverge though.) Slack 0 diverges.

rocallahan commented 10 years ago

I do wonder why the CPUID rbc undercount is so reliably different between recording and replay. You would think that if there's a counting bug somewhere that it would affect recording and replay equally unless replay happens to stop execution near a CPUID instruction, which surely doesn't happen that often.

rocallahan commented 10 years ago

To put this in perspective, in this replay I'm looking at, slack=0 diverges during JPEG decoding. jpeg_simd_cpu_support is called 100160 rcbs before the next rr event (SCHED), and 2846 rcbs after the previous rr event (a SYSCALLBUF_FLUSH of clock_gettime). What could cause some hardware or hypervisor bug to trigger only on recording, not replay, in the middle of such a large interval where we're executing exactly the same code? And to do this in a way that seems quite reliable?

rocallahan commented 10 years ago

One thing that's different between recording and replay is that during recording the rbc period is 0. So I tried adding reset_hpc(this, 100000000) to Task::resume_execution so recording always runs with a nonzero rbc period. With that, and 0 slack, Firefox recording and replay works; at least two runs did, whereas before I had a 100% record of reproducing divergence.
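
For reference, a sketch of the experiment; reset_hpc() here is a stub and the names are approximate, not the actual code:

```cpp
#include <cstdint>

// Sketch: make sure the branch counter always runs with a nonzero interrupt
// period during recording, so recording exercises the PMU the same way replay
// does.  reset_hpc() is a stub standing in for rr's helper that reprograms the
// task's perf counter.
constexpr int64_t MAX_RBC_PERIOD = 100000000;  // effectively "no reschedule interrupt"

struct Task;
static void reset_hpc(Task* /*t*/, int64_t /*period*/) {
  // Stub: the real helper reprograms the counter fd with the given period.
}

void resume_execution(Task* t, int64_t desired_resched_period) {
  // Never program a zero period: that both disables the interrupt and, per the
  // next comment, can clobber a previously requested reschedule value.
  int64_t period =
      desired_resched_period > 0 ? desired_resched_period : MAX_RBC_PERIOD;
  reset_hpc(t, period);
  // ... then PTRACE_CONT / PTRACE_SYSCALL the tracee ...
}
```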

rocallahan commented 10 years ago

Note that normally during recording we do have a nonzero rbc period, but the rbc_count() at the start of Task::resume_execution sometimes kicks in (when the hpc count has not been read since the last event, basically), and in that case it calls reset_hpc with a zero period, which is wrong because it wipes out our desired reschedule value. But that's what happens.

rocallahan commented 10 years ago

PR #1268 fixes the issues with reset_hpc()s clobbering each other's rbc periods. It also forces the rbc counter to always have a nonzero period.

With that patch, plus a patch to set slack to 0, recording and replaying Firefox with my simple load-a-page-and-scroll test works all the times I've tried it so far. Most tests pass, except

rocallahan commented 10 years ago

I think we should take #1268 because it improves the code, regardless of what we do in this bug.

rocallahan commented 10 years ago

If we can prove that counting rbcs without an imposed period avoids the CPUID undercount bug, then another thing to look at is whether we can program two counters to both count rbcs and only limit one of them, and if the unlimited one gets the correct value. If we got that lucky, then we could totally solve the problem.
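
A sketch of what opening the second counter could look like via perf_event_open; PERF_COUNT_HW_BRANCH_INSTRUCTIONS is just a placeholder for the raw retired-conditional-branches event we actually program, and whether a free-running counter escapes the bug is exactly the open question:

```cpp
// Sketch: open two counters on the same tracee, one with a sample_period that
// delivers the interrupt, one free-running.  If the free-running counter were
// immune to the undercount, its value could serve as ground truth.
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
#include <cstdint>
#include <cstring>

static int open_branch_counter(pid_t tid, uint64_t sample_period) {
  perf_event_attr attr;
  memset(&attr, 0, sizeof(attr));
  attr.size = sizeof(attr);
  attr.type = PERF_TYPE_HARDWARE;
  attr.config = PERF_COUNT_HW_BRANCH_INSTRUCTIONS;  // placeholder event
  attr.exclude_kernel = 1;
  attr.exclude_hv = 1;
  attr.sample_period = sample_period;  // 0 => free-running, no interrupts
  return syscall(__NR_perf_event_open, &attr, tid, -1 /*any cpu*/,
                 -1 /*no group*/, 0);
}

// Usage sketch:
//   int interrupting = open_branch_counter(tid, period);  // drives reschedules
//   int freerunning  = open_branch_counter(tid, 0);       // read() for the "true" count
//   uint64_t count;
//   read(freerunning, &count, sizeof(count));
```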

joneschrisg commented 10 years ago

That sounds feasible to me. I like the idea of limiting the assumptions rr makes about the underlying HW/kernel behavior as much as possible.

There's some overhead in managing the counters, but as long as we only manipulate the interrupting counter during resched we should be fine.

joneschrisg commented 10 years ago

With that patch, plus a patch to set slack to 0, recording and replaying Firefox with my simple load-a-page-and-scroll test works all the times I've tried it so far. Most tests pass, except

If the slack is set to 1, do those tests still fail?

rocallahan commented 10 years ago

If the slack is set to 1, do those tests still fail?

Yes. At least 2 would be needed, maybe more.

rocallahan commented 10 years ago

Removing all rcb limits in the recorder does not make the bug go away. So that's weird.

rocallahan commented 10 years ago

I've run a variety of Firefox workloads with latest master and slack 0: loading and scrolling nzherald.co.nz, loading tbpl, loading GMail, and running the first 800 SVG reftests (at which point my VM ran out of disk space. Need compression!). They all worked. The reftest workload fails with slack 6, but seems to run OK with slack 1.

Rather than try to pick a magic slack value, I'd rather set the default to 0, add an rr option to opt into particular slack values, and annotate specific tests with that option as needed. And then release 2.0.

rocallahan commented 10 years ago

@cgjones how does that sound?

joneschrisg commented 10 years ago

I've run a variety of Firefox workloads with latest master and slack 0

Somewhat as an aside, I'd like to make a common checklist we can run where we agree things work "well enough". I started one on this wiki page. It doesn't include any "live" FF workloads; that would be good to add.

In particular, I like to run the media, RTC, remote browser, and worker tests because they all have pretty distinctive execution "styles". In addition to basic layout/script tests, of course.

Rather than try to pick a magic slack value, I'd rather set the default to 0, add an rr option to opt into particular slack values, and annotate specific tests with that option as needed. And then release 2.0.

That makes me slightly uncomfortable. Is your assumption that it'll be extremely rare to hit these divergences in the field, so users won't have to care about the slack? Or do you have some UI in mind to help users know when they should force slack?

Here's a semi-crazy idea: what if we tried to automatically adjust the slack for users? If we detect the cpuid bug, then we could do something like checkpoint every 5000 events. If we hit an overshoot, then we could roll back to the checkpoint and increase the slack.
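
In sketch form (all hypothetical scaffolding, not the real session API):

```cpp
#include <cstdint>

// Hypothetical sketch of auto-adjusting slack: checkpoint periodically, and on
// an rcb overshoot roll back to the last checkpoint and retry with more slack.
// None of these types are rr's real API.
struct Checkpoint {};
struct ReplayDriver {
  virtual ~ReplayDriver() = default;
  virtual Checkpoint take_checkpoint() = 0;
  virtual void restore(const Checkpoint&) = 0;
  // Replay the next `events` events with the given slack; false on overshoot.
  virtual bool replay_events(int events, int64_t slack) = 0;
};

void replay_with_adaptive_slack(ReplayDriver& driver, int total_events) {
  constexpr int kCheckpointInterval = 5000;
  int64_t slack = 0;
  for (int done = 0; done < total_events; done += kCheckpointInterval) {
    Checkpoint cp = driver.take_checkpoint();
    while (!driver.replay_events(kCheckpointInterval, slack)) {
      driver.restore(cp);
      ++slack;  // overshoot detected: retry this span with more slack
    }
  }
}
```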

But on the whole I feel OK about releasing 2.0 from where we are now.

rocallahan commented 10 years ago

Is your assumption that it'll be extremely rare to hit these divergences in the field, so users won't have to care about the slack?

Yes.

My impression so far is that in real Firefox runs slack causes divergence but doesn't cure any, so setting a global slack value isn't going to be useful to users.

One option is to repeat what I tried earlier and allow slack when we're expecting the next event to be a system call (since that's basically harmless), but use 0 slack when locating a stopping point in advance_to, which is where slack is dangerous. I think that would fix tests and not harm real Firefox runs.
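
In sketch form, with purely illustrative names:

```cpp
#include <cstdint>

// Illustrative sketch of the split: tolerate a small rcb shortfall only when
// validating the count at a recorded event (e.g. a syscall), where it's
// basically harmless, but demand an exact count when picking the stopping
// point in advance_to, where slack can select the wrong execution point and
// diverge replay.
enum class RcbCheckContext { AT_RECORDED_EVENT, LOCATING_STOP_IN_ADVANCE_TO };

constexpr int64_t EVENT_CHECK_SLACK = 6;

int64_t allowed_slack(RcbCheckContext ctx) {
  return ctx == RcbCheckContext::AT_RECORDED_EVENT ? EVENT_CHECK_SLACK : 0;
}

bool rcb_counts_match(int64_t recorded, int64_t measured, RcbCheckContext ctx) {
  int64_t shortfall = recorded - measured;  // measured may fall short of the recorded target
  return shortfall >= 0 && shortfall <= allowed_slack(ctx);
}
```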

rocallahan commented 10 years ago

BTW unfortunately I configured my VM with only 20GB of disk which means I can't do long Firefox test runs without reconfiguring everything :-(. Or implementing compression...

joneschrisg commented 10 years ago

One option is to repeat what I tried earlier and allow slack when we're expecting the next event to be a system call (since that's basically harmless), but use 0 slack when locating a stopping point in advance_to, which is where slack is dangerous.

That sounds pretty good. Would that be easy for you to try? (I would but can't repro the divergence symptoms realistically.)

rocallahan commented 10 years ago

Yes, I can try that, but I have higher-priority stuff to work on for the next few days.