rr-debugger / rr

Record and Replay Framework
http://rr-project.org/

rr broken in KVM guests #1952

Closed: rocallahan closed this issue 7 years ago

rocallahan commented 7 years ago

We get ticks divergence in various async-signal testcases. rr revision 73ee8002b88b236d95a3001ad2e3a84cdd6a9e85 is the first bad changeset.

rocallahan commented 7 years ago

I actually tested that patch in KVM. Evidently not enough.

rocallahan commented 7 years ago

Clearly tests are running fine on ci.rr-project.org, so there are probably host or guest kernel version dependencies involved.

rocallahan commented 7 years ago

ci.rr-project.org is currently running 4.4.0-53-generic #74-Ubuntu. My KVM guest is running 4.4.0-59-generic #80-Ubuntu.

rocallahan commented 7 years ago

I believe ci.rr-project.org is a DigitalOcean droplet, so we don't know what the host kernel is. My current host kernel is 4.8.12-300.fc25.x86_64, though I also see this on 4.8.16-300.fc25.x86_64.

rocallahan commented 7 years ago

If I change PerfCounters::read_ticks to always return interrupt_val then everything passes --- except the hle test, of course.

rocallahan commented 7 years ago

If I record just using interrupt_val and then I replay using measure_val then apparently everything is fine too. So somehow, during recording only, measure_val drops events. And not just a few events; in the alarm test I see stuff like expected 4180530, got 4240512.

Keno commented 7 years ago

I see this kind of thing when there are too many performance counters scheduled (it happens quite a bit on KNL because it has fewer counters).

rocallahan commented 7 years ago

What do you think happens then? And why during recording only?

Keno commented 7 years ago

If the kernel can't schedule a counter because all counters for that CPU are taken, it just won't count. I also think some recent-ish kernel change affected this, because I now see this happen a lot more often. (On KNL I can work around it by not starting the useless counter anymore, which makes me think that something else is trying to use that fixed-function counter on occasion, kicking the useless counter to a general-purpose counter and one of the tick counters off the PMU.)

rocallahan commented 7 years ago

Hmm.

I don't see how that sort of issue would cause interrupt_val to be always correct during recording (modulo HLE) while measure_val drops events. We put them in the same group.
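
For reference, "same group" here means perf_event_open's group mechanism: the second counter passes the first counter's fd as group_fd, so the kernel is supposed to schedule both onto the PMU together or not at all. A rough sketch of that kind of setup (placeholder configs, not rr's actual code):

#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Open two raw counters in one group on the calling thread. The member
   passes the leader's fd as group_fd, so the kernel schedules both
   counters together (or, if it can't, neither counts). Event configs and
   error handling are placeholders, not what rr really does. */
static int open_counter_pair(unsigned long long leader_config,
                             unsigned long long member_config,
                             int *leader_fd, int *member_fd) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_RAW;
    attr.size = sizeof(attr);
    attr.exclude_kernel = 1;
    attr.exclude_guest = 1;

    attr.config = leader_config;
    *leader_fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
    if (*leader_fd < 0)
        return -1;

    attr.config = member_config;
    *member_fd = syscall(__NR_perf_event_open, &attr, 0, -1, *leader_fd, 0);
    return *member_fd < 0 ? -1 : 0;
}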

Keno commented 7 years ago

Does the host put them in a group though? What happens if there's enough counters from the perspective of the guest, but not enough counters from the perspective of the host?

rocallahan commented 7 years ago

Could be.

I should try switching the order in which we allocate interrupt_val and measure_val.

rocallahan commented 7 years ago

That doesn't do anything.

rocallahan commented 7 years ago

Building a 4.6 host kernel to see if the problem occurs there.

rocallahan commented 7 years ago

I should probably also instrument the PMU code to log the PMU state.

rocallahan commented 7 years ago

Host kernel 4.6 is OK. Host kernel 4.7 is broken. I should probably bisect, though it's really painful since I have to reboot then relaunch the VM etc after each step :-(.

rocallahan commented 7 years ago

Rebuilding the kernel at each step is no picnic either.

rocallahan commented 7 years ago

Hmm no, on 4.6 it worked for a few runs and then stopped working.

Keno commented 7 years ago

> I have to reboot then relaunch the VM etc after each step :-(. Rebuilding the kernel at each step is no picnic either.

kexec + startup scripts to launch the VM and run tests? I can also give you access to a big machine to build the kernel on if that would help.

rocallahan commented 7 years ago

I strongly suspect now that when I wrote the HLE patch I never ran the full tests under KVM so there's no reason to believe this ever worked on my KVM guests. Then the only mystery is why it works on DigitalOcean's KVM guest but not mine.

rocallahan commented 7 years ago

It appears when I first reboot the guest, things work for a while, then it gets into a bad state and from then on things don't work. (Each test is just recording and replaying the alarm test a few times.)

rocallahan commented 7 years ago

It looks like things go south after this appears in the kernel log:

[  166.415366] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 5.089 msecs
[  166.415370] perf interrupt took too long (60205 > 38461), lowering kernel.perf_event_max_sample_rate to 3250
rocallahan commented 7 years ago

Though that may just be a coincidence.

rocallahan commented 7 years ago

echo 0 | sudo tee /proc/sys/kernel/perf_cpu_time_max_percent seems to help reduce the failures.

rocallahan commented 7 years ago

The chew_cpu test is best here, presumably because it runs for a pretty long time giving the system a lot of time to drop events.

rocallahan commented 7 years ago

There's a printk_once with "kvm_pmu: event creation failed ...". Making that a plain printk shows more useful data. I see a lot of:

[  421.956816] kvm_pmu: event creation failed -95 config=2000001c4 sample_period=2147468743
[  421.956850] kvm_pmu: event creation failed -95 config=2000001c4 sample_period=2147468743
[  421.957024] kvm_pmu: event creation failed -95 config=2000001c4 sample_period=2147468743

etc

rocallahan commented 7 years ago

-95 is EOPNOTSUPP. Unsurprisingly this is our IN_TXCP counter.

rocallahan commented 7 years ago

What's happening here is that pmc_reprogram_counter is trying to create the IN_TXCP counter with a sample_period that's not 0 or 0x7fffffff as required by the host in hsw_hw_config. That sample period is computed as (-pmc->counter) & pmc_bitmask(pmc).

The interesting thing is that this case is somewhat anomalous; we can record lots of things without that condition ever being hit. I think pmc_bitmask is returning 0x7fffffff, so the counter value must usually be 0 or 1 for this to work.
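
To make that arithmetic concrete, here's a small standalone sketch (assuming pmc_bitmask returns 0x7fffffff) of which starting counter values produce a period that hsw_hw_config will accept:

#include <stdint.h>
#include <stdio.h>

/* Sketch of the computation described above: KVM derives sample_period as
   (-counter) & mask, and the host's hsw_hw_config only accepts 0 or
   0x7fffffff for IN_TXCP events. The last counter value is taken from the
   "sample_period=2147468743" log lines above. */
int main(void) {
    const uint64_t mask = 0x7fffffffULL;
    int64_t counters[] = { 0, 1, -2147468743LL };
    for (unsigned i = 0; i < sizeof(counters) / sizeof(counters[0]); ++i) {
        uint64_t period = (uint64_t)(-counters[i]) & mask;
        int ok = period == 0 || period == 0x7fffffffULL;
        printf("counter=%lld -> sample_period=%llu (%s)\n",
               (long long)counters[i], (unsigned long long)period,
               ok ? "accepted" : "rejected: EOPNOTSUPP");
    }
    return 0;
}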

rocallahan commented 7 years ago

When the bug triggers, the counter values are somewhat variable:

[  134.435203] @@@ Rejecting checkpointed counter with sample period 2147437227
kvm_pmu: event creation failed -95 config=2000001c4 sample_period=2147437227 intr=1 pmc->counter=-2147437227
[  137.781434] @@@ Rejecting checkpointed counter with sample period 2147437972
kvm_pmu: event creation failed -95 config=2000001c4 sample_period=2147437972 intr=1 pmc->counter=-2147437972
[  138.025408] @@@ Rejecting checkpointed counter with sample period 2147483645
kvm_pmu: event creation failed -95 config=2000001c4 sample_period=2147483645 intr=1 pmc->counter=281472829227011
[  141.748372] @@@ Rejecting checkpointed counter with sample period 2147437995
kvm_pmu: event creation failed -95 config=2000001c4 sample_period=2147437995 intr=1 pmc->counter=-2147437995
[  145.071246] @@@ Rejecting checkpointed counter with sample period 2147437972
kvm_pmu: event creation failed -95 config=2000001c4 sample_period=2147437972 intr=1 pmc->counter=-2147437972
[  149.301701] @@@ Rejecting checkpointed counter with sample period 2147475889
kvm_pmu: event creation failed -95 config=2000001c4 sample_period=2147475889 intr=1 pmc->counter=-2147475889
[  149.301738] @@@ Rejecting checkpointed counter with sample period 2147475889
kvm_pmu: event creation failed -95 config=2000001c4 sample_period=2147475889 intr=1 pmc->counter=-281477124186545
[  149.301901] @@@ Rejecting checkpointed counter with sample period 2147475889
kvm_pmu: event creation failed -95 config=2000001c4 sample_period=2147475889 intr=1 pmc->counter=-562952100897201
rocallahan commented 7 years ago

I have a theory that this could occur if a tracee task gets scheduled off a CPU and then back onto that CPU (or another CPU) without rr resetting the counter in between, in which case KVM will try to set up a counter with some ticks already counted. That might happen infrequently enough to explain this. When a tracee task gets descheduled it's normally because it trapped to rr for whatever reason, in which case rr will reset the counter before that task gets scheduled again.

rocallahan commented 7 years ago

I haven't managed to disprove this theory yet.

rocallahan commented 7 years ago

Assuming this theory is correct, I'm having a hard time thinking of a workaround. I'm also having a hard time understanding why this works OK on ci.rr-project.org.

rocallahan commented 7 years ago

One possible workaround would be to try pinning our IN_TXCP counter. If pinning fails we should be able to detect that and ignore it ... and hope HLE didn't get used in that interval.
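
For the record, the detection part should come from documented perf_event_open(2) semantics: a pinned group leader that can't stay on the PMU goes into an error state, and subsequent reads return 0 bytes. A hypothetical sketch, not rr's actual code:

#include <unistd.h>

/* "fd" is assumed to be a counter opened with attr.pinned = 1 on the group
   leader. If the kernel couldn't keep the pinned group on the PMU, the
   event is in an error state and read() returns 0, so a short read means
   "ignore this measurement (and hope HLE wasn't used in the interval)". */
static int read_pinned_counter(int fd, long long *value) {
    ssize_t n = read(fd, value, sizeof(*value));
    return n == (ssize_t)sizeof(*value) ? 0 : -1;
}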

rocallahan commented 7 years ago

Also, I think we can/should fix KVM to pass a zero sample_period for checkpointed events (since event creation is just going to fail otherwise). I'm testing that now; it should make rr work.

rocallahan commented 7 years ago

I think I can probably write test code to detect this KVM bug and run it on every rr session. The question then is what to do if we detect it.

I have a feeling that requesting pinning for rr's IN_TXCP counter won't help here because the failure is silent from the point of view of the guest. We just drop all the samples that occur after the tracee gets rescheduled.

rocallahan commented 7 years ago

This patch fixes rr tests for me:

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 06ce377..1cb8046 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -113,12 +113,17 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
        .config = config,
    };

+   attr.sample_period = (-pmc->counter) & pmc_bitmask(pmc);
+
    if (in_tx)
        attr.config |= HSW_IN_TX;
-   if (in_tx_cp)
+   if (in_tx_cp) {
+       /* HSW_IN_TX_CHECKPOINTED is not supported with nonzero
+        * period. Just clear the sample period so at least
+        * allocating the counter doesn't fail. */
+       attr.sample_period = 0;
        attr.config |= HSW_IN_TX_CHECKPOINTED;
-
-   attr.sample_period = (-pmc->counter) & pmc_bitmask(pmc);
+   }

    event = perf_event_create_kernel_counter(&attr, -1, current,
                         intr ? kvm_perf_overflow_intr :
rocallahan commented 7 years ago

The reason this doesn't show up on ci.rr-project.org is that it is currently running on Sandy Bridge hardware, so our attempts to use IN_TXCP "just work"; the EOPNOTSUPP path isn't even reached.

rocallahan commented 7 years ago

I'll try adding a desched counter that we can just monitor from rr. If we detect that the tracee was descheduled more than once and we're running on a system where we can see the KVM PMU bug then just for that perfcounter measurement we can ignore the IN_TXCP counter and use the regular counter instead, and hope we didn't do any HLE in that interval.
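
One way to get such a desched count is a per-task software perf event; a minimal sketch of the idea, assuming PERF_COUNT_SW_CONTEXT_SWITCHES is an adequate proxy for "the tracee got descheduled" (which may not be exactly what rr ends up using):

#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Sketch only: open a context-switch counter attached to "tid". Reading it
   before and after a ticks measurement tells us how many times the task
   was scheduled out, i.e. when the IN_TXCP value can't be trusted under
   the buggy KVM. */
static int open_desched_counter(pid_t tid) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_SOFTWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_SW_CONTEXT_SWITCHES;
    return syscall(__NR_perf_event_open, &attr, tid, -1, -1, 0);
}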

rocallahan commented 7 years ago

I've also submitted a patch to upstream KVM similar to what I mentioned above.

rocallahan commented 7 years ago

KVM patch submission: http://marc.info/?l=linux-kernel&m=148582794808419&w=2

rocallahan commented 7 years ago

I'm having difficulty getting test code in rr to produce the KVM bug reliably :-(

rocallahan commented 7 years ago

OK, this code reproduces the bug easily. Setting a sample_period of 0 is the key. Then we just need to disable and reenable the counter to trigger the EOPNOTSUPP and subsequent failure to count events.

#include <linux/perf_event.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void) {
    struct perf_event_attr attr;
    int fd;
    long long count;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_RAW;
    attr.size = sizeof(attr);
    attr.config = 0x2005101c4; // conditional branches retired IN_TXCP
    attr.sample_period = 0;
    attr.exclude_kernel = 1;
    attr.exclude_guest = 1;
    fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) {
        perror("perf_event_open");
        return 1;
    }
    // Disabling and re-enabling forces the counter to be reprogrammed,
    // which is what triggers the EOPNOTSUPP in the host.
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    for (int i = 0; i < 500; ++i) {
        putchar('.');
    }
    read(fd, &count, sizeof(count));
    printf("\nConditional branches: %lld\n%s\n", count,
           count < 500 ? "FAILED" : "PASSED");
    return 0;
}
rocallahan commented 7 years ago

So I can detect the bug just fine now, but my workaround doesn't work reliably. Sometimes we still drop events and fail to detect that, i.e. we didn't detect more than one desched.

rocallahan commented 7 years ago

I guess relying on counting descheds is inherently fragile, even if I could get it to work, because it relies on the assumption that the guest OS never sets a counter register except when absolutely necessary, and there's no reason to believe that.

rocallahan commented 7 years ago

I'm thinking of just dropping support for the KVM-bug + HLE combination: if your application uses HLE, you can't use rr on it in a KVM guest whose host has the KVM IN_TXCP bug. We'll disable the rr HLE test when running in a buggy KVM guest. We'll have rr detect the KVM bug, and when it's detected we stop using the IN_TXCP counter and instead set up an HLE_RETIRED.START counter so we can detect HLE usage. When we detect HLE usage, we warn the user and abort unless --force-things is specified.

rocallahan commented 7 years ago

Now I'm getting spurious HLE events :-(. Could be related to https://github.com/mozilla/rr/issues/1961

rocallahan commented 7 years ago

It doesn't matter whether I count starts or aborts, I still get spurious events.

rocallahan commented 7 years ago

FTR I only see spurious HLE events on certain tests, e.g. async_signal_syscalls. In that test they're always associated with a SIGSTKFLT being received. The number of events reported is nondeterministic. Could be a hardware issue, could be something in the kernel I suppose.

rocallahan commented 7 years ago

OK another idea is to count IN_TX conditional branches and warn/bail if that rises above zero.
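
The reproducer earlier in this thread needs only a one-bit change in the raw config for that: set the IN_TX qualifier (bit 32, matching the kernel's HSW_IN_TX) instead of IN_TXCP (bit 33). A rough sketch, treating the exact config value as an assumption:

#include <linux/perf_event.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Same shape as the reproducer above, but counting conditional branches
   retired inside a transaction (IN_TX, bit 32 of the raw config) rather
   than IN_TXCP (bit 33). The count should stay at zero unless HLE/RTM was
   actually used, so rr could warn/bail when it rises above zero. */
int main(void) {
    struct perf_event_attr attr;
    long long count = 0;
    int fd;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_RAW;
    attr.size = sizeof(attr);
    attr.config = 0x1005101c4; // conditional branches retired, IN_TX
    attr.exclude_kernel = 1;
    attr.exclude_guest = 1;
    fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) {
        perror("perf_event_open");
        return 1;
    }
    read(fd, &count, sizeof(count));
    printf("IN_TX conditional branches: %lld\n", count);
    return 0;
}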

rocallahan commented 7 years ago

Yes, this looks like a winner!