I actually tested that patch in KVM. Evidently not enough.
Clearly tests are running fine on ci.rr-project.org, so there are probably host or guest kernel version dependencies involved. ci.rr-project.org is currently running 4.4.0-53-generic #74-Ubuntu. My KVM guest is running 4.4.0-59-generic #80-Ubuntu.
I believe ci.rr-project.org is a DigitalOcean droplet so we don't know what the host kernel is. My current host kernel is 4.8.12-300.fc25.x86_64, though I also see this on 4.8.16-300.fc25.x86_64.
If I change PerfCounters::read_ticks to always return interrupt_val then everything passes --- except the hle test, of course.
If I record just using interrupt_val and then I replay using measure_val then apparently everything is fine too. So somehow, during recording only, measure_val drops events. And not just a few events; in the alarm test I see stuff like expected 4180530, got 4240512.
I see this kind of thing when there are too many performance counters scheduled (it happens quite a bit on KNL because it has fewer counters).
What do you think happens then? And why during recording only?
If the kernel can't schedule a counter because all counters for that CPU are taken, it'll just not count. I also think some recent-ish kernel change affected this, because I now see this happen a lot more often. (On KNL I can work around it by not starting the useless counter anymore, which makes me think that something else is occasionally trying to use that fixed-function counter, kicking the useless counter onto a general-purpose counter and one of the tick counters off the PMU.)
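(For reference, one way to see that from userspace is to ask perf_event_open for the enabled/running times in the read format; if running < enabled, the kernel descheduled or multiplexed the counter and the raw count can't be trusted. This is a generic perf feature, not rr code, and PERF_COUNT_HW_INSTRUCTIONS is just an arbitrary example event.)

#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* If the kernel couldn't keep the counter scheduled, running < enabled. */
int main(void) {
  struct perf_event_attr attr;
  struct { uint64_t value, enabled, running; } data;
  int fd;

  memset(&attr, 0, sizeof(attr));
  attr.type = PERF_TYPE_HARDWARE;
  attr.size = sizeof(attr);
  attr.config = PERF_COUNT_HW_INSTRUCTIONS; /* arbitrary example event */
  attr.read_format =
      PERF_FORMAT_TOTAL_TIME_ENABLED | PERF_FORMAT_TOTAL_TIME_RUNNING;
  attr.exclude_kernel = 1;

  fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
  if (fd < 0) {
    perror("perf_event_open");
    return 1;
  }
  /* ... run the workload being measured ... */
  if (read(fd, &data, sizeof(data)) != (ssize_t)sizeof(data))
    return 1;
  printf("count=%llu enabled=%llu running=%llu%s\n",
         (unsigned long long)data.value,
         (unsigned long long)data.enabled,
         (unsigned long long)data.running,
         data.running < data.enabled ? "  <-- counter was not always on" : "");
  return 0;
}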
Hmm.
I don't see how that sort of issue would cause interrupt_val to be always correct during recording (modulo HLE) while measure_val drops events. We put them in the same group.
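(Roughly like this, i.e. the second counter passes the first one's fd as group_fd, so the kernel is supposed to schedule them onto the PMU as a unit; the raw encodings here are placeholders, not rr's exact configuration.)

#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Open two raw events in one perf group. A group is scheduled all-or-nothing,
   so in theory both members count or neither does. Config values are
   placeholders, not rr's actual encodings. */
int main(void) {
  struct perf_event_attr attr;
  int leader, member;

  memset(&attr, 0, sizeof(attr));
  attr.type = PERF_TYPE_RAW;
  attr.size = sizeof(attr);
  attr.exclude_kernel = 1;

  attr.config = 0x5101c4;                                 /* placeholder "interrupt" counter */
  leader = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);

  attr.config = 0x2005101c4;                              /* placeholder "measure" (IN_TXCP) counter */
  member = syscall(__NR_perf_event_open, &attr, 0, -1, leader, 0);

  /* ... error handling and counter reads omitted ... */
  (void)member;
  return 0;
}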
Does the host put them in a group though? What happens if there's enough counters from the perspective of the guest, but not enough counters from the perspective of the host?
Could be.
I should try switching the order in which we allocate interrupt_val and measure_val.
That doesn't do anything.
Building a 4.6 host kernel to see if the problem occurs there.
I should probably also instrument the PMU code to log the PMU state.
Host kernel 4.6 is OK. Host kernel 4.7 is broken. I should probably bisect, though it's really painful since I have to reboot then relaunch the VM etc after each step :-(.
Rebuilding the kernel at each step is no picnic either.
Hmm no, on 4.6 it worked for a few runs and then stopped working.
> I have to reboot then relaunch the VM etc after each step :-(. Rebuilding the kernel at each step is no picnic either.
kexec + startup scripts to launch the VM and run tests? I can also give you access to a big machine to build the kernel on if that would help.
I strongly suspect now that when I wrote the HLE patch I never ran the full tests under KVM so there's no reason to believe this ever worked on my KVM guests. Then the only mystery is why it works on DigitalOcean's KVM guest but not mine.
It appears when I first reboot the guest, things work for a while, then it gets into a bad state and from then on things don't work. (Each test is just recording and replaying the alarm test a few times.)
It looks like things go south after this appears in the kernel log:
[ 166.415366] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 5.089 msecs
[ 166.415370] perf interrupt took too long (60205 > 38461), lowering kernel.perf_event_max_sample_rate to 3250
Though that may just be a coincidence.
echo 0 | sudo tee /proc/sys/kernel/perf_cpu_time_max_percent seems to help reduce the failures.
The chew_cpu test is best here, presumably because it runs for a pretty long time, giving the system a lot of time to drop events.
There's a printk_once with "kvm_pmu: event creation failed ...". Making that a plain printk shows more useful data. I see a lot of
[ 421.956816] kvm_pmu: event creation failed -95 config=2000001c4 sample_period=2147468743
[ 421.956850] kvm_pmu: event creation failed -95 config=2000001c4 sample_period=2147468743
[ 421.957024] kvm_pmu: event creation failed -95 config=2000001c4 sample_period=2147468743
etc
-95 is EOPNOTSUPP. Unsurprisingly this is our IN_TXCP counter.
What's happening here is that pmc_reprogram_counter is trying to create the IN_TXCP counter with a sample_period that's not 0 or 0x7fffffff as required by the host in hsw_hw_config. That sample period is computed as (-pmc->counter) & pmc_bitmask(pmc).
The interesting thing is that that case is a bit anomalous. We can record lots of things without that condition ever happening. I think pmc_bitmask is returning 0x7fffffff. So the counter value must usually be 0 or 1 for this to work.
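(To make the arithmetic concrete, here's a little sketch assuming, as above, that pmc_bitmask() returns 0x7fffffff for this counter: only counter values of 0 or 1 produce a period that hsw_hw_config will accept for an IN_TXCP event; anything else gives one of the rejected values like the ones in the log below.)

#include <stdint.h>
#include <stdio.h>

/* Mirror of KVM's period computation, assuming pmc_bitmask() == 0x7fffffff
   (a 31-bit guest counter). hsw_hw_config only allows IN_TXCP events with a
   sample_period of 0 or 0x7fffffff, so anything else gets EOPNOTSUPP. */
int main(void) {
  const uint64_t mask = 0x7fffffff;
  int64_t counters[] = { 0, 1, -2147437227LL }; /* last one matches a rejected case below */
  for (int i = 0; i < 3; ++i) {
    uint64_t period = (uint64_t)(-counters[i]) & mask;
    printf("counter=%lld -> sample_period=0x%llx (%s)\n",
           (long long)counters[i], (unsigned long long)period,
           (period == 0 || period == 0x7fffffff) ? "accepted" : "rejected");
  }
  return 0;
}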
When the bug triggers, the counter values are somewhat variable:
[ 134.435203] @@@ Rejecting checkpointed counter with sample period 2147437227
kvm_pmu: event creation failed -95 config=2000001c4 sample_period=2147437227 intr=1 pmc->counter=-2147437227
[ 137.781434] @@@ Rejecting checkpointed counter with sample period 2147437972
kvm_pmu: event creation failed -95 config=2000001c4 sample_period=2147437972 intr=1 pmc->counter=-2147437972
[ 138.025408] @@@ Rejecting checkpointed counter with sample period 2147483645
kvm_pmu: event creation failed -95 config=2000001c4 sample_period=2147483645 intr=1 pmc->counter=281472829227011
[ 141.748372] @@@ Rejecting checkpointed counter with sample period 2147437995
kvm_pmu: event creation failed -95 config=2000001c4 sample_period=2147437995 intr=1 pmc->counter=-2147437995
[ 145.071246] @@@ Rejecting checkpointed counter with sample period 2147437972
kvm_pmu: event creation failed -95 config=2000001c4 sample_period=2147437972 intr=1 pmc->counter=-2147437972
[ 149.301701] @@@ Rejecting checkpointed counter with sample period 2147475889
kvm_pmu: event creation failed -95 config=2000001c4 sample_period=2147475889 intr=1 pmc->counter=-2147475889
[ 149.301738] @@@ Rejecting checkpointed counter with sample period 2147475889
kvm_pmu: event creation failed -95 config=2000001c4 sample_period=2147475889 intr=1 pmc->counter=-281477124186545
[ 149.301901] @@@ Rejecting checkpointed counter with sample period 2147475889
kvm_pmu: event creation failed -95 config=2000001c4 sample_period=2147475889 intr=1 pmc->counter=-562952100897201
I have a theory that maybe this could occur if a tracee task gets scheduled off a CPU and then back onto the CPU (or another CPU) without rr resetting the counter in between, in which case KVM will try to setup a counter with some ticks already counted. That might happen infrequently enough to explain this. When a tracee task gets descheduled it's normally because it trapped to rr for whatever reason, in which case rr will reset the counter before that task gets scheduled again.
I haven't managed to disprove this theory yet.
Assuming this theory is correct, I'm having a hard time thinking of a workaround. I'm also having a hard time understanding why this works OK on ci.rr-project.org.
One possible workaround would be to try pinning our IN_TXCP counter. If pinning fails we should be able to detect that and ignore it ... and hope HLE didn't get used in that interval.
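(A minimal sketch of what requesting pinning would look like; attr.pinned only applies to group leaders, and per the perf_event_open man page a pinned event the kernel can't keep on the PMU goes into an error state where read() returns 0 bytes, which is at least detectable. The raw config here is a placeholder, and whether KVM propagates any of this from the guest is exactly the open question.)

#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Request that a counter be pinned to the PMU; a pinned counter either stays
   scheduled or fails loudly instead of silently losing counts. */
int main(void) {
  struct perf_event_attr attr;
  uint64_t count;
  memset(&attr, 0, sizeof(attr));
  attr.type = PERF_TYPE_RAW;
  attr.size = sizeof(attr);
  attr.config = 0x2005101c4; /* placeholder IN_TXCP-style encoding */
  attr.pinned = 1;           /* keep this event on the PMU or error out */
  attr.exclude_kernel = 1;

  int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
  if (fd < 0) {
    perror("perf_event_open");
    return 1;
  }
  ssize_t n = read(fd, &count, sizeof(count));
  if (n == 0)
    fprintf(stderr, "pinned counter went into the error state\n");
  return 0;
}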
Also I think we can/should fix KVM to force it to pass a zero sample_period for checkpointed events (since it's just going to fail otherwise). I'm testing that now but that should make rr work.
I think I can probably write test code to detect this KVM bug and run it on every rr session. The question then is what to do if we detect it.
I have a feeling that requesting pinning for rr's IN_TXCP counter won't help here because the failure is silent from the point of view of the guest. We just drop all the samples that occur after the tracee gets rescheduled.
This patch fixes rr tests for me:
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 06ce377..1cb8046 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -113,12 +113,17 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
 		.config = config,
 	};
 
+	attr.sample_period = (-pmc->counter) & pmc_bitmask(pmc);
+
 	if (in_tx)
 		attr.config |= HSW_IN_TX;
-	if (in_tx_cp)
+	if (in_tx_cp) {
+		/* HSW_IN_TX_CHECKPOINTED is not supported with nonzero
+		   period. Just clear the sample period so at least
+		   allocating the counter doesn't fail. */
+		attr.sample_period = 0;
 		attr.config |= HSW_IN_TX_CHECKPOINTED;
-
-	attr.sample_period = (-pmc->counter) & pmc_bitmask(pmc);
+	}
 
 	event = perf_event_create_kernel_counter(&attr, -1, current,
 						 intr ? kvm_perf_overflow_intr :
The reason this doesn't show up on ci.rr-project.org is that it is currently running on Sandy Bridge hardware, so our attempts to use IN_TXCP "just work"; the EOPNOTSUPP path isn't even reached.
I'll try adding a desched counter that we can just monitor from rr. If we detect that the tracee was descheduled more than once and we're running on a system where we can see the KVM PMU bug then just for that perfcounter measurement we can ignore the IN_TXCP counter and use the regular counter instead, and hope we didn't do any HLE in that interval.
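(Something along these lines, i.e. a per-task software context-switch counter that we read at the end of each measurement interval; this is a sketch only, not rr's actual desched machinery.)

#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Count how many times |tracee| is scheduled off the CPU. If this exceeds 1
   during an interval on a buggy KVM host, the IN_TXCP counter may have
   silently dropped ticks in that interval. */
int open_desched_counter(pid_t tracee) {
  struct perf_event_attr attr;
  memset(&attr, 0, sizeof(attr));
  attr.type = PERF_TYPE_SOFTWARE;
  attr.size = sizeof(attr);
  attr.config = PERF_COUNT_SW_CONTEXT_SWITCHES;
  /* context switches are counted in kernel context, so don't exclude it */
  attr.exclude_kernel = 0;
  return syscall(__NR_perf_event_open, &attr, tracee, -1, -1, 0);
}

uint64_t read_descheds(int fd) {
  uint64_t count = 0;
  if (read(fd, &count, sizeof(count)) != (ssize_t)sizeof(count))
    return UINT64_MAX; /* treat a failed read as "unknown, assume the worst" */
  return count;
}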
I've also submitted a patch to upstream KVM similar to what I mentioned above.
KVM patch submission: http://marc.info/?l=linux-kernel&m=148582794808419&w=2
I'm having difficulty getting test code in rr to produce the KVM bug reliably :-(
OK, this code reproduces the bug easily. Setting a sample_period of 0 is the key. Then we just need to disable and reenable the counter to trigger the EOPNOTSUPP and subsequent failure to count events.
#include <linux/perf_event.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
int main(void) {
  struct perf_event_attr attr;
  int fd;
  long long count;

  memset(&attr, 0, sizeof(attr));
  attr.type = PERF_TYPE_RAW;
  attr.size = sizeof(attr);
  attr.config = 0x2005101c4; // conditional branches retired IN_TXCP
  attr.sample_period = 0;
  attr.exclude_kernel = 1;
  attr.exclude_guest = 1;

  fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
  if (fd < 0) {
    perror("perf_event_open");
    return 1;
  }
  // Disabling and reenabling forces KVM to reprogram the counter, which is
  // what hits the EOPNOTSUPP path on a buggy host.
  ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
  ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
  for (int i = 0; i < 500; ++i) {
    putchar('.');
  }
  read(fd, &count, sizeof(count));
  printf("\nConditional branches: %lld\n%s\n", count,
         count < 500 ? "FAILED" : "PASSED");
  return 0;
}
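(For anyone who wants to try it: I just build it with something like gcc -std=c99 and run it inside the guest; it should print FAILED where the IN_TXCP counter drops events and PASSED where those counters work properly.)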
So I can detect the bug just fine now, but my workaround doesn't work reliably. Sometimes we still drop events and fail to detect that, i.e. we didn't detect more than one desched.
I guess relying on counting descheds is inherently fragile, even if I could get it to work, because it relies on the assumption that the guest OS never sets a counter register except when absolutely necessary, and there's no reason to believe that.
I'm thinking of just dropping support for the KVM-bug + HLE combination. If your application uses HLE, you can't use rr on it in a KVM guest whose host has the KVM IN_TXCP bug. We'll disable the rr HLE test when running in a buggy KVM guest. We'll have rr detect the KVM bug, and when detected we stop using the IN_TXCP counter and instead set up an HLE_RETIRED.START counter so we can detect HLE usage. When we detect it, we warn the user and abort unless --force-things is specified.
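(A sketch of the detection counter I have in mind; the 0x01c8 encoding assumes HLE_RETIRED.START is event 0xC8, umask 0x01 as on Haswell-era cores, so it needs checking against the SDM for each microarchitecture.)

#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Count HLE transaction starts in the tracee so we can warn/abort if the
   application actually uses HLE while the IN_TXCP counter is disabled. */
int main(void) {
  struct perf_event_attr attr;
  uint64_t hle_starts = 0;
  memset(&attr, 0, sizeof(attr));
  attr.type = PERF_TYPE_RAW;
  attr.size = sizeof(attr);
  attr.config = 0x01c8; /* assumed HLE_RETIRED.START: event 0xC8, umask 0x01 */
  attr.exclude_kernel = 1;

  int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
  if (fd < 0) {
    perror("perf_event_open");
    return 1;
  }
  /* ... run the interval being measured ... */
  read(fd, &hle_starts, sizeof(hle_starts));
  if (hle_starts > 0)
    fprintf(stderr, "HLE detected; ticks may be wrong under buggy KVM\n");
  return 0;
}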
Now I'm getting spurious HLE events :-(. Could be related to https://github.com/mozilla/rr/issues/1961
It doesn't matter whether I count starts or aborts, I still get spurious events.
FTR I only see spurious HLE events on certain tests, e.g. async_signal_syscalls. In that test they're always associated with a SIGSTKFLT being received. The number of events reported is nondeterministic. Could be a hardware issue, could be something in the kernel I suppose.
OK another idea is to count IN_TX conditional branches and warn/bail if that rises above zero.
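(Concretely, something like the conditional-branch event from the reproducer above but with the IN_TX bit, bit 32, set instead of the IN_TXCP bit, bit 33, treating any nonzero count as "a transaction happened in this interval". The base encoding is assumed from the reproducer, so verify it before relying on it.)

#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Count retired conditional branches executed inside a transaction (IN_TX). */
int main(void) {
  struct perf_event_attr attr;
  uint64_t in_tx_branches = 0;
  memset(&attr, 0, sizeof(attr));
  attr.type = PERF_TYPE_RAW;
  attr.size = sizeof(attr);
  attr.config = 0x5101c4 | (1ULL << 32); /* base encoding from the reproducer, plus IN_TX */
  attr.exclude_kernel = 1;
  attr.exclude_guest = 1;

  int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
  if (fd < 0) {
    perror("perf_event_open");
    return 1;
  }
  /* ... run the interval being measured ... */
  read(fd, &in_tx_branches, sizeof(in_tx_branches));
  if (in_tx_branches > 0)
    fprintf(stderr, "transactional execution detected in this interval\n");
  return 0;
}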
Yes, this looks like a winner!
We get ticks divergence in various async-signal testcases. rr revision 73ee8002b88b236d95a3001ad2e3a84cdd6a9e85 is the first bad changeset.