emilio opened this issue 4 years ago
Ah, for reference, kernel is 5.8.16-300.fc33.x86_64, CPU is AMD Ryzen Threadripper 3990X. I'll try to downgrade and confirm tests are passing there.
I installed 5.6.8-300.fc32.x86_64 from here: https://kojipkgs.fedoraproject.org//packages/kernel/5.6.8/300.fc32/x86_64/ and all tests pass, FWIW.
5.7.9-200.fc32.x86_64 is also good.
I get a similar issue, with lower frequency, on RHEL kernel 4.18.0-193.28.1.el8_2.x86_64. For me it only shows up on LibreOffice tests (the rr tests pass), but there it shows up 100% of the time. So I'm not sure it's the same bug, but it's suggestive that we see this on RH kernels in both cases.
5.8.5-300.fc33.x86_64 shows some ptrace_ failures, but those were an unrelated kernel regression IIRC, so let's call it good.
In 5.8.10-300.fc33.x86_64 I got a couple of failures in two runs (1.txt, 2.txt), but none of them looked related to this. One seems like ppoll was interrupted (maybe a test bug? It happened in both runs). The others are a timeout and an assert that looked potentially interesting:
Assertion `checksum == rec_checksum' failed to hold. Divergence in contents of memory segment after 'SYSCALL: openat'
But still nothing like what I'm looking for.
For my own future reference, what I'm doing to bisect is something like:
for p in kernel kernel-core kernel-modules kernel-modules-extra kernel-modules-internal; do wget https://kojipkgs.fedoraproject.org/packages/kernel/5.8.13/300.fc33/x86_64/$p-5.8.13-300.fc33.x86_64.rpm ;done
sudo dnf install *.rpm
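As a follow-up step, a minimal sketch of how I then verify the downgraded kernel is actually running, assuming the freshly installed kernel becomes the default GRUB entry on Fedora (the version string is just the one from the example above):
sudo reboot
uname -r    # should print 5.8.13-300.fc33.x86_64 before re-running the rr tests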
5.8.13-300.fc33.x86_64 / 5.8.15-301.fc33.x86_64 are good, but I restarted with the latest kernel (5.8.16-300.fc33.x86_64) and now tests pass, so it's not clear to me what has to happen to make the failure reproducible.
When I found it, my desktop had been on for a while and had gone through a few suspend cycles, but other than that... (and yes, I made sure to apply the Zen-specific workarounds after suspend).
I updated my laptop to F33 and I do not see any errors of this type in one run of the rr tests. This is kernel 5.8.16-300.fc33.x86_64.
I haven't seen this lately. I don't know what happened on my computer that day but it's not like I can do forensics now so let's close this for now.
@rocallahan I figured out what was causing this. The kernel might reduce perf_event_max_sample_rate automatically, and that seems to cause this.
Sometimes when compiling Firefox / Chromium / WebKit, I get these in dmesg:
[ 6969.817121] perf: interrupt took too long (2519 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
[ 6997.346206] perf: interrupt took too long (3221 > 3148), lowering kernel.perf_event_max_sample_rate to 62000
[ 7492.775438] perf: interrupt took too long (4032 > 4026), lowering kernel.perf_event_max_sample_rate to 49000
It seems that increasing this value back helps. I don't know if the right fix is just to do that, or to set perf_cpu_time_max_percent=0 (which should disable this mechanism), or something else. Do you happen to know?
I think you'd want to set perf_cpu_time_max_percent=100, but yeah, that looks right.
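For reference, a minimal sketch of checking and adjusting those sysctls; the sample-rate value below is purely illustrative, not a recommendation from this thread:
sysctl kernel.perf_event_max_sample_rate kernel.perf_cpu_time_max_percent    # inspect what the kernel lowered them to
sudo sysctl -w kernel.perf_event_max_sample_rate=100000                      # raise the rate back (illustrative value)
sudo sysctl -w kernel.perf_cpu_time_max_percent=100                          # then disable the auto-lowering, per the suggestion above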
Is there anything actionable on rr's side? Or should I just document this in the wiki and close this?
We too saw this on our CI (https://build.julialang.org/#/builders/34/builds/7571/steps/5/logs/stdio), but there was no corresponding dmesg notice.
@Keno Does setting perf_cpu_time_max_percent make it go away?
I don't know. @staticfloat can you try setting this sysctl on the CI machine, so we can watch if we see this again?
Done. We'll see if it happens again.
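In case it helps, a sketch of making that setting persistent across reboots on the CI machine, assuming a standard sysctl.d setup (the drop-in file name is arbitrary):
echo 'kernel.perf_cpu_time_max_percent = 100' | sudo tee /etc/sysctl.d/90-rr-perf.conf
sudo sysctl --system    # reload all sysctl config so it takes effect without a reboot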
https://build.julialang.org/#/builders/34/builds/7701 is from yesterday with the same failure, so the sysctl should have been applied.
Seeing this on AMD only BTW, happens about once a day or so for us.
We're continuing to see this a few times per week. No discernible pattern as to workload.
@rocallahan does IN_TXCP make any sense on AMD? I thought that was an Intel-specific feature.
> We're continuing to see this a few times per week. No discernible pattern as to workload.
Still AMD only?
> Still AMD only?
Yes
Alright, looking into this some more: I think IN_TXCP is indeed an Intel-only thing, but it happens to be allowed on AMD since AMD has some extra event-select bits there, so as far as I can tell setting it changes the PMC event to 0x2d1 rather than plain 0xd1. I can't find that documented anywhere, but clearly it must be counting something to trigger this assertion. I'll send a patch to stop trying to set IN_TXCP on AMD and we'll see if anything else starts going horribly wrong.
Nevermind, I think I confused myself by looking at the wrong line numbers. We don't actually set IN_TXCP on AMD. Will keep poking.
Perhaps the interrupt is just unreliable or gets dropped in some situations? I guess we could try bumping the skid size, but we're already allowing 50k ticks of skid, which seems quite excessive.
Hmm, I came across https://github.com/torvalds/linux/commit/914123fa39042e651d79eaf86bbf63a1b938dddf by @tlendacky which has the scary words "PMC" and "race condition" in the commit message, but that commit message says something about spurious kernel messages and panics, while our situation would be more likely to be a missed interrupt. That said, in a later commit (https://github.com/torvalds/linux/commit/df4d29732fdad43a51284f826bec3e6ded177540), there is also a reference to long NMI latencies (100ms). 100ms worth of delay could certainly cause 50k ticks worth of branches to retire. I doubt it's actually that long, but that lends some credence at least to the possibility that this is actually just a really long skid. @tlendacky would you be able to shed some light on the expected latencies between a PMC overflow and the firing of the NMI?
I just added a commit that turns this assertion into a warning during record. I'm still interested in understanding why this happens, so if anybody comes up with any clues, please do let me know. But hopefully for now this should continue recording in production use cases, since we don't actually rely on non-overshoot during record (we do during replay).
I recently updated my machine to Fedora 33 and I'm seeing a bunch of these assertions when recording / replaying Firefox.
The rr tests are enough to trigger this fairly frequently.
Verbose log from ctest -VV -j$(nproc): log.txt
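In case it's useful for narrowing things down, ctest can re-run just a subset of the suite by test-name pattern; the pattern below is only an example, not a specific rr test name:
ctest -VV -R checksum -j$(nproc)    # re-run only tests whose names match 'checksum'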