rr-debugger / rr

Record and Replay Framework
http://rr-project.org/

Frequent "Assertion `!counting_period || interrupt_val <= adjusted_counting_period' failed to hold." after update to Fedora 33 #2720

Open emilio opened 4 years ago

emilio commented 4 years ago

I recently updated my machine to Fedora 33 and I'm seeing a bunch of these assertions when recording / replaying Firefox.

The rr tests are enough to trigger this fairly frequently.

Verbose log from ctest -VV -j$(nproc): log.txt

emilio commented 4 years ago

Ah, for reference, kernel is 5.8.16-300.fc33.x86_64, CPU is AMD Ryzen Threadripper 3990X. I'll try to downgrade and confirm tests are passing there.

emilio commented 4 years ago

I installed 5.6.8-300.fc32.x86_64 from here https://kojipkgs.fedoraproject.org//packages/kernel/5.6.8/300.fc32/x86_64/ and all tests pass fwiw.

emilio commented 4 years ago

5.7.9-200.fc32.x86_64 is also good.

rocallahan commented 4 years ago

I get a similar issue, with lower frequency, on RHEL kernel 4.18.0-193.28.1.el8_2.x86_64. For me it only shows up on LibreOffice tests (the rr tests pass), but there it shows up 100% of the time.

rocallahan commented 4 years ago

So I'm not sure it's the same bug, but it's suggestive that we're seeing this on RH kernels in both cases.

emilio commented 4 years ago

5.8.5-300.fc33.x86_64 shows some ptrace_ failures, but those were an unrelated kernel regression IIRC, so let's call it good.

emilio commented 4 years ago

On 5.8.10-300.fc33.x86_64 I got a couple of failures across two runs (1.txt, 2.txt), but none of them looked related to this. One looks like ppoll was interrupted (maybe a test bug? It happened in both runs). The others are a timeout and an assert that looked potentially interesting:

Assertion `checksum == rec_checksum' failed to hold. Divergence in contents of memory segment after 'SYSCALL: openat'

But still nothing like what I'm looking for.

emilio commented 4 years ago

For my own future reference, what I'm doing to bisect is something like:

# fetch the kernel subpackages for a given koji build, then install them locally
for p in kernel kernel-core kernel-modules kernel-modules-extra kernel-modules-internal; do wget https://kojipkgs.fedoraproject.org/packages/kernel/5.8.13/300.fc33/x86_64/$p-5.8.13-300.fc33.x86_64.rpm ;done
sudo dnf install *.rpm

emilio commented 4 years ago

5.8.13-300.fc33.x86_64 / 5.8.15-301.fc33.x86_64 are good, but I restarted with the latest kernel (5.8.16-300.fc33.x86_64) and now the tests pass there too, so it's not clear to me what has to happen to make the failure reproducible.

When I first hit it, my desktop had been on for a while and had gone through a few suspend cycles, but other than that... (and yes, I made sure to apply the Zen-specific workarounds after suspend).
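
In case it's useful to anyone else: one way to re-apply the workaround automatically after every resume is a systemd sleep hook, roughly like the sketch below (the hook name, and the path to rr's zen_workaround.py script, are placeholders to adjust for your setup):

#!/bin/sh
# sketch: install as an executable file under /usr/lib/systemd/system-sleep/
# systemd runs these hooks with $1 = "pre" or "post" around suspend/hibernate
if [ "$1" = "post" ]; then
    /path/to/rr/scripts/zen_workaround.py   # wherever your rr checkout lives
fi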

rocallahan commented 4 years ago

I updated my laptop to F33 and I do not see any errors of this type in one run of the rr tests. This is kernel 5.8.16-300.fc33.x86_64.

emilio commented 4 years ago

I haven't seen this lately. I don't know what happened on my computer that day, but it's not like I can do forensics now, so let's close this for now.

emilio commented 3 years ago

@rocallahan I figured out what was causing this. The kernel may automatically lower perf_event_max_sample_rate, and that seems to trigger the assertion.

Sometimes when compiling Firefox / Chromium / WebKit, I get these on dmesg:

[ 6969.817121] perf: interrupt took too long (2519 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
[ 6997.346206] perf: interrupt took too long (3221 > 3148), lowering kernel.perf_event_max_sample_rate to 62000
[ 7492.775438] perf: interrupt took too long (4032 > 4026), lowering kernel.perf_event_max_sample_rate to 49000

Increasing this value back up seems to help. I don't know whether the right fix is to just do that, or to set perf_cpu_time_max_percent=0 (which should disable this mechanism), or something else. Do you happen to know?
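
Concretely, undoing the adjustment looks roughly like this (100000 is just the stock default for kernel.perf_event_max_sample_rate, and the 0-vs-100 choice for perf_cpu_time_max_percent is exactly the question above):

# put the sample rate back at its stock default
sudo sysctl kernel.perf_event_max_sample_rate=100000
# keep the kernel from lowering it again (0 disables the CPU-time throttling check)
sudo sysctl kernel.perf_cpu_time_max_percent=0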

khuey commented 3 years ago

I think you'd want to set perf_cpu_time_max_percent=100 but yeah that looks right.

emilio commented 3 years ago

Is there anything actionable on rr's side? Or should I just document this in the wiki and close this?

Keno commented 3 years ago

We too saw this on our CI (https://build.julialang.org/#/builders/34/builds/7571/steps/5/logs/stdio), but there was no corresponding dmesg notice.
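
(For reference, even if the message has rotated out of the dmesg buffer, the current limits can be read back directly; something like the following shows whether throttling has kicked in at some point:)

# the throttling message, if it is still in the ring buffer
dmesg | grep 'perf: interrupt took too long'
# current values; a lowered perf_event_max_sample_rate suggests the kernel throttled it
sysctl kernel.perf_event_max_sample_rate kernel.perf_cpu_time_max_percent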

khuey commented 3 years ago

@Keno Does setting perf_cpu_time_max_percent make it go away?

Keno commented 3 years ago

I don't know. @staticfloat can you try setting this sysctl on the CI machine, so we can watch whether we see this again?
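
Something along these lines should also make it stick across reboots on the builder (the file name is arbitrary, and 0 vs. 100 is per the discussion above):

echo 'kernel.perf_cpu_time_max_percent = 0' | sudo tee /etc/sysctl.d/99-rr-perf.conf
sudo sysctl --system   # reload sysctl settings, including the new file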

staticfloat commented 3 years ago

Done. We'll see if it happens again.

Keno commented 3 years ago

https://build.julialang.org/#/builders/34/builds/7701 is from yesterday with the same failure, so the sysctl should already have been applied.

Keno commented 3 years ago

We're seeing this on AMD only, BTW; it happens about once a day for us.

Keno commented 3 years ago

We're continuing to see this a few times per week. No discernible pattern as to workload.

Keno commented 3 years ago

@rocallahan does IN_TXCP make any sense on AMD? I thought that was an Intel-specific feature.

khuey commented 3 years ago

> We're continuing to see this a few times per week. No discernible pattern as to workload.

Still AMD only?

Keno commented 3 years ago

> Still AMD only?

Yes

Keno commented 3 years ago

Alright, looking into this some more: I think IN_TXCP is indeed an Intel-only thing, but it happens to be accepted on AMD because AMD has some extra event-select bits in that position. As far as I can tell, the effect is to change the PMC event to 0x2d1 rather than plain 0xd1. I can't find that documented anywhere, but clearly it must be counting something in order to trigger this assertion. I'll send a patch to stop trying to set IN_TXCP on AMD, and we'll see if anything else starts going horribly wrong.

Keno commented 3 years ago

Nevermind, I think I confused myself by looking at the wrong line numbers. We don't actually set IN_TXCP on AMD. Will keep poking.

Keno commented 3 years ago

Perhaps the interrupt is just unreliable or gets dropped in some situations? I guess we could try bumping the skid size, but we're already allowing 50k ticks skid, which seems quite excessive.

Keno commented 3 years ago

Hmm, I came across https://github.com/torvalds/linux/commit/914123fa39042e651d79eaf86bbf63a1b938dddf by @tlendacky, which has the scary words "PMC" and "race condition" in the commit message. However, that message talks about spurious kernel messages and panics, while our situation looks more like a missed interrupt. That said, a later commit (https://github.com/torvalds/linux/commit/df4d29732fdad43a51284f826bec3e6ded177540) also mentions long NMI latencies (100ms). 100ms of delay could certainly let 50k ticks' worth of branches retire. I doubt it's actually that long, but it does lend some credence to the possibility that this is really just a very long skid. @tlendacky, would you be able to shed some light on the expected latency between a PMC overflow and the firing of the NMI?
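
(Back-of-the-envelope, with a purely assumed rate: if the workload retires on the order of 10^9 conditional branches per second, the 50k-tick skid budget corresponds to only about 50 µs, so an NMI latency anywhere near the 100 ms mentioned in that commit would blow past it by roughly three orders of magnitude.)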

Keno commented 3 years ago

I just added a commit that turns this assertion into a warning during record. I'm still interested in understanding why this happens, so if anybody comes up with any clues, please do let me know. But hopefully, for now, this lets recording continue in production use cases, since we don't actually rely on non-overshoot during record (we do rely on it during replay).