rr-debugger / rr

Record and Replay Framework
http://rr-project.org/

Frequent "Assertion `!counting_period || interrupt_val <= adjusted_counting_period' failed to hold." after update to Fedora 33 #2720

Open emilio opened 4 years ago

emilio commented 4 years ago

I recently updated my machine to Fedora 33 and I'm seeing a bunch of these assertions when recording / replaying Firefox.

The rr tests are enough to trigger this fairly frequently.

Verbose log from ctest -VV -j$(nproc): log.txt

emilio commented 4 years ago

Ah, for reference, kernel is 5.8.16-300.fc33.x86_64, CPU is AMD Ryzen Threadripper 3990X. I'll try to downgrade and confirm tests are passing there.

emilio commented 4 years ago

I installed 5.6.8-300.fc32.x86_64 from here https://kojipkgs.fedoraproject.org//packages/kernel/5.6.8/300.fc32/x86_64/ and all tests pass fwiw.

emilio commented 4 years ago

5.7.9-200.fc32.x86_64 is also good.

rocallahan commented 4 years ago

I get a similar issue, with lower frequency, on RHEL kernel 4.18.0-193.28.1.el8_2.x86_64. For me it only shows up on LibreOffice tests (the rr tests pass), but there it shows up 100% of the time.

rocallahan commented 4 years ago

So I'm not sure it's the same bug, but it's suggestive that we're seeing this on RH kernels in both cases.

emilio commented 4 years ago

5.8.5-300.fc33.x86_64 shows some ptrace_ failures, but those were an unrelated kernel regression IIRC, so let's call it good.

emilio commented 4 years ago

On 5.8.10-300.fc33.x86_64 I got a couple of failures across two runs (1.txt, 2.txt), but none of them looked related to this. One looks like ppoll was interrupted (maybe a test bug? It happened in both runs). The others are a timeout and an assert that looked potentially interesting:

Assertion `checksum == rec_checksum' failed to hold. Divergence in contents of memory segment after 'SYSCALL: openat'

But still nothing like what I'm looking for.

emilio commented 4 years ago

For my own future reference, what I'm doing to bisect is something like:

# fetch the kernel subpackages for a given koji build, then install them locally
for p in kernel kernel-core kernel-modules kernel-modules-extra kernel-modules-internal; do wget https://kojipkgs.fedoraproject.org/packages/kernel/5.8.13/300.fc33/x86_64/$p-5.8.13-300.fc33.x86_64.rpm ;done
sudo dnf install *.rpm

emilio commented 4 years ago

5.8.13-300.fc33.x86_64 / 5.8.15-301.fc33.x86_64 are good, but I restarted with the latest kernel (5.8.16-300.fc33.x86_64) and now the tests pass there too, so it's not clear to me what has to happen to make the failure reproducible.

When I first hit it, my desktop had been on for a while and had gone through a few suspend cycles, but other than that... (and yes, I made sure to apply the Zen-specific workarounds after suspend).
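
In case it's useful to anyone else: one way to re-apply the workaround automatically after every resume is a systemd sleep hook, roughly like the sketch below (the hook name, and the path to rr's zen_workaround.py script, are placeholders to adjust for your setup):

#!/bin/sh
# sketch: install as an executable file under /usr/lib/systemd/system-sleep/
# systemd runs these hooks with $1 = "pre" or "post" around suspend/hibernate
if [ "$1" = "post" ]; then
    /path/to/rr/scripts/zen_workaround.py   # wherever your rr checkout lives
fi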

rocallahan commented 4 years ago

I updated my laptop to F33 and I do not see any errors of this type in one run of the rr tests. This is kernel 5.8.16-300.fc33.x86_64.

emilio commented 4 years ago

I haven't seen this lately. I don't know what happened on my computer that day, but it's not like I can do forensics now, so let's close this for now.

emilio commented 3 years ago

@rocallahan I figured out what was causing this. The kernel may automatically lower perf_event_max_sample_rate, and that seems to trigger the assertion.

Sometimes when compiling Firefox / Chromium / WebKit, I get these on dmesg:

[ 6969.817121] perf: interrupt took too long (2519 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
[ 6997.346206] perf: interrupt took too long (3221 > 3148), lowering kernel.perf_event_max_sample_rate to 62000
[ 7492.775438] perf: interrupt took too long (4032 > 4026), lowering kernel.perf_event_max_sample_rate to 49000

Increasing this value back up seems to help. I don't know whether the right fix is to just do that, or to set perf_cpu_time_max_percent=0 (which should disable this mechanism), or something else. Do you happen to know?
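
Concretely, undoing the adjustment looks roughly like this (100000 is just the stock default for kernel.perf_event_max_sample_rate, and the 0-vs-100 choice for perf_cpu_time_max_percent is exactly the question above):

# put the sample rate back at its stock default
sudo sysctl kernel.perf_event_max_sample_rate=100000
# keep the kernel from lowering it again (0 disables the CPU-time throttling check)
sudo sysctl kernel.perf_cpu_time_max_percent=0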

khuey commented 3 years ago

I think you'd want to set perf_cpu_time_max_percent=100 but yeah that looks right.

emilio commented 3 years ago

Is there anything actionable on rr's side? Or should I just document this in the wiki and close this?

Keno commented 3 years ago

We too saw this on our CI (https://build.julialang.org/#/builders/34/builds/7571/steps/5/logs/stdio), but there was no corresponding dmesg notice.
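
(For reference, even if the message has rotated out of the dmesg buffer, the current limits can be read back directly; something like the following shows whether throttling has kicked in at some point:)

# the throttling message, if it is still in the ring buffer
dmesg | grep 'perf: interrupt took too long'
# current values; a lowered perf_event_max_sample_rate suggests the kernel throttled it
sysctl kernel.perf_event_max_sample_rate kernel.perf_cpu_time_max_percent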

khuey commented 3 years ago

@Keno Does setting perf_cpu_time_max_percent make it go away?

Keno commented 3 years ago

I don't know. @staticfloat can you try setting this sysctl on the CI machine, so we can watch whether we see this again?
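
Something along these lines should also make it stick across reboots on the builder (the file name is arbitrary, and 0 vs. 100 is per the discussion above):

echo 'kernel.perf_cpu_time_max_percent = 0' | sudo tee /etc/sysctl.d/99-rr-perf.conf
sudo sysctl --system   # reload sysctl settings, including the new file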

staticfloat commented 3 years ago

Done. We'll see if it happens again.

Keno commented 3 years ago

https://build.julialang.org/#/builders/34/builds/7701 is from yesterday with the same failure, so the sysctl should already have been applied.

Keno commented 3 years ago

We're seeing this on AMD only, BTW; it happens about once a day for us.

Keno commented 3 years ago

We're continuing to see this a few times per week. No discernible pattern as to workload.

Keno commented 3 years ago

@rocallahan does IN_TXCP make any sense on AMD? I thought that was an Intel-specific feature.

khuey commented 3 years ago

> We're continuing to see this a few times per week. No discernible pattern as to workload.

Still AMD only?

Keno commented 3 years ago

> Still AMD only?

Yes

Keno commented 3 years ago

Alright, looking into this some more: I think IN_TXCP is indeed an Intel-only thing, but it happens to be accepted on AMD because AMD has some extra event-select bits in that position. As far as I can tell, the effect is to change the PMC event to 0x2d1 rather than plain 0xd1. I can't find that documented anywhere, but clearly it must be counting something in order to trigger this assertion. I'll send a patch to stop trying to set IN_TXCP on AMD, and we'll see if anything else starts going horribly wrong.

Keno commented 3 years ago

Nevermind, I think I confused myself by looking at the wrong line numbers. We don't actually set IN_TXCP on AMD. Will keep poking.

Keno commented 3 years ago

Perhaps the interrupt is just unreliable or gets dropped in some situations? I guess we could try bumping the skid size, but we're already allowing 50k ticks skid, which seems quite excessive.

Keno commented 3 years ago

Hmm, I came across https://github.com/torvalds/linux/commit/914123fa39042e651d79eaf86bbf63a1b938dddf by @tlendacky, which has the scary words "PMC" and "race condition" in the commit message. However, that message talks about spurious kernel messages and panics, while our situation looks more like a missed interrupt. That said, a later commit (https://github.com/torvalds/linux/commit/df4d29732fdad43a51284f826bec3e6ded177540) also mentions long NMI latencies (100ms). 100ms of delay could certainly let 50k ticks' worth of branches retire. I doubt it's actually that long, but it does lend some credence to the possibility that this is really just a very long skid. @tlendacky, would you be able to shed some light on the expected latency between a PMC overflow and the firing of the NMI?
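
(Back-of-the-envelope, with a purely assumed rate: if the workload retires on the order of 10^9 conditional branches per second, the 50k-tick skid budget corresponds to only about 50 µs, so an NMI latency anywhere near the 100 ms mentioned in that commit would blow past it by roughly three orders of magnitude.)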

Keno commented 3 years ago

I just added a commit that turns this assertion into a warning during record. I'm still interested in understanding why this happens, so if anybody comes up with any clues, please do let me know. But hopefully, for now, this lets recording continue in production use cases, since we don't actually rely on non-overshoot during record (we do rely on it during replay).