Closed rocallahan closed 4 years ago
Results from Zen 2 / Ryzen 9 3900X Matisse:
root@ubuntu:~# for i in $(seq 1 10); do perf stat -e 'cpu/event=0xc2/u,cpu/event=0xc4/u,cpu/event=0xc6/u' ./foo; done
./foo: Segmentation fault
Performance counter stats for './foo':
8589939046 cpu/event=0xc2/u
8589937056 cpu/event=0xc4/u
504 cpu/event=0xc6/u
2.019423217 seconds time elapsed
./foo: Segmentation fault
Performance counter stats for './foo':
8589939045 cpu/event=0xc2/u
8589937055 cpu/event=0xc4/u
503 cpu/event=0xc6/u
2.013937315 seconds time elapsed
./foo: Segmentation fault
Performance counter stats for './foo':
8589939043 cpu/event=0xc2/u
8589937053 cpu/event=0xc4/u
501 cpu/event=0xc6/u
2.012184478 seconds time elapsed
./foo: Segmentation fault
Performance counter stats for './foo':
8589939045 cpu/event=0xc2/u
8589937055 cpu/event=0xc4/u
503 cpu/event=0xc6/u
2.015136473 seconds time elapsed
./foo: Segmentation fault
Performance counter stats for './foo':
8589939066 cpu/event=0xc2/u
8589937076 cpu/event=0xc4/u
524 cpu/event=0xc6/u
2.106237815 seconds time elapsed
./foo: Segmentation fault
Performance counter stats for './foo':
8589939060 cpu/event=0xc2/u
8589937070 cpu/event=0xc4/u
518 cpu/event=0xc6/u
2.086466676 seconds time elapsed
./foo: Segmentation fault
Performance counter stats for './foo':
8589939044 cpu/event=0xc2/u
8589937054 cpu/event=0xc4/u
502 cpu/event=0xc6/u
2.020398949 seconds time elapsed
./foo: Segmentation fault
Performance counter stats for './foo':
8589939065 cpu/event=0xc2/u
8589937075 cpu/event=0xc4/u
523 cpu/event=0xc6/u
2.098308800 seconds time elapsed
./foo: Segmentation fault
Performance counter stats for './foo':
8589939063 cpu/event=0xc2/u
8589937073 cpu/event=0xc4/u
521 cpu/event=0xc6/u
2.096092893 seconds time elapsed
./foo: Segmentation fault
Performance counter stats for './foo':
8589939065 cpu/event=0xc2/u
8589937075 cpu/event=0xc4/u
523 cpu/event=0xc6/u
2.096787257 seconds time elapsed
The first 773 tests seem to fail when I run make test (except the first environment check) and then it seems to hang.
What kernel veersion did you collect those results on?
Ubuntu 18.04 LTS. Newer ones wouldn't boot, maybe due to the RDRAND fuckup or something - even though this BIOS is barely a week old, sigh. That is using kernel 4.15.0.
FYI, if it is the systemd RDRAND issue, Debian should work — the workaround is in version 241-4, which is in stable
and newer.
I'm guessing it is, can try Debian if needed. I wrote 4.5 originally but that probably should have been 4.15 of course as that's what the 18.04.2 livecd had. I realized this reading the news that it was just updated with 5.0.
Anyway if I need to restest something specific let me know.
@khuey @pipcet Sorry for nagging but is there anything further I can/should do to confirm/deny whether rr will be usable on Zen 2?
It would be useful to know if the results from that test binary are different on the latest kernel.
Linux ubuntu 5.0.0-23-generic #24~18.04.1-Ubuntu SMP Mon Jul 29 16:12:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
./foo: Segmentation fault
Performance counter stats for './foo':
8589943198 cpu/event=0xc2/u
8589939031 cpu/event=0xc4/u
498 cpu/event=0xc6/u
1.976349693 seconds time elapsed
1.909500000 seconds user
0.000000000 seconds sys
./foo: Segmentation fault
Performance counter stats for './foo':
8589943203 cpu/event=0xc2/u
8589939036 cpu/event=0xc4/u
503 cpu/event=0xc6/u
1.995666718 seconds time elapsed
1.927300000 seconds user
0.000000000 seconds sys
./foo: Segmentation fault
Performance counter stats for './foo':
8589943200 cpu/event=0xc2/u
8589939033 cpu/event=0xc4/u
500 cpu/event=0xc6/u
1.987421704 seconds time elapsed
1.920565000 seconds user
0.000000000 seconds sys
./foo: Segmentation fault
Performance counter stats for './foo':
8589943219 cpu/event=0xc2/u
8589939052 cpu/event=0xc4/u
519 cpu/event=0xc6/u
2.060155899 seconds time elapsed
1.989380000 seconds user
0.000000000 seconds sys
./foo: Segmentation fault
Performance counter stats for './foo':
8589943218 cpu/event=0xc2/u
8589939051 cpu/event=0xc4/u
518 cpu/event=0xc6/u
2.056097273 seconds time elapsed
1.986810000 seconds user
0.000000000 seconds sys
./foo: Segmentation fault
Performance counter stats for './foo':
8589943205 cpu/event=0xc2/u
8589939038 cpu/event=0xc4/u
505 cpu/event=0xc6/u
2.002450342 seconds time elapsed
1.935343000 seconds user
0.000000000 seconds sys
./foo: Segmentation fault
Performance counter stats for './foo':
8589943225 cpu/event=0xc2/u
8589939058 cpu/event=0xc4/u
525 cpu/event=0xc6/u
2.085942737 seconds time elapsed
2.017330000 seconds user
0.000000000 seconds sys
./foo: Segmentation fault
Performance counter stats for './foo':
8589943230 cpu/event=0xc2/u
8589939063 cpu/event=0xc4/u
530 cpu/event=0xc6/u
2.088709423 seconds time elapsed
2.018087000 seconds user
0.000000000 seconds sys
./foo: Segmentation fault
Performance counter stats for './foo':
8589943220 cpu/event=0xc2/u
8589939053 cpu/event=0xc4/u
520 cpu/event=0xc6/u
2.060706893 seconds time elapsed
1.993520000 seconds user
0.000000000 seconds sys
./foo: Segmentation fault
Performance counter stats for './foo':
8589943205 cpu/event=0xc2/u
8589939038 cpu/event=0xc4/u
505 cpu/event=0xc6/u
1.998164297 seconds time elapsed
1.931053000 seconds user
0.000000000 seconds sys
Looks like c2-c6, c4-c6 and c2-c4 are all constant?
@khuey @pipcet Anything more that can be tested? All tests except the first seem to fail.
@rocallahan?
Sorry ... catching up here.
@gcp those results suggest that 0xc2 - 0xc6 might be usable (Retired Branch Instructions - Retired Far Control Transfers). Are you able to hack that into src/PerfCounters.cc
?
All tests except the first seem to fail.
Can you provide the verbose test log (ctest --verbose
)?
Also, can you rerun that foo test with 0xD1?
ubuntu@ubuntu:~$ for i in $(seq 1 10); do perf stat -e 'cpu/event=0xc2/u,cpu/event=0xc6/u,cpu/event=0xd1/u' ./foo; done
./foo: Segmentation fault
Performance counter stats for './foo':
8589943105 cpu/event=0xc2/u
521 cpu/event=0xc6/u
4294974925 cpu/event=0xd1/u
2.067338714 seconds time elapsed
1.998034000 seconds user
0.000000000 seconds sys
./foo: Segmentation fault
Performance counter stats for './foo':
8589943091 cpu/event=0xc2/u
507 cpu/event=0xc6/u
4294974925 cpu/event=0xd1/u
2.013885457 seconds time elapsed
1.944788000 seconds user
0.000000000 seconds sys
./foo: Segmentation fault
Performance counter stats for './foo':
8589943093 cpu/event=0xc2/u
509 cpu/event=0xc6/u
4294974925 cpu/event=0xd1/u
2.011394774 seconds time elapsed
1.943029000 seconds user
0.000000000 seconds sys
./foo: Segmentation fault
Performance counter stats for './foo':
8589943090 cpu/event=0xc2/u
506 cpu/event=0xc6/u
4294974925 cpu/event=0xd1/u
2.005370975 seconds time elapsed
1.938921000 seconds user
0.000000000 seconds sys
Jackpot I guess?!
Can you provide the verbose test log (ctest --verbose)?
After inspecting the code, it's just bailing out because it doesn't recognize the CPU model. I patched that up, made a Ryzen 2 based on Ryzen 1 (which seems to be using 0xd1?) and that is running tests now. Looks like most pass but there's still failures.
It doesn't seem to make any difference with join_threads parallel test, around 15 failures (from 200) both on 0x5100d1
, 0xd1
, 0xc4 - 0xc6
and 0xc2 - 0xc6
Same. Every combination has some tests intermittently fail. Typically alarm, thread join, mremap, etc. I guess Zen 2 is broken the same way as previous AMD CPUs.
It would still be nice to get a log from one of those failures.
On Thu, Sep 5, 2019, 11:35 AM Gian-Carlo Pascutto notifications@github.com wrote:
Same. Every combination has some tests intermittently fail. Typically alarm, thread join, mremap, etc. I guess Zen 2 is broken the same way as previous AMD CPUs.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/mozilla/rr/issues/2034?email_source=notifications&email_token=AACPSBBKZMANU2PAE2TXVFTQIFGNRA5CNFSM4DNAXKB2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD6AIQDY#issuecomment-528517135, or mute the thread https://github.com/notifications/unsubscribe-auth/AACPSBCXIZ5B6AJ7RHKNWU3QIFGNRANCNFSM4DNAXKBQ .
With 0x1d
diff --git a/src/PerfCounters.cc b/src/PerfCounters.cc
index b1882fc6..07e2c790 100644
--- a/src/PerfCounters.cc
+++ b/src/PerfCounters.cc
@@ -69,6 +69,7 @@ enum CpuMicroarch {
IntelKabylake,
AMDF15R30,
AMDRyzen,
+ AMDRyzen2,
};
/*
@@ -125,6 +126,8 @@ static const PmuConfig pmu_configs[] = {
{ AMDF15R30, "AMD Family 15h Revision 30h", 0xc4, 0xc6, 0, 250,
PMU_TICKS_TAKEN_BRANCHES | PMU_SKIP_INTEL_BUG_CHECK },
{ AMDRyzen, "AMD Ryzen", 0x5100d1, 0, 0, 1000, PMU_TICKS_RCB },
+ { AMDRyzen2, "AMD Ryzen 2", 0xc2, 0xc6, 0, 250,
+ PMU_TICKS_TAKEN_BRANCHES | PMU_SKIP_INTEL_BUG_CHECK },
};
#define RR_SKID_MAX 1000
@@ -213,6 +216,8 @@ static CpuMicroarch get_cpu_microarch() {
return IntelKabylake;
case 0x30f00:
return AMDF15R30;
+ case 0x70f10:
+ return AMDRyzen2;
case 0x00f10:
if (ext_family == 8) {
if (!Flags::get().suppress_environment_warnings) {
From join_threads on 0xd1, skid 1000: https://gist.githubusercontent.com/Milek7/14c97f3c60e13bef19966cac9ac67f26/raw/f8e892eacc69c7678ca8685648c814d4cd21da95/gistfile1.txt
OK, I guess the foo
test misses some behavior (syscalls or context switches) trigger problems on Zen.
I repeated these tests on an AGESA,version ComboAM4 1.0.0.3ABB based BIOS but the failures are identical.
Hi, on a AMD Ryzen 7 2700 Eight-Core Processor Dell Inspiron 5676 with BIOS Version 1.3.0 / 10/08/2018 / Revision 5.3, and rr 5.2.0-248-g8f785f01e707 built with 'cmake -Ddisable32bit=ON' plus the proposed TAKEN_BRANCHES patch above for Zen 2, but also having added the interrupts event, including raising the skid to 1000, i.e.:
diff --git a/src/PerfCounters.cc b/src/PerfCounters.cc
index d8203ceafd45..67db71fb33d5 100644
--- a/src/PerfCounters.cc
+++ b/src/PerfCounters.cc
@@ -70,6 +70,7 @@ enum CpuMicroarch {
IntelCometlake,
AMDF15R30,
AMDRyzen,
+ AMDRyzen2,
};
/*
@@ -127,6 +128,7 @@ static const PmuConfig pmu_configs[] = {
{ AMDF15R30, "AMD Family 15h Revision 30h", 0xc4, 0xc6, 0, 250,
PMU_TICKS_TAKEN_BRANCHES | PMU_SKIP_INTEL_BUG_CHECK },
{ AMDRyzen, "AMD Ryzen", 0x5100d1, 0, 0, 1000, PMU_TICKS_RCB },
+ { AMDRyzen2, "AMD Ryzen 2", 0xc4, 0xc6, 0x2c, 1000, PMU_TICKS_TAKEN_BRANCHES | PMU_SKIP_INTEL_BUG_CHECK},
};
#define RR_SKID_MAX 1000
@@ -217,6 +219,10 @@ static CpuMicroarch get_cpu_microarch() {
return IntelCometlake;
case 0x30f00:
return AMDF15R30;
+ case 0x00f80: // AMD Ryzen 7 2700 Eight-Core Processor
+ return AMDRyzen2;
case 0x00f10:
if (ext_family == 8) {
if (!Flags::get().suppress_environment_warnings) {
I get:
With 99% tests passed, 13 tests failed out of 1153
Total Test time (real) = 600.45 sec
The following tests FAILED:
220 - join_threads (Failed)
720 - async_signal_syscalls2 (Failed)
730 - block_clone_interrupted (Failed)
766 - condvar_stress (Failed)
910 - string_instructions_multiwatch (Failed)
911 - string_instructions_multiwatch-no-syscallbuf (Failed)
934 - thread_open_race (Failed)
936 - thread_stress (Failed)
937 - thread_stress-no-syscallbuf (Failed)
1016 - checkpoint_prctl_name (Failed)
1068 - record_replay (Failed)
1124 - step1 (Failed)
1125 - step1-no-syscallbuf (Failed)
with 7 tests failing with a ticks assertion failure:
720: -> Assertion `ticks_now == trace_ticks' failed to hold. ticks mismatch for 'SYSCALL: read'; expected 483777, got 483786
936: -> Assertion `ticks_now == trace_ticks' failed to hold. ticks mismatch for 'SYSCALL: munmap'; expected 263066, got 263076
934: -> Assertion `ticks_now == trace_ticks' failed to hold. ticks mismatch for 'SYSCALL: rrcall_notify_syscall_hook_exit'; expected 2290256, got 2290258
766: -> Assertion `ticks_now == trace_ticks' failed to hold. ticks mismatch for 'SYSCALL: futex'; expected 21035, got 21032
1068: -> Assertion `ticks_now == trace_ticks' failed to hold. ticks mismatch for 'SYSCALL: pread64'; expected 4321511, got 4321506
937: -> Assertion `ticks_now == trace_ticks' failed to hold. ticks mismatch for 'SYSCALL: clone'; expected 264716, got 264717
220: -> Assertion `ticks_now == trace_ticks' failed to hold. ticks mismatch for 'SYSCALL: munmap'; expected 54608, got 54613
If I instead switch back to the conditional branches event, like so:
@@ -127,6 +128,7 @@ static const PmuConfig pmu_configs[] = {
{ AMDF15R30, "AMD Family 15h Revision 30h", 0xc4, 0xc6, 0, 250,
PMU_TICKS_TAKEN_BRANCHES | PMU_SKIP_INTEL_BUG_CHECK },
{ AMDRyzen, "AMD Ryzen", 0x5100d1, 0, 0, 1000, PMU_TICKS_RCB },
+ { AMDRyzen2, "AMD Ryzen 2", 0xd1, 0, 0x2c, 1000, PMU_TICKS_RCB | PMU_SKIP_INTEL_BUG_CHECK},
};
#define RR_SKID_MAX 1000
I get slightly better results:
99% tests passed, 11 tests failed out of 1153
Total Test time (real) = 723.75 sec
The following tests FAILED:
110 - exec_from_other_thread (Failed)
720 - async_signal_syscalls2 (Failed)
766 - condvar_stress (Failed)
789 - exec_many-no-syscallbuf (Failed)
859 - ptrace_remote_unmap-no-syscallbuf (Failed)
910 - string_instructions_multiwatch (Failed)
911 - string_instructions_multiwatch-no-syscallbuf (Failed)
934 - thread_open_race (Failed)
936 - thread_stress (Failed)
1124 - step1 (Failed)
1125 - step1-no-syscallbuf (Failed)
with five tests failing the ticks assertion, with a couple of them just off-by-one:
110: -> Assertion `ticks_now == trace_ticks' failed to hold. ticks mismatch for 'SYSCALL: set_robust_list'; expected 97, got 98
720: -> Assertion `ticks_now == trace_ticks' failed to hold. ticks mismatch for 'SYSCALL: rrcall_notify_syscall_hook_exit'; expected 1951292, got 1951306
789: -> Assertion `ticks_now == trace_ticks' failed to hold. ticks mismatch for 'SYSCALL: execve'; expected 2676461, got 2676471
766: -> Assertion `ticks_now == trace_ticks' failed to hold. ticks mismatch for 'SYSCALL: futex'; expected 9667, got 9669
936: -> Assertion `ticks_now == trace_ticks' failed to hold. ticks mismatch for 'SYSCALL: clone'; expected 71083, got 71082
I realize there is a lot of event-overflow-to-interrupt-execution skid on Ryzen processors, but those off-by-ones make me suspicious of a possible off-by-one in the condition of the assertion, e.g., Ryzen's interrupt counting event may include the overflow interrupt itself, whereas another vendor's may not.
A couple more things, upon having read this discussion:
the errata are about absolute count precision, and on unrelated events, as was correctly pointed out. However it is important to note that the errata are not about the general event skid, which is the problem rr has, I believe, with its ticks assertions.
I happened to notice rr commit 828f3f60acf5df252e41f36f1c4a968839e812fa ("Assert if a perfcounter interrupt is dropped") is based on a commit to the v4.12 kernel:
https://github.com/torvalds/linux/commit/cc1582c231ea
that had been reverted before v4.12 was released:
https://github.com/torvalds/linux/commit/6a8a75f3235724c
I see some follow up discussions, e.g.,:
https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1713339.html
but am not sure what the current status quo is. In any case, the kernel itself may or may not continue to be involved in count fluctuations, and I'm not sure if the rr commit should be amended.
Thanks, Kim
Thanks for your comments!
I realize there is a lot of event-overflow-to-interrupt-execution skid on Ryzen processors, but those off-by-ones make me suspicious of a possible off-by-one in the condition of the assertion, e.g., Ryzen's interrupt counting event may include the overflow interrupt itself, whereas another vendor's may not.
We don't need to worry about the skid.
I'm not sure what you mean by "Ryzen's interrupt counting event may include the overflow interrupt itself". We don't count interrupts. In RCB mode, we count the RCBs retired and set the counter to interrupt after a certain count has been reached. The errors you're seeing are all about the RCB counting functionality. We are assuming that the RCB count is not affected by when or if the interrupt fires.
The results you reported indicate that the event counts measured between system calls are varying a little bit between recording and replay. It would be interesting to know whether the reported results are due to an overcount or an undercount. If you send us your failed traces (packed with rr pack
) we have a closed-source binary instrumentation tool that can tell us whether it's an overcount or undercount. (The RCB test failures are probably the best ones to use.)
However it is important to note that the errata are not about the general event skid, which is the problem rr has, I believe, with its ticks assertions.
The errors you hit are unrelated to skid. Skid only comes into play when we're replaying towards an asynchronous event. The errors you reported happened when we were replaying to a synchronous event, in which case we are not configuring the counter to interrupt. (If you haven't read https://arxiv.org/abs/1705.05937, that is a good technical overview of how rr works.)
I happened to notice rr commit 828f3f6 ("Assert if a perfcounter interrupt is dropped") is based on a commit to the v4.12 kernel:
Yeah, we may not need that patch anymore, but I don't think there's anything wrong with it. The assertions it added should always be satisfied.
You're right that the bug, whatever it is, could be in the kernel and not the hardware.
AMD's precise sampling technology IBS (Instruction Based Sampling) would likely not help rr if rr is only interested in event counts
That makes sense. Thanks.
If you send us your failed traces (packed with rr pack) we have a closed-source binary instrumentation tool that can tell us whether it's an overcount or undercount.
If you do send us the failed traces, then please also make sure we have the full test output log because we'll need the full assertion failure text for each failed trace.
Tested on AGESA 1.0.0.4: still broken in the same way.
Huh. Strange. It should be 0x200000000 precisely, rather than 0x200001f81
c2 - c6 is, in fact, exactly 0x200000000, if you factor out ld.so by compiling with gcc -nostdlib -static -o foo foo.s
. And d1 is exactly 0x100000000.
The same executable that shows a deviation (without -static
) produces the same kind of deviation on an Intel CPU (Kaby Lake).
Now, despite this stability, rr record ./foo
with the -static
-built foo still fails with
Assertion `!counting_period || interrupt_val <= adjusted_counting_period' failed to hold. Detected 378738 ticks, expected no more than 377211
(with varying number of ticks) So I suspect something like ptrace is interfering.
While I'm here, for the record, Threadripper 3970X is cpu_type 0x30f10.
I should add, this is with kernel 5.3.9
I've been following this thread, and I wanted to add that while I don't have time/the experience to work on the issue, I do have a 1950X and I'm willing to help test whenever it is needed if a clear test procedure is given. I may even be able to set up a sandboxed user that you can SSH in and use directly to avoid comment synchronization if that helps.
Assertion `!counting_period || interrupt_val <= adjusted_counting_period' failed to hold. Detected 378738 ticks, expected no more than 377211
I think this error is caused by too small skid counter, increasing it should help with that issue. Problem is that it still runs unreliably, tests are randomly failing on ticks_now == trace_ticks
assertion.
Ryzen 3700x here, willing to help but completely ignorant to the internals of rr. Let me know how I can help. I convinced my team to use rr a year ago and I recently convinced them to switch to Ryzen, and just found out they're incompatible.
I've got a 2700x, also willing to help
Ryzen 3800X here also willing to help.
Thanks for the support, but we can't really do anything until we get the attention of AMD people like @kimphillamd again.
Thanks for the support, but we can't really do anything until we get the attention of AMD people like @kimphillamd again.
Hi, sorry, I got reassigned and am working on something else at the moment.
For those offering their help, I believe rocallahan had asked for failed traces - packed with rr pack, including the full assertion failure text for each failed trace - to find out under vs. over counts. I unfortunately never got around to doing that, and certainly don't know any under vs. over answers.
rocallahan, AMD's perfctr h/w implementation differs from modern Intel's in that not all active counters freeze when one overflows, so events may still be being counted by the time the PMU driver software manually turns them off, and after it turns them back on again. Only way to fix that is to use a single event, and tolerate the overhead. Note that there have been perfctr NMI handler fixes to the kernel lately, too. If you do indeed track this down to a problem with the latest kernels, by all means, let me know!
Since there's activity here, I'll leave a quick comment, although I originally wanted to wait until I get more fresh and complete information with the latest kernel.
On 3970X, I have been successfully using rr on single-threaded programs. I've done that on enough different programs that just having Ryzen support enabled in rr with a big warning would already be useful. I've been using 0xd1 and a skid size of 2000.
As far as the rr testcases are concerned, I haven't run them in a while, but last time I did some were still failing.
Something that might be worth noting: when running perf stat
and getting the counters for 0xd1 running on the small test assembly from @pipcet but with more iterations, the results are actually very accurate, even when context-switches were observed during the run, as per perf stat output. However, when doing a perf record
of those events, and a perf script
, the total count is lower than the expected count, which seems weird. Doing the same on an Intel CPU shows a total count that is higher than the expected count. Overshooting seems ok, but undershooting is really weird. That's one of the things I wanted to look more into before sending a message.
Also, FWIW, I've been able to reduce the number of failures of the join_threads test by re-running the replay from the same trace. rr could presumably do that automatically. But some of the errors stem from counts being off at recording time rather than replay time, and retrying the replay in those cases doesn't help.
rocallahan, AMD's perfctr h/w implementation differs from modern Intel's in that not all active counters freeze when one overflows, so events may still be being counted by the time the PMU driver software manually turns them off, and after it turns them back on again. Only way to fix that is to use a single event, and tolerate the overhead.
We configure the counters to count only user-space events, so when the PMU interrupt fires no more events should be counted even if active counters aren't "frozen".
Also, AFAICT for Ryzen we already only configure one counter.
I've done that on enough different programs that just having Ryzen support enabled in rr with a big warning would already be useful. I've been using 0xd1 and a skid size of 2000.
Isn't that already the current state of master? With a skid size of 1000.
However, when doing a perf record of those events, and a perf script, the total count is lower than the expected count, which seems weird.
Yes, that might be worth looking into.
@glandium can you run the rr tests and send me traces for those that fail with a ticks mismatch, along with the actual failure assertion(s)?
Isn't that already the current state of master?
Only a few Ryzens are supported by pmu_configs/get_cpu_microarch.
With a skid size of 1000.
For some reason, that was not enough for me.
can you run the rr tests and send me traces for those that fail with a ticks mismatch, along with the actual failure assertion(s)?
Sure, what do you need me to run specifically, and what files are you interested in?
Only a few Ryzens are supported by pmu_configs/get_cpu_microarch.
OK, feel free to send me a PR :-)
For some reason, that was not enough for me.
I wonder if this is just papering over real nondeterminism ... but OK.
If you clear out /tmp/rr-test-*
and then run
cd .../rr/obj; ctest -j10 --verbose > /tmp/output
you will presumably get some errors like Assertion
ticks_now == trace_ticks' failed to hold. ticks mismatch for '...'; expected . We'll need that output and any left-behind
/tmp/rr-test-*` directories zipped up.
Thanks!
We'll need that output and any left-behind
AFAICT, that's also in record.err or replay.err in /tmp/rr-test-*, so I just zipped all of that. Will send it your way on Matrix.
Ok, I looked into all 8 test failures from @glandium's test run that failed the ticks_now == trace_ticks
assertion. All 8 were overcounts by 1 or 2 events. 5 overcounted during recording, 3 overcounted during replay (and thus could probably succeed if they were replayed again on the AMD hardware or if we implemented fancy checkpoint and recovery logic).
I don't see any pattern in the execution leading up the divergence. Sometimes it's after executing lots of instructions since the last syscall, sometimes only a few.
So unfortunately I don't think we know much more than we did before, except that I'm confident that the immediate problem is a very rare intermittent overcount by 1 or 2 (maybe sometimes more if we looked at more tests).
(It's possible that when we get an overcount by 2 we're getting the same glitch twice, although if the glitches are independent I'd expect the probability of each glitch is so low we would never see 2 glitches in the same event. In fact 4 out of 8 failures were overcounts by 2.)
OK so here some experiments anyone can try. Assuming as new-ish a kernel and firmware updates etc as possible.
1) Test the overcount hypothesis. Run the rr tests (https://github.com/mozilla/rr/wiki/Building-And-Installing#tests) using make check
for full output. Focus on the tests that failed with a message like
Assertion `ticks_now == trace_ticks' failed to hold. ticks mismatch for '...'; expected 2716819, got 2716817
If "got" is greater than "expected", try replaying the test recording using rr replay -a /tmp/rr-test-....../latest-trace
. If the overcount hypothesis is correct, it will succeed. If it doesn't succeed after you've tried replaying it a few times, let us know!
If "got" is always greater than "expected", or "expected" always greater than "got", after several runs of the full test suite, please let us know that as well. And of course let us know if you don't get any test failures.
2) Retry 1), booting with hyperthreading disabled.
3) Retry 1), booting with hyperthreading disabled and only one core enabled.
4) Retry 1), booting with hyperthreading disabled, one core enabled, and NMI watchdog disabled.
5) Anything else you can disable that might simplify the system.
If none of that helps at all, then that's worth reporting too I guess. Thanks!
BTW you will probably need a patch like this. Adjust the CPUID values as needed.
diff --git a/src/PerfCounters.cc b/src/PerfCounters.cc
index 7da26a51..28cb2d8b 100644
--- a/src/PerfCounters.cc
+++ b/src/PerfCounters.cc
@@ -70,6 +70,7 @@ enum CpuMicroarch {
IntelCometlake,
AMDF15R30,
AMDRyzen,
+ AMDRyzen2,
};
/*
@@ -137,9 +138,10 @@ static const PmuConfig pmu_configs[] = {
{ AMDF15R30, "AMD Family 15h Revision 30h", 0xc4, 0xc6, 0, 250,
PMU_TICKS_TAKEN_BRANCHES | PMU_SKIP_INTEL_BUG_CHECK },
{ AMDRyzen, "AMD Ryzen", 0x5100d1, 0, 0, 1000, PMU_TICKS_RCB },
+ { AMDRyzen2, "AMD Ryzen 2", 0xd1, 0, 0x2c, 2000, PMU_TICKS_RCB | PMU_SKIP_INTEL_BUG_CHECK },
};
-#define RR_SKID_MAX 1000
+#define RR_SKID_MAX 10000
static string lowercase(const string& s) {
string c = s;
@@ -227,6 +229,8 @@ static CpuMicroarch get_cpu_microarch() {
return IntelCometlake;
case 0x30f00:
return AMDF15R30;
+ case 0x30f10: // AMD Ryzen Threadripper 3970X
+ return AMDRyzen2;
case 0x00f10:
if (ext_family == 8) {
if (!Flags::get().suppress_environment_warnings) {
Things have evolved on my end. I upgraded both the BIOS and the kernel, and now most errors are Assertion `!counting_period || interrupt_val <= adjusted_counting_period' failed to hold.
rather than Assertion `ticks_now == trace_ticks' failed to hold
although some of those remain. The kernel logs don't seem to indicate a microcode version difference, but going back to the old kernel doesn't return to the old behavior either. Another error that I hadn't noticed before is Assertion `stable_exit' failed to hold. PTRACE_O_TRACEEXIT only supported for stable exits for now
, but that one is an assert during record.
None of disabling hyperthreading, cores and NMI watchdog make a difference.
rr replay
'ing a trace that failed with ticks_now == trace_ticks
does work most of the time.
rr replay
'ing a trace that failed with !counting_period || interrupt_val <= adjusted_counting_period
doesn't work.
Increasing the skid even further than 2000 "fixes" the !counting_period || interrupt_val <= adjusted_counting_period
ones.
It's also worth noting that a bunch of tests are actually timing out rather than failing with an assert.
Ryzen has a conditional branch counter. I have patches to use it here: https://github.com/mozilla/rr/tree/ryzen
To make it work reliably I had to increase the skid counter to 1000. That's pretty high, but OK. The patches make the skid size configurable per-architecture so we don't take that hit on Intel.
With these patches, most tests pass and the rest seem to be intermittent. In one run I get 10 failures out of 2068:
It appears that all these failures are due to intermittent overcounting. In most of them, during recording we seem to have overcounted a few conditional branches in the leadup to some syscall. In the rest, we seem to have overcounted during replay.
One interesting thing is that most of the syscalls where we detect the overcount are an
mprotect
(or a syscall following a syscall-bufferedmprotect
) that followed anmmap
. There are two exceptions, one aread
syscall and one awrite
syscall. I need to think about what this might mean.