rr-debugger / rr

Record and Replay Framework
http://rr-project.org/

Support AMD Ryzen? #2034

Closed. rocallahan closed this issue 4 years ago.

rocallahan commented 7 years ago

Ryzen has a conditional branch counter. I have patches to use it here: https://github.com/mozilla/rr/tree/ryzen

To make it work reliably I had to increase the skid size to 1000. That's pretty high, but OK. The patches make the skid size configurable per-architecture so we don't take that hit on Intel.
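
For context, the per-architecture configuration is just a table in src/PerfCounters.cc. Below is a rough, annotated sketch of what each entry carries; the field names and the Intel remark are inferred from the pmu_configs diffs quoted later in this thread, not copied from the real source.

    // Rough sketch only: field names are inferred from the pmu_configs
    // entries shown in the diffs later in this thread, not copied from
    // the real src/PerfCounters.cc.
    #include <cstdint>

    // Placeholder flag values; rr defines its own.
    constexpr uint32_t PMU_TICKS_RCB = 1 << 0;            // ticks = retired conditional branches
    constexpr uint32_t PMU_TICKS_TAKEN_BRANCHES = 1 << 1; // ticks = taken branches minus a second event

    struct PmuConfigSketch {
      const char* name;            // microarchitecture name
      unsigned ticks_event;        // raw PMU event used as the tick counter
      unsigned minus_ticks_event;  // event subtracted from the first (0 if unused)
      uint32_t skid_size;          // per-architecture interrupt-skid allowance
      uint32_t flags;
    };

    static const PmuConfigSketch pmu_configs_sketch[] = {
      // Ryzen needs the large 1000-tick skid discussed above; Intel entries
      // keep a much smaller value, so they don't pay this cost.
      { "AMD Ryzen", 0x5100d1, 0, 1000, PMU_TICKS_RCB },
    };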

With these patches, most tests pass and the rest seem to be intermittent. In one run I get 10 failures out of 2068:

        327 - proc_fds (Failed)
        388 - ptracer_death_multithread_peer-no-syscallbuf (Failed)
        688 - condvar_stress-no-syscallbuf (Failed)
        835 - thread_stress (Failed)
        904 - checkpoint_prctl_name-no-syscallbuf (Failed)
        1362 - proc_fds-32-no-syscallbuf (Failed)
        1679 - async_signal_syscalls2-32 (Failed)
        1722 - condvar_stress-32-no-syscallbuf (Failed)
        1869 - thread_stress-32 (Failed)
        1870 - thread_stress-32-no-syscallbuf (Failed)

It appears that all these failures are due to intermittent overcounting. In most of them, during recording we seem to have overcounted a few conditional branches in the lead-up to some syscall. In the rest, we seem to have overcounted during replay.

One interesting thing is that most of the syscalls where we detect the overcount are an mprotect (or a syscall following a syscall-buffered mprotect) that followed an mmap. There are two exceptions, one a read syscall and one a write syscall. I need to think about what this might mean.

gcp commented 5 years ago

Results from Zen 2 / Ryzen 9 3900X Matisse:

root@ubuntu:~# for i in $(seq 1 10); do perf stat -e 'cpu/event=0xc2/u,cpu/event=0xc4/u,cpu/event=0xc6/u' ./foo; done
./foo: Segmentation fault

 Performance counter stats for './foo':

        8589939046      cpu/event=0xc2/u                                            
        8589937056      cpu/event=0xc4/u                                            
               504      cpu/event=0xc6/u                                            

       2.019423217 seconds time elapsed

./foo: Segmentation fault

 Performance counter stats for './foo':

        8589939045      cpu/event=0xc2/u                                            
        8589937055      cpu/event=0xc4/u                                            
               503      cpu/event=0xc6/u                                            

       2.013937315 seconds time elapsed

./foo: Segmentation fault

 Performance counter stats for './foo':

        8589939043      cpu/event=0xc2/u                                            
        8589937053      cpu/event=0xc4/u                                            
               501      cpu/event=0xc6/u                                            

       2.012184478 seconds time elapsed

./foo: Segmentation fault

 Performance counter stats for './foo':

        8589939045      cpu/event=0xc2/u                                            
        8589937055      cpu/event=0xc4/u                                            
               503      cpu/event=0xc6/u                                            

       2.015136473 seconds time elapsed

./foo: Segmentation fault

 Performance counter stats for './foo':

        8589939066      cpu/event=0xc2/u                                            
        8589937076      cpu/event=0xc4/u                                            
               524      cpu/event=0xc6/u                                            

       2.106237815 seconds time elapsed

./foo: Segmentation fault

 Performance counter stats for './foo':

        8589939060      cpu/event=0xc2/u                                            
        8589937070      cpu/event=0xc4/u                                            
               518      cpu/event=0xc6/u                                            

       2.086466676 seconds time elapsed

./foo: Segmentation fault

 Performance counter stats for './foo':

        8589939044      cpu/event=0xc2/u                                            
        8589937054      cpu/event=0xc4/u                                            
               502      cpu/event=0xc6/u                                            

       2.020398949 seconds time elapsed

./foo: Segmentation fault

 Performance counter stats for './foo':

        8589939065      cpu/event=0xc2/u                                            
        8589937075      cpu/event=0xc4/u                                            
               523      cpu/event=0xc6/u                                            

       2.098308800 seconds time elapsed

./foo: Segmentation fault

 Performance counter stats for './foo':

        8589939063      cpu/event=0xc2/u                                            
        8589937073      cpu/event=0xc4/u                                            
               521      cpu/event=0xc6/u                                            

       2.096092893 seconds time elapsed

./foo: Segmentation fault

 Performance counter stats for './foo':

        8589939065      cpu/event=0xc2/u                                            
        8589937075      cpu/event=0xc4/u                                            
               523      cpu/event=0xc6/u                                            

       2.096787257 seconds time elapsed

gcp commented 5 years ago

When I run make test, the first 773 tests seem to fail (except the first environment check), and then it seems to hang.

khuey commented 5 years ago

What kernel version did you collect those results on?

gcp commented 5 years ago

Ubuntu 18.04 LTS. Newer ones wouldn't boot, maybe due to the RDRAND fuckup or something - even though this BIOS is barely a week old, sigh. That is using kernel 4.15.0.

jld commented 5 years ago

FYI, if it is the systemd RDRAND issue, Debian should work — the workaround is in version 241-4, which is in stable and newer.

gcp commented 5 years ago

I'm guessing it is; I can try Debian if needed. I originally wrote 4.5, but that should of course have been 4.15, as that's what the 18.04.2 live CD had. I realized this when reading the news that it was just updated to 5.0.

Anyway, if I need to retest something specific, let me know.

gcp commented 5 years ago

@khuey @pipcet Sorry for nagging but is there anything further I can/should do to confirm/deny whether rr will be usable on Zen 2?

khuey commented 5 years ago

It would be useful to know if the results from that test binary are different on the latest kernel.

gcp commented 5 years ago

Linux ubuntu 5.0.0-23-generic #24~18.04.1-Ubuntu SMP Mon Jul 29 16:12:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

./foo: Segmentation fault

 Performance counter stats for './foo':

        8589943198      cpu/event=0xc2/u                                            
        8589939031      cpu/event=0xc4/u                                            
               498      cpu/event=0xc6/u                                            

       1.976349693 seconds time elapsed

       1.909500000 seconds user
       0.000000000 seconds sys

./foo: Segmentation fault

 Performance counter stats for './foo':

        8589943203      cpu/event=0xc2/u                                            
        8589939036      cpu/event=0xc4/u                                            
               503      cpu/event=0xc6/u                                            

       1.995666718 seconds time elapsed

       1.927300000 seconds user
       0.000000000 seconds sys

./foo: Segmentation fault

 Performance counter stats for './foo':

        8589943200      cpu/event=0xc2/u                                            
        8589939033      cpu/event=0xc4/u                                            
               500      cpu/event=0xc6/u                                            

       1.987421704 seconds time elapsed

       1.920565000 seconds user
       0.000000000 seconds sys

./foo: Segmentation fault

 Performance counter stats for './foo':

        8589943219      cpu/event=0xc2/u                                            
        8589939052      cpu/event=0xc4/u                                            
               519      cpu/event=0xc6/u                                            

       2.060155899 seconds time elapsed

       1.989380000 seconds user
       0.000000000 seconds sys

./foo: Segmentation fault

 Performance counter stats for './foo':

        8589943218      cpu/event=0xc2/u                                            
        8589939051      cpu/event=0xc4/u                                            
               518      cpu/event=0xc6/u                                            

       2.056097273 seconds time elapsed

       1.986810000 seconds user
       0.000000000 seconds sys

./foo: Segmentation fault

 Performance counter stats for './foo':

        8589943205      cpu/event=0xc2/u                                            
        8589939038      cpu/event=0xc4/u                                            
               505      cpu/event=0xc6/u                                            

       2.002450342 seconds time elapsed

       1.935343000 seconds user
       0.000000000 seconds sys

./foo: Segmentation fault

 Performance counter stats for './foo':

        8589943225      cpu/event=0xc2/u                                            
        8589939058      cpu/event=0xc4/u                                            
               525      cpu/event=0xc6/u                                            

       2.085942737 seconds time elapsed

       2.017330000 seconds user
       0.000000000 seconds sys

./foo: Segmentation fault

 Performance counter stats for './foo':

        8589943230      cpu/event=0xc2/u                                            
        8589939063      cpu/event=0xc4/u                                            
               530      cpu/event=0xc6/u                                            

       2.088709423 seconds time elapsed

       2.018087000 seconds user
       0.000000000 seconds sys

./foo: Segmentation fault

 Performance counter stats for './foo':

        8589943220      cpu/event=0xc2/u                                            
        8589939053      cpu/event=0xc4/u                                            
               520      cpu/event=0xc6/u                                            

       2.060706893 seconds time elapsed

       1.993520000 seconds user
       0.000000000 seconds sys

./foo: Segmentation fault

 Performance counter stats for './foo':

        8589943205      cpu/event=0xc2/u                                            
        8589939038      cpu/event=0xc4/u                                            
               505      cpu/event=0xc6/u                                            

       1.998164297 seconds time elapsed

       1.931053000 seconds user
       0.000000000 seconds sys

gcp commented 5 years ago

Looks like c2-c6, c4-c6 and c2-c4 are all constant?

gcp commented 5 years ago

@khuey @pipcet Anything more that can be tested? All tests except the first seem to fail.

khuey commented 5 years ago

@rocallahan?

rocallahan commented 5 years ago

Sorry ... catching up here.

@gcp those results suggest that 0xc2 - 0xc6 might be usable (Retired Branch Instructions - Retired Far Control Transfers). Are you able to hack that into src/PerfCounters.cc?

All tests except the first seem to fail.

Can you provide the verbose test log (ctest --verbose)?

rocallahan commented 5 years ago

Also, can you rerun that foo test with 0xD1?

gcp commented 5 years ago

ubuntu@ubuntu:~$ for i in $(seq 1 10); do perf stat -e 'cpu/event=0xc2/u,cpu/event=0xc6/u,cpu/event=0xd1/u' ./foo; done
./foo: Segmentation fault

 Performance counter stats for './foo':

        8589943105      cpu/event=0xc2/u                                            
               521      cpu/event=0xc6/u                                            
        4294974925      cpu/event=0xd1/u                                            

       2.067338714 seconds time elapsed

       1.998034000 seconds user
       0.000000000 seconds sys

./foo: Segmentation fault

 Performance counter stats for './foo':

        8589943091      cpu/event=0xc2/u                                            
               507      cpu/event=0xc6/u                                            
        4294974925      cpu/event=0xd1/u                                            

       2.013885457 seconds time elapsed

       1.944788000 seconds user
       0.000000000 seconds sys

./foo: Segmentation fault

 Performance counter stats for './foo':

        8589943093      cpu/event=0xc2/u                                            
               509      cpu/event=0xc6/u                                            
        4294974925      cpu/event=0xd1/u                                            

       2.011394774 seconds time elapsed

       1.943029000 seconds user
       0.000000000 seconds sys

./foo: Segmentation fault

 Performance counter stats for './foo':

        8589943090      cpu/event=0xc2/u                                            
               506      cpu/event=0xc6/u                                            
        4294974925      cpu/event=0xd1/u                                            

       2.005370975 seconds time elapsed

       1.938921000 seconds user
       0.000000000 seconds sys

Jackpot I guess?!

gcp commented 5 years ago

Can you provide the verbose test log (ctest --verbose)?

After inspecting the code, it's just bailing out because it doesn't recognize the CPU model. I patched that up, made a Ryzen 2 entry based on the Ryzen 1 one (which seems to be using 0xd1?), and that is running tests now. Looks like most pass, but there are still failures.

Milek7 commented 5 years ago

It doesn't seem to make any difference with the join_threads parallel test: around 15 failures (out of 200) with 0x5100d1, 0xd1, 0xc4 - 0xc6, and 0xc2 - 0xc6 alike.

gcp commented 5 years ago

Same. Every combination has some tests intermittently fail. Typically alarm, thread join, mremap, etc. I guess Zen 2 is broken the same way as previous AMD CPUs.

khuey commented 5 years ago

It would still be nice to get a log from one of those failures.

gcp commented 5 years ago

With 0xd1

https://paste.rs/2ZS

gcp commented 5 years ago

diff --git a/src/PerfCounters.cc b/src/PerfCounters.cc
index b1882fc6..07e2c790 100644
--- a/src/PerfCounters.cc
+++ b/src/PerfCounters.cc
@@ -69,6 +69,7 @@ enum CpuMicroarch {
   IntelKabylake,
   AMDF15R30,
   AMDRyzen,
+  AMDRyzen2,
 };

 /*
@@ -125,6 +126,8 @@ static const PmuConfig pmu_configs[] = {
   { AMDF15R30, "AMD Family 15h Revision 30h", 0xc4, 0xc6, 0, 250,
     PMU_TICKS_TAKEN_BRANCHES | PMU_SKIP_INTEL_BUG_CHECK },
   { AMDRyzen, "AMD Ryzen", 0x5100d1, 0, 0, 1000, PMU_TICKS_RCB },
+  { AMDRyzen2, "AMD Ryzen 2", 0xc2, 0xc6, 0, 250,
+    PMU_TICKS_TAKEN_BRANCHES | PMU_SKIP_INTEL_BUG_CHECK },
 };

 #define RR_SKID_MAX 1000
@@ -213,6 +216,8 @@ static CpuMicroarch get_cpu_microarch() {
       return IntelKabylake;
     case 0x30f00:
       return AMDF15R30;
+    case 0x70f10:
+      return AMDRyzen2;
     case 0x00f10:
       if (ext_family == 8) {
         if (!Flags::get().suppress_environment_warnings) {

https://paste.rs/IbM

gcp commented 5 years ago

0xc2 ~0xc6~ 0xc4

alarm https://paste.rs/uRj

block https://paste.rs/IGy

Milek7 commented 5 years ago

From join_threads on 0xd1, skid 1000: https://gist.githubusercontent.com/Milek7/14c97f3c60e13bef19966cac9ac67f26/raw/f8e892eacc69c7678ca8685648c814d4cd21da95/gistfile1.txt

rocallahan commented 5 years ago

OK, I guess the foo test misses some behavior (syscalls or context switches) that triggers problems on Zen.

gcp commented 5 years ago

I repeated these tests on a BIOS based on AGESA ComboAM4 1.0.0.3ABB, but the failures are identical.

kimphillamd commented 4 years ago

Hi, on an AMD Ryzen 7 2700 Eight-Core Processor (Dell Inspiron 5676 with BIOS Version 1.3.0 / 10/08/2018 / Revision 5.3), running rr 5.2.0-248-g8f785f01e707 built with 'cmake -Ddisable32bit=ON' plus the proposed TAKEN_BRANCHES patch above for Zen 2, but with the interrupts event added and the skid raised to 1000, i.e.:

diff --git a/src/PerfCounters.cc b/src/PerfCounters.cc
index d8203ceafd45..67db71fb33d5 100644
--- a/src/PerfCounters.cc
+++ b/src/PerfCounters.cc
@@ -70,6 +70,7 @@ enum CpuMicroarch {
   IntelCometlake,
   AMDF15R30,
   AMDRyzen,
+  AMDRyzen2,
 };

 /*
@@ -127,6 +128,7 @@ static const PmuConfig pmu_configs[] = {
   { AMDF15R30, "AMD Family 15h Revision 30h", 0xc4, 0xc6, 0, 250,
     PMU_TICKS_TAKEN_BRANCHES | PMU_SKIP_INTEL_BUG_CHECK },
   { AMDRyzen, "AMD Ryzen", 0x5100d1, 0, 0, 1000, PMU_TICKS_RCB },
+  { AMDRyzen2, "AMD Ryzen 2", 0xc4, 0xc6, 0x2c, 1000, PMU_TICKS_TAKEN_BRANCHES | PMU_SKIP_INTEL_BUG_CHECK},
 };

 #define RR_SKID_MAX 1000
@@ -217,6 +219,10 @@ static CpuMicroarch get_cpu_microarch() {
        return IntelCometlake;
     case 0x30f00:
       return AMDF15R30;
+    case 0x00f80: // AMD Ryzen 7 2700 Eight-Core Processor
+      return AMDRyzen2;
     case 0x00f10:
       if (ext_family == 8) {
         if (!Flags::get().suppress_environment_warnings) {

I get:

99% tests passed, 13 tests failed out of 1153

Total Test time (real) = 600.45 sec

The following tests FAILED:
    220 - join_threads (Failed)
    720 - async_signal_syscalls2 (Failed)
    730 - block_clone_interrupted (Failed)
    766 - condvar_stress (Failed)
    910 - string_instructions_multiwatch (Failed)
    911 - string_instructions_multiwatch-no-syscallbuf (Failed)
    934 - thread_open_race (Failed)
    936 - thread_stress (Failed)
    937 - thread_stress-no-syscallbuf (Failed)
    1016 - checkpoint_prctl_name (Failed)
    1068 - record_replay (Failed)
    1124 - step1 (Failed)
    1125 - step1-no-syscallbuf (Failed)

with 7 tests failing with a ticks assertion failure:

720:  -> Assertion `ticks_now == trace_ticks' failed to hold. ticks mismatch for 'SYSCALL: read'; expected 483777, got 483786
936:  -> Assertion `ticks_now == trace_ticks' failed to hold. ticks mismatch for 'SYSCALL: munmap'; expected 263066, got 263076
934:  -> Assertion `ticks_now == trace_ticks' failed to hold. ticks mismatch for 'SYSCALL: rrcall_notify_syscall_hook_exit'; expected 2290256, got 2290258
766:  -> Assertion `ticks_now == trace_ticks' failed to hold. ticks mismatch for 'SYSCALL: futex'; expected 21035, got 21032
1068:  -> Assertion `ticks_now == trace_ticks' failed to hold. ticks mismatch for 'SYSCALL: pread64'; expected 4321511, got 4321506
937:  -> Assertion `ticks_now == trace_ticks' failed to hold. ticks mismatch for 'SYSCALL: clone'; expected 264716, got 264717
220:  -> Assertion `ticks_now == trace_ticks' failed to hold. ticks mismatch for 'SYSCALL: munmap'; expected 54608, got 54613

If I instead switch back to the conditional branches event, like so:

@@ -127,6 +128,7 @@ static const PmuConfig pmu_configs[] = {
   { AMDF15R30, "AMD Family 15h Revision 30h", 0xc4, 0xc6, 0, 250,
     PMU_TICKS_TAKEN_BRANCHES | PMU_SKIP_INTEL_BUG_CHECK },
   { AMDRyzen, "AMD Ryzen", 0x5100d1, 0, 0, 1000, PMU_TICKS_RCB },
+  { AMDRyzen2, "AMD Ryzen 2", 0xd1, 0, 0x2c, 1000, PMU_TICKS_RCB | PMU_SKIP_INTEL_BUG_CHECK},
 };

 #define RR_SKID_MAX 1000

I get slightly better results:

99% tests passed, 11 tests failed out of 1153

Total Test time (real) = 723.75 sec

The following tests FAILED:
    110 - exec_from_other_thread (Failed)
    720 - async_signal_syscalls2 (Failed)
    766 - condvar_stress (Failed)
    789 - exec_many-no-syscallbuf (Failed)
    859 - ptrace_remote_unmap-no-syscallbuf (Failed)
    910 - string_instructions_multiwatch (Failed)
    911 - string_instructions_multiwatch-no-syscallbuf (Failed)
    934 - thread_open_race (Failed)
    936 - thread_stress (Failed)
    1124 - step1 (Failed)
    1125 - step1-no-syscallbuf (Failed)

with five tests failing the ticks assertion, with a couple of them just off-by-one:

110:  -> Assertion `ticks_now == trace_ticks' failed to hold. ticks mismatch for 'SYSCALL: set_robust_list'; expected 97, got 98
720:  -> Assertion `ticks_now == trace_ticks' failed to hold. ticks mismatch for 'SYSCALL: rrcall_notify_syscall_hook_exit'; expected 1951292, got 1951306
789:  -> Assertion `ticks_now == trace_ticks' failed to hold. ticks mismatch for 'SYSCALL: execve'; expected 2676461, got 2676471
766:  -> Assertion `ticks_now == trace_ticks' failed to hold. ticks mismatch for 'SYSCALL: futex'; expected 9667, got 9669
936:  -> Assertion `ticks_now == trace_ticks' failed to hold. ticks mismatch for 'SYSCALL: clone'; expected 71083, got 71082

I realize there is a lot of event-overflow-to-interrupt-execution skid on Ryzen processors, but those off-by-ones make me suspicious of a possible off-by-one in the condition of the assertion, e.g., Ryzen's interrupt counting event may include the overflow interrupt itself, whereas another vendor's may not.

A couple more things, upon having read this discussion:

I happened to notice rr commit 828f3f6 ("Assert if a perfcounter interrupt is dropped") is based on a commit to the v4.12 kernel:

https://github.com/torvalds/linux/commit/cc1582c231ea

that had been reverted before v4.12 was released:

https://github.com/torvalds/linux/commit/6a8a75f3235724c

I see some follow-up discussions, e.g.:

https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1713339.html

but am not sure what the current status quo is. In any case, the kernel itself may or may not continue to be involved in count fluctuations, and I'm not sure if the rr commit should be amended.

Also, AMD's precise sampling technology IBS (Instruction Based Sampling) would likely not help rr if rr is only interested in event counts:

https://github.com/torvalds/linux/blob/c74386d50fbaf4a54fd3fe560f1abc709c0cff4b/arch/x86/events/amd/ibs.c#L203

Thanks, Kim

rocallahan commented 4 years ago

Thanks for your comments!

I realize there is a lot of event-overflow-to-interrupt-execution skid on Ryzen processors, but those off-by-ones make me suspicious of a possible off-by-one in the condition of the assertion, e.g., Ryzen's interrupt counting event may include the overflow interrupt itself, whereas another vendor's may not.

We don't need to worry about the skid.

I'm not sure what you mean by "Ryzen's interrupt counting event may include the overflow interrupt itself". We don't count interrupts. In RCB mode, we count the RCBs retired and set the counter to interrupt after a certain count has been reached. The errors you're seeing are all about the RCB counting functionality. We are assuming that the RCB count is not affected by when or if the interrupt fires.
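
To make that counting mode concrete, here is a minimal, self-contained perf_event_open sketch of the setup described above: count a raw event for user-space execution of one thread and ask the kernel for an overflow notification after a chosen period. This is not rr's code; open_ticks_counter is an illustrative helper, and the 0x5100d1 event and the period are just the values discussed in this thread.

    // Minimal sketch (not rr's code) of the counting mode described above:
    // count a raw PMU event for user-space execution of one thread, and ask
    // the kernel for an overflow notification after `period` events.
    #include <linux/perf_event.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <sys/types.h>
    #include <unistd.h>
    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    // Illustrative helper; 0x5100d1 is the Ryzen event discussed in this thread.
    static int open_ticks_counter(pid_t tid, uint64_t period) {
      struct perf_event_attr attr;
      memset(&attr, 0, sizeof(attr));
      attr.size = sizeof(attr);
      attr.type = PERF_TYPE_RAW;
      attr.config = 0x5100d1;      // retired conditional branches (Ryzen)
      attr.exclude_kernel = 1;     // count user-space events only
      attr.exclude_hv = 1;
      attr.sample_period = period; // overflow interrupt after `period` events
      attr.disabled = 1;
      // One specific thread, any CPU, no group, no flags.
      return syscall(SYS_perf_event_open, &attr, tid, -1, -1, 0);
    }

    int main() {
      int fd = open_ticks_counter(getpid(), 1000000);
      if (fd < 0) { perror("perf_event_open"); return 1; }
      ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
      // ... run the code being measured ...
      uint64_t count = 0;
      if (read(fd, &count, sizeof(count)) == (ssize_t)sizeof(count))
        printf("ticks so far: %llu\n", (unsigned long long)count);
      close(fd);
      return 0;
    }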

The results you reported indicate that the event counts measured between system calls are varying a little bit between recording and replay. It would be interesting to know whether the reported results are due to an overcount or an undercount. If you send us your failed traces (packed with rr pack) we have a closed-source binary instrumentation tool that can tell us whether it's an overcount or undercount. (The RCB test failures are probably the best ones to use.)

However it is important to note that the errata are not about the general event skid, which is the problem rr has, I believe, with its ticks assertions.

The errors you hit are unrelated to skid. Skid only comes into play when we're replaying towards an asynchronous event. The errors you reported happened when we were replaying to a synchronous event, in which case we are not configuring the counter to interrupt. (If you haven't read https://arxiv.org/abs/1705.05937, that is a good technical overview of how rr works.)

I happened to notice rr commit 828f3f6 ("Assert if a perfcounter interrupt is dropped") is based on a commit to the v4.12 kernel:

Yeah, we may not need that patch anymore, but I don't think there's anything wrong with it. The assertions it added should always be satisfied.

You're right that the bug, whatever it is, could be in the kernel and not the hardware.

AMD's precise sampling technology IBS (Instruction Based Sampling) would likely not help rr if rr is only interested in event counts

That makes sense. Thanks.

rocallahan commented 4 years ago

If you send us your failed traces (packed with rr pack) we have a closed-source binary instrumentation tool that can tell us whether it's an overcount or undercount.

If you do send us the failed traces, then please also make sure we have the full test output log because we'll need the full assertion failure text for each failed trace.

gcp commented 4 years ago

Tested on AGESA 1.0.0.4: still broken in the same way.

glandium commented 4 years ago

Huh. Strange. It should be 0x200000000 precisely, rather than 0x200001f81

c2 - c6 is, in fact, exactly 0x200000000, if you factor out ld.so by compiling with gcc -nostdlib -static -o foo foo.s. And d1 is exactly 0x100000000. The same executable that shows a deviation (without -static) produces the same kind of deviation on an Intel CPU (Kaby Lake).

Now, despite this stability, rr record ./foo with the -static-built foo still fails with

Assertion `!counting_period || interrupt_val <= adjusted_counting_period' failed to hold. Detected 378738 ticks, expected no more than 377211

(with a varying number of ticks), so I suspect something like ptrace is interfering.

While I'm here, for the record, Threadripper 3970X is cpu_type 0x30f10.
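
For anyone wanting to add their own model: the cpu_type values that get_cpu_microarch switches on look like CPUID leaf 1 EAX with the stepping and extended-family bits masked off. The sketch below shows that decoding; the 0xF0FF0 mask is inferred from the case values quoted in this thread (0x00f10, 0x30f00, 0x70f10, 0x00f80) rather than taken from the source.

    // Hedged sketch: derive a cpu_type-style value (e.g. 0x30f10 for the
    // Threadripper 3970X) from CPUID leaf 1 EAX. The 0xF0FF0 mask is inferred
    // from the case values quoted in this thread, not copied from rr.
    #include <cpuid.h>
    #include <cstdio>

    int main() {
      unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
      __get_cpuid(1, &eax, &ebx, &ecx, &edx);
      unsigned int cpu_type = eax & 0xF0FF0;        // ext-model | family | model
      unsigned int ext_family = (eax >> 20) & 0xff; // 0x8 for family 17h (Zen)
      printf("cpu_type = 0x%05x, ext_family = 0x%x\n", cpu_type, ext_family);
      // Threadripper 3970X: family 0xF, ext family 0x8 (=> family 17h),
      // model 0x1, ext model 0x3 => cpu_type 0x30f10 as reported above.
      return 0;
    }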

glandium commented 4 years ago

I should add, this is with kernel 5.3.9

norcalli commented 4 years ago

I've been following this thread, and I wanted to add that while I don't have the time or experience to work on the issue, I do have a 1950X and I'm willing to help test whenever it is needed, given a clear test procedure. I may even be able to set up a sandboxed user that you can SSH into and use directly, to avoid going back and forth in comments, if that helps.

Milek7 commented 4 years ago

Assertion `!counting_period || interrupt_val <= adjusted_counting_period' failed to hold. Detected 378738 ticks, expected no more than 377211

I think this error is caused by too small a skid size; increasing it should help with that issue. The problem is that it still runs unreliably: tests randomly fail on the ticks_now == trace_ticks assertion.
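
To illustrate the skid point, here is a small self-contained toy (not rr code; all names and numbers are made up) that models why the interrupt has to be requested a skid margin before the target tick count: if the real overshoot can exceed the assumed skid, replay sails past the target and an interrupt_val <= adjusted_counting_period style check trips, while a larger margin leaves the remaining distance to be closed by single-stepping.

    // Toy model of skid handling during replay to a target tick count.
    // Purely illustrative; rr's real logic lives in its replay code.
    #include <cstdint>
    #include <cstdio>
    #include <random>

    static std::mt19937_64 rng(42);

    // Simulated PMU: asked to interrupt after `period` ticks, it actually
    // fires up to `real_skid` ticks late (the "skid").
    static int64_t run_until_interrupt(int64_t now, int64_t period, int64_t real_skid) {
      std::uniform_int_distribution<int64_t> overshoot(0, real_skid);
      return now + period + overshoot(rng);
    }

    // Replay to `target` ticks assuming the skid never exceeds `assumed_skid`.
    // Returns true if we land exactly on target without overshooting it.
    static bool replay_to(int64_t target, int64_t assumed_skid, int64_t real_skid) {
      int64_t now = 0;
      if (target - now > assumed_skid) {
        // Program the interrupt assumed_skid ticks early.
        now = run_until_interrupt(now, target - now - assumed_skid, real_skid);
        if (now > target) return false; // skid larger than assumed: overshot
      }
      while (now < target) now++;       // close the gap by single-stepping
      return now == target;
    }

    int main() {
      const int64_t target = 1000000;
      int failures_small = 0, failures_big = 0;
      for (int i = 0; i < 1000; i++) {
        if (!replay_to(target, /*assumed_skid=*/1000, /*real_skid=*/1500)) failures_small++;
        if (!replay_to(target, /*assumed_skid=*/2000, /*real_skid=*/1500)) failures_big++;
      }
      printf("assumed skid 1000: %d overshoots; assumed skid 2000: %d overshoots\n",
             failures_small, failures_big);
      return 0;
    }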

v-lopez commented 4 years ago

Ryzen 3700X here, willing to help but completely ignorant of the internals of rr. Let me know how I can help. I convinced my team to use rr a year ago, recently convinced them to switch to Ryzen, and just found out the two are incompatible.

evanrichter commented 4 years ago

I've got a 2700x, also willing to help

balbassam commented 4 years ago

Ryzen 3800X here also willing to help.

rocallahan commented 4 years ago

Thanks for the support, but we can't really do anything until we get the attention of AMD people like @kimphillamd again.

kimphillamd commented 4 years ago

Thanks for the support, but we can't really do anything until we get the attention of AMD people like @kimphillamd again.

Hi, sorry, I got reassigned and am working on something else at the moment.

For those offering their help, I believe rocallahan had asked for failed traces (packed with rr pack, including the full assertion failure text for each failed trace) to determine whether the failures are undercounts or overcounts. I unfortunately never got around to doing that, and certainly don't have any under- vs. overcount answers.

rocallahan, AMD's perfctr hardware implementation differs from modern Intel's in that not all active counters freeze when one overflows, so events may still be being counted by the time the PMU driver software manually turns them off, and again after it turns them back on. The only way to fix that is to use a single event and tolerate the overhead. Note that there have been perfctr NMI handler fixes in the kernel lately, too. If you do indeed track this down to a problem with the latest kernels, by all means let me know!

glandium commented 4 years ago

Since there's activity here, I'll leave a quick comment, although I originally wanted to wait until I get more fresh and complete information with the latest kernel.

On 3970X, I have been successfully using rr on single-threaded programs. I've done that on enough different programs that just having Ryzen support enabled in rr with a big warning would already be useful. I've been using 0xd1 and a skid size of 2000.

As far as the rr testcases are concerned, I haven't run them in a while, but last time I did some were still failing.

Something that might be worth noting: when running perf stat to get the 0xd1 counts on the small test assembly from @pipcet, but with more iterations, the results are actually very accurate, even when context switches were observed during the run per the perf stat output. However, when doing a perf record of those events followed by a perf script, the total count is lower than the expected count, which seems weird. Doing the same on an Intel CPU shows a total count that is higher than the expected count. Overshooting seems OK, but undershooting is really weird. That's one of the things I wanted to look into more before sending a message.

Also, FWIW, I've been able to reduce the number of failures of the join_threads test by re-running the replay from the same trace. rr could presumably do that automatically. But some of the errors stem from counts being off at recording time rather than replay time, and retrying the replay in those cases doesn't help.

rocallahan commented 4 years ago

rocallahan, AMD's perfctr h/w implementation differs from modern Intel's in that not all active counters freeze when one overflows, so events may still be being counted by the time the PMU driver software manually turns them off, and after it turns them back on again. Only way to fix that is to use a single event, and tolerate the overhead.

We configure the counters to count only user-space events, so when the PMU interrupt fires no more events should be counted even if active counters aren't "frozen".

Also, AFAICT for Ryzen we already only configure one counter.

rocallahan commented 4 years ago

I've done that on enough different programs that just having Ryzen support enabled in rr with a big warning would already be useful. I've been using 0xd1 and a skid size of 2000.

Isn't that already the current state of master? With a skid size of 1000.

rocallahan commented 4 years ago

However, when doing a perf record of those events, and a perf script, the total count is lower than the expected count, which seems weird.

Yes, that might be worth looking into.

@glandium can you run the rr tests and send me traces for those that fail with a ticks mismatch, along with the actual failure assertion(s)?

glandium commented 4 years ago

Isn't that already the current state of master?

Only a few Ryzens are supported by pmu_configs/get_cpu_microarch.

With a skid size of 1000.

For some reason, that was not enough for me.

can you run the rr tests and send me traces for those that fail with a ticks mismatch, along with the actual failure assertion(s)?

Sure, what do you need me to run specifically, and what files are you interested in?

rocallahan commented 4 years ago

Only a few Ryzens are supported by pmu_configs/get_cpu_microarch.

OK, feel free to send me a PR :-)

For some reason, that was not enough for me.

I wonder if this is just papering over real nondeterminism ... but OK.

If you clear out /tmp/rr-test-* and then run

cd .../rr/obj; ctest -j10 --verbose > /tmp/output

you will presumably get some errors like Assertion `ticks_now == trace_ticks' failed to hold. ticks mismatch for '...'; expected <N>, got <N'>. We'll need that output and any left-behind /tmp/rr-test-* directories zipped up. Thanks!

glandium commented 4 years ago

We'll need that output and any left-behind /tmp/rr-test-* directories zipped up.

AFAICT, that's also in record.err or replay.err in /tmp/rr-test-*, so I just zipped all of that. Will send it your way on Matrix.

rocallahan commented 4 years ago

Ok, I looked into all 8 test failures from @glandium's test run that failed the ticks_now == trace_ticks assertion. All 8 were overcounts by 1 or 2 events. 5 overcounted during recording, 3 overcounted during replay (and thus could probably succeed if they were replayed again on the AMD hardware or if we implemented fancy checkpoint and recovery logic).

rocallahan commented 4 years ago

I don't see any pattern in the execution leading up to the divergence. Sometimes it's after executing lots of instructions since the last syscall, sometimes only a few.

So unfortunately I don't think we know much more than we did before, except that I'm confident that the immediate problem is a very rare intermittent overcount by 1 or 2 (maybe sometimes more if we looked at more tests).

(It's possible that when we get an overcount by 2 we're getting the same glitch twice, although if the glitches were independent I'd expect the probability of each glitch to be so low that we would never see two glitches in the same event. In fact 4 out of 8 failures were overcounts by 2, which suggests a single glitch can add 2 rather than two independent glitches coinciding.)

rocallahan commented 4 years ago

OK so here some experiments anyone can try. Assuming as new-ish a kernel and firmware updates etc as possible.

1) Test the overcount hypothesis. Run the rr tests (https://github.com/mozilla/rr/wiki/Building-And-Installing#tests) using make check for full output. Focus on the tests that failed with a message like

 Assertion `ticks_now == trace_ticks' failed to hold. ticks mismatch for '...'; expected 2716819, got 2716817

If "got" is greater than "expected", try replaying the test recording using rr replay -a /tmp/rr-test-....../latest-trace. If the overcount hypothesis is correct, it will succeed. If it doesn't succeed after you've tried replaying it a few times, let us know!

If "got" is always greater than "expected", or "expected" always greater than "got", after several runs of the full test suite, please let us know that as well. And of course let us know if you don't get any test failures.

2) Retry 1), booting with hyperthreading disabled.

3) Retry 1), booting with hyperthreading disabled and only one core enabled.

4) Retry 1), booting with hyperthreading disabled, one core enabled, and NMI watchdog disabled.

5) Anything else you can disable that might simplify the system.

If none of that helps at all, then that's worth reporting too I guess. Thanks!

rocallahan commented 4 years ago

BTW you will probably need a patch like this. Adjust the CPUID values as needed.

    diff --git a/src/PerfCounters.cc b/src/PerfCounters.cc
    index 7da26a51..28cb2d8b 100644
    --- a/src/PerfCounters.cc
    +++ b/src/PerfCounters.cc
    @@ -70,6 +70,7 @@ enum CpuMicroarch {
       IntelCometlake,
       AMDF15R30,
       AMDRyzen,
    +  AMDRyzen2,
     };

     /*
    @@ -137,9 +138,10 @@ static const PmuConfig pmu_configs[] = {
       { AMDF15R30, "AMD Family 15h Revision 30h", 0xc4, 0xc6, 0, 250,
         PMU_TICKS_TAKEN_BRANCHES | PMU_SKIP_INTEL_BUG_CHECK },
       { AMDRyzen, "AMD Ryzen", 0x5100d1, 0, 0, 1000, PMU_TICKS_RCB },
    +  { AMDRyzen2, "AMD Ryzen 2", 0xd1, 0, 0x2c, 2000, PMU_TICKS_RCB | PMU_SKIP_INTEL_BUG_CHECK },
     };

    -#define RR_SKID_MAX 1000
    +#define RR_SKID_MAX 10000

     static string lowercase(const string& s) {
       string c = s;
    @@ -227,6 +229,8 @@ static CpuMicroarch get_cpu_microarch() {
            return IntelCometlake;
         case 0x30f00:
           return AMDF15R30;
    +    case 0x30f10: // AMD Ryzen Threadripper 3970X
    +      return AMDRyzen2;
         case 0x00f10:
           if (ext_family == 8) {
             if (!Flags::get().suppress_environment_warnings) {

glandium commented 4 years ago

Things have evolved on my end. I upgraded both the BIOS and the kernel, and now most errors are Assertion `!counting_period || interrupt_val <= adjusted_counting_period' failed to hold rather than Assertion `ticks_now == trace_ticks' failed to hold, although some of the latter remain. The kernel logs don't seem to indicate a microcode version difference, but going back to the old kernel doesn't restore the old behavior either. Another error that I hadn't noticed before is Assertion `stable_exit' failed to hold. PTRACE_O_TRACEEXIT only supported for stable exits for now, but that one is an assert during record.

Disabling hyperthreading, disabling cores, and disabling the NMI watchdog make no difference.

rr replay'ing a trace that failed with ticks_now == trace_ticks does work most of the time. rr replay'ing a trace that failed with !counting_period || interrupt_val <= adjusted_counting_period doesn't work.

Increasing the skid even further than 2000 "fixes" the !counting_period || interrupt_val <= adjusted_counting_period ones.

It's also worth noting that a bunch of tests are actually timing out rather than failing with an assert.