rr-debugger / rr

Record and Replay Framework
http://rr-project.org/
Other
9.09k stars 580 forks source link

make test fails for most tests on a Parallels VM #1818

Open stepthom opened 7 years ago

stepthom commented 7 years ago

I didn't see Parallels either way on list of supported VMs on the wiki page (https://github.com/mozilla/rr/wiki/Building-And-Installing), so I'm not sure if I'm barking up the wrong tree by trying to use rr in Parallels...

I'm running Parallels 9. I have PMU virtualization enabled. Guest OS is Ubuntu 14.04. I cloned the HEAD of rr's master branch, currently b0ae677d36dd4fcbbef41be0c01f45ee657e11b8. Built fine, but make test is failing for most the tests:

[dev@ubuntu:~/rr/obj (master)] $ make test
Running tests...
Test project /home/dev/rr/obj
        Start   1: 64bit_child
  1/1832 Test   #1: 64bit_child ...............................................***Failed  Error regular expression found in output. Regex=[FAILED]  0.36 sec
        Start   2: 64bit_child-no-syscallbuf
  2/1832 Test   #2: 64bit_child-no-syscallbuf .................................***Failed  Error regular expression found in output. Regex=[FAILED]  0.35 sec
        Start   3: _llseek
  3/1832 Test   #3: _llseek ...................................................***Failed  Error regular expression found in output. Regex=[FAILED]  0.24 sec
        Start   4: _llseek-no-syscallbuf
  4/1832 Test   #4: _llseek-no-syscallbuf .....................................***Failed  Error regular expression found in output. Regex=[FAILED]  0.25 sec
        Start   5: accept
  5/1832 Test   #5: accept ....................................................   Passed    0.33 sec
        Start   6: accept-no-syscallbuf
  6/1832 Test   #6: accept-no-syscallbuf ......................................***Failed  Error regular expression found in output. Regex=[FAILED]  0.30 sec
        Start   7: alarm
  7/1832 Test   #7: alarm .....................................................***Failed  Error regular expression found in output. Regex=[FAILED]  1.27 sec
        Start   8: alarm-no-syscallbuf
  8/1832 Test   #8: alarm-no-syscallbuf .......................................***Failed  Error regular expression found in output. Regex=[FAILED]  1.24 sec
        Start   9: alarm2
  9/1832 Test   #9: alarm2 ....................................................***Failed  Error regular expression found in output. Regex=[FAILED]  1.24 sec
        Start  10: alarm2-no-syscallbuf
 10/1832 Test  #10: alarm2-no-syscallbuf ......................................***Failed  Error regular expression found in output. Regex=[FAILED]  1.23 sec
        Start  11: alsa_ioctl
 11/1832 Test  #11: alsa_ioctl ................................................***Failed  Error regular expression found in output. Regex=[FAILED]  0.23 sec
        Start  12: alsa_ioctl-no-syscallbuf
 12/1832 Test  #12: alsa_ioctl-no-syscallbuf ..................................***Failed  Error regular expression found in output. Regex=[FAILED]  0.24 sec
        Start  13: arch_prctl
 13/1832 Test  #13: arch_prctl ................................................***Failed  Error regular expression found in output. Regex=[FAILED]  0.23 sec
        Start  14: arch_prctl-no-syscallbuf
 14/1832 Test  #14: arch_prctl-no-syscallbuf ..................................***Failed  Error regular expression found in output. Regex=[FAILED]  0.26 sec
        Start  15: async_segv_ignored
 15/1832 Test  #15: async_segv_ignored ........................................***Failed  Error regular expression found in output. Regex=[FAILED]  0.25 sec
        Start  16: async_segv_ignored-no-syscallbuf
 16/1832 Test  #16: async_segv_ignored-no-syscallbuf ..........................   Passed    0.16 sec
        Start  17: at_threadexit
 17/1832 Test  #17: at_threadexit .............................................***Failed  Error regular expression found in output. Regex=[FAILED]  0.31 sec
        Start  18: at_threadexit-no-syscallbuf
 18/1832 Test  #18: at_threadexit-no-syscallbuf ...............................***Failed  Error regular expression found in output. Regex=[FAILED]  0.27 sec
        Start  19: bad_ip
 19/1832 Test  #19: bad_ip ....................................................   Passed    0.17 sec
        Start  20: bad_ip-no-syscallbuf
 20/1832 Test  #20: bad_ip-no-syscallbuf ......................................***Failed  Error regular expression found in output. Regex=[FAILED]  0.24 sec
        Start  21: bad_syscall
 21/1832 Test  #21: bad_syscall ...............................................***Failed  Error regular expression found in output. Regex=[FAILED]  0.24 sec
        Start  22: bad_syscall-no-syscallbuf
 22/1832 Test  #22: bad_syscall-no-syscallbuf .................................***Failed  Error regular expression found in output. Regex=[FAILED]  0.23 sec
        Start  23: barrier

If I look in Testing/Temporary/LastTest.log.tmp, I see things like:

Start testing: Sep 21 12:13 EDT
----------------------------------------------------------
1/1832 Testing: 64bit_child
1/1832 Test: 64bit_child
Command: "/bin/bash" "/home/dev/rr/src/test/basic_test.run" "-b" "/home/dev/rr" "/home/dev/rr/obj" "64bit_child"
Directory: /home/dev/rr/obj
"64bit_child" start time: Sep 21 12:13 EDT
Output:
----------------------------------------------------------
/home/dev/rr/src/test/util.sh: line 219: 18001 Aborted                 (core dumped) _RR_TRACE_DIR="$workdir" rr $GLOBAL_OPTIONS replay -a $replayflags > replay.out 2> replay.err
Test '64bit_child' FAILED: : error during replay:
--------------------------------------------------
[FATAL /home/dev/rr/src/ReplaySession.cc:847:check_ticks_consistency() errno: SUCCESS]
 (task 18002 (rec:17999) at time 104)
 -> Assertion `ticks_now == trace_ticks' failed to hold. ticks mismatch for 'SYSCALL: open'; expected 3086, got 4294970382
[FATAL /home/dev/rr/src/log.cc:283:emergency_debug() errno: SUCCESS] (session doesn't look interactive, aborting emergency debugging)
--------------------------------------------------
replay.out:
--------------------------------------------------
--------------------------------------------------
Test 64bit_child failed, leaving behind /tmp/rr-test-64bit_child-P3OscNkkp
To replay the failed test, run
  _RR_TRACE_DIR=/tmp/rr-test-64bit_child-P3OscNkkp rr replay
<end of output>
Test time =   0.36 sec
----------------------------------------------------------
Test Fail Reason:
Error regular expression found in output. Regex=[FAILED]
"64bit_child" end time: Sep 21 12:13 EDT
"64bit_child" time elapsed: 00:00:00
----------------------------------------------------------

The output of perf list is:

[dev@ubuntu:~/rr/obj (master)] $ perf list

List of pre-defined events (to be used in -e):
  cpu-cycles OR cycles                               [Hardware event]
  instructions                                       [Hardware event]
  cache-references                                   [Hardware event]
  cache-misses                                       [Hardware event]
  branch-instructions OR branches                    [Hardware event]
  branch-misses                                      [Hardware event]
  bus-cycles                                         [Hardware event]
  stalled-cycles-frontend OR idle-cycles-frontend    [Hardware event]
  ref-cycles                                         [Hardware event]

  cpu-clock                                          [Software event]
  task-clock                                         [Software event]
  page-faults OR faults                              [Software event]
  context-switches OR cs                             [Software event]
  cpu-migrations OR migrations                       [Software event]
  minor-faults                                       [Software event]
  major-faults                                       [Software event]
  alignment-faults                                   [Software event]
  emulation-faults                                   [Software event]
  dummy                                              [Software event]

  L1-dcache-load-misses                              [Hardware cache event]
  L1-dcache-store-misses                             [Hardware cache event]
  L1-dcache-prefetch-misses                          [Hardware cache event]
  L1-icache-load-misses                              [Hardware cache event]
  LLC-loads                                          [Hardware cache event]
  LLC-stores                                         [Hardware cache event]
  LLC-prefetches                                     [Hardware cache event]
  dTLB-load-misses                                   [Hardware cache event]
  dTLB-store-misses                                  [Hardware cache event]
  iTLB-loads                                         [Hardware cache event]
  iTLB-load-misses                                   [Hardware cache event]
  branch-loads                                       [Hardware cache event]
  branch-load-misses                                 [Hardware cache event]

  branch-instructions OR cpu/branch-instructions/    [Kernel PMU event]
  branch-misses OR cpu/branch-misses/                [Kernel PMU event]
  bus-cycles OR cpu/bus-cycles/                      [Kernel PMU event]
  cache-misses OR cpu/cache-misses/                  [Kernel PMU event]
  cache-references OR cpu/cache-references/          [Kernel PMU event]
  cpu-cycles OR cpu/cpu-cycles/                      [Kernel PMU event]
  instructions OR cpu/instructions/                  [Kernel PMU event]
  mem-loads OR cpu/mem-loads/                        [Kernel PMU event]
  mem-stores OR cpu/mem-stores/                      [Kernel PMU event]
  stalled-cycles-frontend OR cpu/stalled-cycles-frontend/ [Kernel PMU event]

  rNNN                                               [Raw hardware event descriptor]
  cpu/t1=v1[,t2=v2,t3 ...]/modifier                  [Raw hardware event descriptor]
   (see 'man perf-list' on how to encode it)

  mem:<addr>[:access]                                [Hardware breakpoint]

  [ Tracepoints not available: Permission denied ]

Also:

[dev@ubuntu:~/rr/obj (master)] $ dmesg | grep PMU
[    5.040752] Performance Events: PEBS fmt1+, 16-deep LBR, IvyBridge events, full-width counters, Intel PMU driver.
[    5.253574] NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
rocallahan commented 7 years ago

expected 3086, got 4294970382

That looks ... bad. I haven't seen anything like that before. I suspect a bug in Parallel's PMU virtualization. I'd focus on recording and replaying the simple test and use strace or add logging to PerfCounters.cc to get a detailed log of what's happening. Also you could dump the results of rr dump somewhere so we can check if we got reasonable recording results.

khuey commented 7 years ago

Those values are different by exactly 2^31, fwiw.

rocallahan commented 7 years ago

I guess you could try masking off the high bit and see how far that gets you...

stepthom commented 7 years ago

@rocallahan Sorry to be a noob, but I just downloaded rr yesterday and know basically nothing about it or the codebase or how things work. I'm not sure how to perform the actions you suggested. I'm very happy to try if given a bit more direction.

One additional piece of information: I noticed today that each test sometimes passes, sometimes not. Seems to be some nondeterminism involved somehow. (Maybe that's expected, I'm not sure.)

khuey commented 7 years ago

rr uses performance counters to establish a timeline of program execution. This is then used during replay to deliver asynchronous events (e.g. SIGALRM) to the tracee at the same point it was delivered by the kernel during recording. The code that interfaces with the performance counter APIs is in src/PerfCounters.cc.

@rocallahan is suggesting that you modify read_counter. If the counter values reported by Parallels is off by some fixed amount you may be able to compensate for it there.

sidkshatriya commented 7 years ago

As pointed out by @rickard-von-essen in https://github.com/mist64/xhyve/issues/91#issuecomment-255054749

Just a note about Parallels and the bug you link. The author uses a ~4 y old version of Parallels Desktop for Mac so I wouldn't assume that it the current state of PD PMU virtualization.

rickard-von-essen commented 7 years ago

I can confirm that this looks the same on Parallels Desktop 12 for Mac Pro ed 12.0.2 (41353). With the virtualisation type set to Parallels.

rickard-von-essen commented 7 years ago
$ dmesg|grep PMU
[    0.009541] Performance Events: PEBS fmt2+, 16-deep LBR, Haswell events, full-width counters, Broken PMU hardware detected, using software events only.
$ uname -a
Linux fedora-24-x86-64.shared 4.5.5-300.fc24.x86_64 #1 SMP Thu May 19 13:05:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

@legal90 Do you know if there is anyone at Parallels who would like to dive into this?

legal90 commented 7 years ago

Hi everyone! Yes, unfortunately, we have some issues with PMU in Linux guests in Parallels Desktop 12 and earlier. Our engineers at Parallels are working on that and it should be fixed in one of the upcoming updates for Parallels Desktop. I will let you know here when it is fixed. Sorry for the inconvenience 😞

Gumix commented 7 years ago

Hi! PMU virtualization for Linux guests is fixed in Parallels Desktop 13. Let me know if you have any issues.

sidkshatriya commented 7 years ago

@Gumix That is great news! Does this mean that rr passes all tests when run in a Linux guest on Parallels Desktop 13?

Gumix commented 7 years ago

@sidkshatriya Well, most of them passed :) I'm not familiar with rr, so don't know how to investigate these failures, but I hope that they're not caused by virtualization (don't see anything strange in internal logs).

97% tests passed, 54 tests failed out of 2117

The following tests FAILED:
         42 - chew_cpu (Failed)
         43 - chew_cpu-no-syscallbuf (Failed)
        322 - prctl_caps (Failed)
        568 - strict_priorities (Failed)
        588 - syscallbuf_timeslice2 (Failed)
        600 - thread_yield (Failed)
        656 - async_signal_syscalls (Timeout)
        658 - async_signal_syscalls2 (Timeout)
        660 - async_signal_syscalls_siginfo (Failed)
        667 - block_clone_interrupted-no-syscallbuf (Failed)
        682 - breakpoint_overlap (Failed)
        692 - check_lost_interrupts (Failed)
        828 - string_instructions_async_signals (Failed)
        830 - string_instructions_async_signals_shared (Timeout)
        831 - string_instructions_async_signals_shared-no-syscallbuf (Failed)
        840 - syscallbuf_signal_blocking_read (Failed)
        877 - watchpoint_at_sched-no-syscallbuf (Failed)
        884 - async_signal_syscalls_100 (Timeout)
        886 - async_signal_syscalls_1000 (Timeout)
        887 - async_signal_syscalls_1000-no-syscallbuf (Failed)
        912 - break_time_slice (Failed)
        913 - break_time_slice-no-syscallbuf (Failed)
        938 - deliver_async_signal_during_syscalls (Timeout)
        1044 - syscallbuf_timeslice_250 (Failed)
        1045 - syscallbuf_timeslice_250-no-syscallbuf (Failed)
        1100 - chew_cpu-32 (Failed)
        1604 - stack_growth_syscallbuf-32 (Failed)
        1626 - strict_priorities-32 (Failed)
        1627 - strict_priorities-32-no-syscallbuf (Failed)
        1646 - syscallbuf_timeslice2-32 (Failed)
        1714 - async_signal_syscalls-32 (Timeout)
        1718 - async_signal_syscalls_siginfo-32 (Failed)
        1721 - async_usr1-32-no-syscallbuf (Failed)
        1724 - block_clone_interrupted-32 (Failed)
        1725 - block_clone_interrupted-32-no-syscallbuf (Failed)
        1750 - check_lost_interrupts-32 (Failed)
        1752 - clone_interruption-32 (Failed)
        1768 - daemon_read-32 (Failed)
        1803 - ignored_async_usr1-32-no-syscallbuf (Failed)
        1868 - seccomp_signals-32 (Failed)
        1886 - string_instructions_async_signals-32 (Failed)
        1889 - string_instructions_async_signals_shared-32-no-syscallbuf (Timeout)
        1898 - syscallbuf_signal_blocking_read-32 (Failed)
        1934 - watchpoint_at_sched-32 (Failed)
        1935 - watchpoint_at_sched-32-no-syscallbuf (Failed)
        1937 - watchpoint_before_signal-32-no-syscallbuf (Failed)
        1943 - async_signal_syscalls_100-32-no-syscallbuf (Failed)
        1944 - async_signal_syscalls_1000-32 (Timeout)
        1970 - break_time_slice-32 (Failed)
        1978 - checkpoint_async_signal_syscalls_1000-32 (Timeout)
        1983 - checkpoint_prctl_name-32-no-syscallbuf (Failed)
        1996 - deliver_async_signal_during_syscalls-32 (Timeout)
        2045 - reverse_alarm-32-no-syscallbuf (Failed)
        2102 - syscallbuf_timeslice_250-32 (Failed)

There are many errors like:

----------------------------------------------------------
/rr/src/test/util.sh: line 210: 31571 Aborted (core dumped)
_RR_TRACE_DIR="$workdir" test-monitor $TIMEOUT record.err rr $GLOBAL_OPTIONS record $LIB_ARG $RECORD_ARGS "$exe" $exeargs > record.out 2> record.err
Test 'syscallbuf_timeslice_250_32' FAILED: : error during recording:
--------------------------------------------------
[FATAL /rr/src/PerfCounters.cc:762:read_ticks() errno: SUCCESS]
 (task 31580 (rec:31580) at time 177)
 -> Assertion `!counting_period || interrupt_val <= adjusted_counting_period' failed to hold. Detected 741 ticks, expected no more than 685
warning: remote target does not support file transfer, attempting to access files from local filesystem.
[FATAL /rr/src/log.cc:356:emergency_debug() errno: SUCCESS] Can't resume execution from invalid state
rocallahan commented 7 years ago

What Linux kernel version are you using in the guest?

Showing us the errors from a few more of the tests might be helpful.

Gumix commented 7 years ago

What Linux kernel version are you using in the guest?

4.11.8-300.fc26.x86_64

Showing us the errors from a few more of the tests might be helpful.

rocallahan commented 7 years ago

You are hitting: http://robert.ocallahan.org/2017/06/patch-on-linux-kernel-stable-branches.html

Try with a guest kernel that's not affected by that regression: http://robert.ocallahan.org/2017/07/upstream-stable-kernels-work-with-rr.html

Gumix commented 7 years ago

Here are the results with 4.12.8-300.fc26.x86_64:

99% tests passed, 29 tests failed out of 2117

The following tests FAILED:
         13 - alarm-no-syscallbuf (Failed)
         42 - chew_cpu (Failed)
         43 - chew_cpu-no-syscallbuf (Failed)
        202 - keyctl (Failed)
        203 - keyctl-no-syscallbuf (Failed)
        568 - strict_priorities (Failed)
        667 - block_clone_interrupted-no-syscallbuf (Failed)
        692 - check_lost_interrupts (Failed)
        804 - reverse_step_threads (Failed)
        828 - string_instructions_async_signals (Failed)
        876 - watchpoint_at_sched (Failed)
        913 - break_time_slice-no-syscallbuf (Failed)
        931 - cont_signal-no-syscallbuf (Failed)
        1100 - chew_cpu-32 (Failed)
        1101 - chew_cpu-32-no-syscallbuf (Failed)
        1260 - keyctl-32 (Failed)
        1261 - keyctl-32-no-syscallbuf (Failed)
        1724 - block_clone_interrupted-32 (Failed)
        1725 - block_clone_interrupted-32-no-syscallbuf (Failed)
        1750 - check_lost_interrupts-32 (Failed)
        1802 - ignored_async_usr1-32 (Failed)
        1886 - string_instructions_async_signals-32 (Timeout)
        1887 - string_instructions_async_signals-32-no-syscallbuf (Failed)
        1888 - string_instructions_async_signals_shared-32 (Timeout)
        1899 - syscallbuf_signal_blocking_read-32-no-syscallbuf (Failed)
        1937 - watchpoint_before_signal-32-no-syscallbuf (Failed)
        1970 - break_time_slice-32 (Failed)
        1971 - break_time_slice-32-no-syscallbuf (Failed)
        2045 - reverse_alarm-32-no-syscallbuf (Failed)

Full log is here.

rocallahan commented 7 years ago

Thanks. It looks like PMU interrupts are sometimes being dropped :-(.

If you're sure you were running with 4.12.8 then that means interrupts are being dropped further down the stack, in Parallels or possibly by MacOS itself.

It looks like the chew_cpu test is reproducing this reliably.

Gumix commented 7 years ago

I will check it. BTW, Linux perf subsystem configures APIC to "NMI" delivery mode, but the hypervisor intercepts it and writes "Normal" delivery mode to the real APIC for PMI. It may lead to some inaccuracy in the results.