Open stepthom opened 7 years ago
expected 3086, got 4294970382
That looks ... bad. I haven't seen anything like that before. I suspect a bug in Parallels' PMU virtualization. I'd focus on recording and replaying the simple test and use strace or add logging to PerfCounters.cc to get a detailed log of what's happening. Also, you could dump the results of rr dump somewhere so we can check if we got reasonable recording results.
Those values are different by exactly 2^32, fwiw.
I guess you could try masking off the high bit and see how far that gets you...
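The arithmetic behind that observation can be checked directly. This is a minimal sketch using only the two values from the failing test output; the choice of a 32-bit mask is an assumption about where the stray bit sits, not something rr is known to do.

```python
expected = 3086
got = 4294970382

diff = got - expected
# The gap is a single power of two: only bit 32 differs between the values.
stray_bit = diff.bit_length() - 1

# Assumption: keeping the low 32 bits is enough to drop the stray bit.
LOW_32_BITS = (1 << 32) - 1
masked = got & LOW_32_BITS

print(stray_bit, masked)  # prints "32 3086"
```

This only shows that masking recovers the expected value for this one sample; it says nothing about whether the remaining bits are trustworthy in general.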
@rocallahan Sorry to be a noob, but I just downloaded rr yesterday and know basically nothing about it or the codebase or how things work. I'm not sure how to perform the actions you suggested. I'm very happy to try if given a bit more direction.
One additional piece of information: I noticed today that each test sometimes passes, sometimes not. Seems to be some nondeterminism involved somehow. (Maybe that's expected, I'm not sure.)
rr uses performance counters to establish a timeline of program execution. This is then used during replay to deliver asynchronous events (e.g. SIGALRM) to the tracee at the same point they were delivered by the kernel during recording. The code that interfaces with the performance counter APIs is in src/PerfCounters.cc.
@rocallahan is suggesting that you modify read_counter. If the counter values reported by Parallels are off by some fixed amount, you may be able to compensate for it there.
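To make the idea concrete, here is a toy model in Python, not rr's actual code. The `read_counter` and `replay` functions below are hypothetical stand-ins written for illustration, and the constant error of 2^32 is an assumption taken from the gap between 3086 and 4294970382. It shows why a counter that over-reports by a fixed amount causes a recorded signal to be missed during replay, and how subtracting that amount at the read site restores it.

```python
FIXED_ERROR = 1 << 32  # assumption: the virtualized counter over-reports by a constant


def read_counter(raw, correction=0):
    """Hypothetical stand-in for a counter read that subtracts a known fixed error."""
    return raw - correction


def replay(steps, signal_tick, read):
    """Replay a toy tracee, injecting SIGALRM when the observed tick count matches."""
    ticks = 0  # true number of ticks executed so far
    trace = []
    for step in steps:
        if read(ticks) == signal_tick:
            trace.append("SIGALRM")
        trace.append(step)
        ticks += 1
    return trace


steps = ["a", "b", "c", "d"]
# During "recording" the signal arrived after two ticks.
good = replay(steps, 2, lambda t: read_counter(t))
# A counter that is off by a constant never matches the recorded tick.
broken = replay(steps, 2, lambda t: read_counter(t + FIXED_ERROR))
# Compensating at the read site restores the recorded behaviour.
fixed = replay(steps, 2, lambda t: read_counter(t + FIXED_ERROR, FIXED_ERROR))
```

In this model `broken` contains no SIGALRM at all, while `fixed` is identical to `good`. A compensation like this only helps if the error really is constant.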
As pointed out by @rickard-von-essen in https://github.com/mist64/xhyve/issues/91#issuecomment-255054749
Just a note about Parallels and the bug you link. The author uses a ~4 y old version of Parallels Desktop for Mac so I wouldn't assume that it is the current state of PD PMU virtualization.
I can confirm that this looks the same on Parallels Desktop 12 for Mac Pro ed 12.0.2 (41353). With the virtualisation type set to Parallels.
$ dmesg|grep PMU
[ 0.009541] Performance Events: PEBS fmt2+, 16-deep LBR, Haswell events, full-width counters, Broken PMU hardware detected, using software events only.
$ uname -a
Linux fedora-24-x86-64.shared 4.5.5-300.fc24.x86_64 #1 SMP Thu May 19 13:05:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
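For anyone wanting to check their own guest, the telltale part of that kernel message can be matched with a few lines of Python. This is a minimal sketch; the helper name `pmu_is_software_only` is made up for illustration, and the sample line is the dmesg output quoted above.

```python
def pmu_is_software_only(dmesg_line: str) -> bool:
    """Return True if the kernel reported falling back to software perf events."""
    return "Broken PMU hardware detected" in dmesg_line


sample = (
    "[ 0.009541] Performance Events: PEBS fmt2+, 16-deep LBR, Haswell events, "
    "full-width counters, Broken PMU hardware detected, using software events only."
)
affected = pmu_is_software_only(sample)
```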
@legal90 Do you know if there is anyone at Parallels who would like to dive into this?
Hi everyone! Yes, unfortunately, we have some issues with PMU in Linux guests in Parallels Desktop 12 and earlier. Our engineers at Parallels are working on that and it should be fixed in one of the upcoming updates for Parallels Desktop. I will let you know here when it is fixed. Sorry for the inconvenience 😞
Hi! PMU virtualization for Linux guests is fixed in Parallels Desktop 13. Let me know if you have any issues.
@Gumix That is great news! Does this mean that rr passes all tests when run in a Linux guest on Parallels Desktop 13?
@sidkshatriya Well, most of them passed :) I'm not familiar with rr, so don't know how to investigate these failures, but I hope that they're not caused by virtualization (don't see anything strange in internal logs).
97% tests passed, 54 tests failed out of 2117
The following tests FAILED:
42 - chew_cpu (Failed)
43 - chew_cpu-no-syscallbuf (Failed)
322 - prctl_caps (Failed)
568 - strict_priorities (Failed)
588 - syscallbuf_timeslice2 (Failed)
600 - thread_yield (Failed)
656 - async_signal_syscalls (Timeout)
658 - async_signal_syscalls2 (Timeout)
660 - async_signal_syscalls_siginfo (Failed)
667 - block_clone_interrupted-no-syscallbuf (Failed)
682 - breakpoint_overlap (Failed)
692 - check_lost_interrupts (Failed)
828 - string_instructions_async_signals (Failed)
830 - string_instructions_async_signals_shared (Timeout)
831 - string_instructions_async_signals_shared-no-syscallbuf (Failed)
840 - syscallbuf_signal_blocking_read (Failed)
877 - watchpoint_at_sched-no-syscallbuf (Failed)
884 - async_signal_syscalls_100 (Timeout)
886 - async_signal_syscalls_1000 (Timeout)
887 - async_signal_syscalls_1000-no-syscallbuf (Failed)
912 - break_time_slice (Failed)
913 - break_time_slice-no-syscallbuf (Failed)
938 - deliver_async_signal_during_syscalls (Timeout)
1044 - syscallbuf_timeslice_250 (Failed)
1045 - syscallbuf_timeslice_250-no-syscallbuf (Failed)
1100 - chew_cpu-32 (Failed)
1604 - stack_growth_syscallbuf-32 (Failed)
1626 - strict_priorities-32 (Failed)
1627 - strict_priorities-32-no-syscallbuf (Failed)
1646 - syscallbuf_timeslice2-32 (Failed)
1714 - async_signal_syscalls-32 (Timeout)
1718 - async_signal_syscalls_siginfo-32 (Failed)
1721 - async_usr1-32-no-syscallbuf (Failed)
1724 - block_clone_interrupted-32 (Failed)
1725 - block_clone_interrupted-32-no-syscallbuf (Failed)
1750 - check_lost_interrupts-32 (Failed)
1752 - clone_interruption-32 (Failed)
1768 - daemon_read-32 (Failed)
1803 - ignored_async_usr1-32-no-syscallbuf (Failed)
1868 - seccomp_signals-32 (Failed)
1886 - string_instructions_async_signals-32 (Failed)
1889 - string_instructions_async_signals_shared-32-no-syscallbuf (Timeout)
1898 - syscallbuf_signal_blocking_read-32 (Failed)
1934 - watchpoint_at_sched-32 (Failed)
1935 - watchpoint_at_sched-32-no-syscallbuf (Failed)
1937 - watchpoint_before_signal-32-no-syscallbuf (Failed)
1943 - async_signal_syscalls_100-32-no-syscallbuf (Failed)
1944 - async_signal_syscalls_1000-32 (Timeout)
1970 - break_time_slice-32 (Failed)
1978 - checkpoint_async_signal_syscalls_1000-32 (Timeout)
1983 - checkpoint_prctl_name-32-no-syscallbuf (Failed)
1996 - deliver_async_signal_during_syscalls-32 (Timeout)
2045 - reverse_alarm-32-no-syscallbuf (Failed)
2102 - syscallbuf_timeslice_250-32 (Failed)
There are many errors like:
----------------------------------------------------------
/rr/src/test/util.sh: line 210: 31571 Aborted (core dumped)
_RR_TRACE_DIR="$workdir" test-monitor $TIMEOUT record.err rr $GLOBAL_OPTIONS record $LIB_ARG $RECORD_ARGS "$exe" $exeargs > record.out 2> record.err
Test 'syscallbuf_timeslice_250_32' FAILED: : error during recording:
--------------------------------------------------
[FATAL /rr/src/PerfCounters.cc:762:read_ticks() errno: SUCCESS]
(task 31580 (rec:31580) at time 177)
-> Assertion `!counting_period || interrupt_val <= adjusted_counting_period' failed to hold. Detected 741 ticks, expected no more than 685
warning: remote target does not support file transfer, attempting to access files from local filesystem.
[FATAL /rr/src/log.cc:356:emergency_debug() errno: SUCCESS] Can't resume execution from invalid state
What Linux kernel version are you using in the guest?
Showing us the errors from a few more of the tests might be helpful.
What Linux kernel version are you using in the guest?
4.11.8-300.fc26.x86_64
Showing us the errors from a few more of the tests might be helpful.
You are hitting: http://robert.ocallahan.org/2017/06/patch-on-linux-kernel-stable-branches.html
Try with a guest kernel that's not affected by that regression: http://robert.ocallahan.org/2017/07/upstream-stable-kernels-work-with-rr.html
Here are the results with 4.12.8-300.fc26.x86_64:
99% tests passed, 29 tests failed out of 2117
The following tests FAILED:
13 - alarm-no-syscallbuf (Failed)
42 - chew_cpu (Failed)
43 - chew_cpu-no-syscallbuf (Failed)
202 - keyctl (Failed)
203 - keyctl-no-syscallbuf (Failed)
568 - strict_priorities (Failed)
667 - block_clone_interrupted-no-syscallbuf (Failed)
692 - check_lost_interrupts (Failed)
804 - reverse_step_threads (Failed)
828 - string_instructions_async_signals (Failed)
876 - watchpoint_at_sched (Failed)
913 - break_time_slice-no-syscallbuf (Failed)
931 - cont_signal-no-syscallbuf (Failed)
1100 - chew_cpu-32 (Failed)
1101 - chew_cpu-32-no-syscallbuf (Failed)
1260 - keyctl-32 (Failed)
1261 - keyctl-32-no-syscallbuf (Failed)
1724 - block_clone_interrupted-32 (Failed)
1725 - block_clone_interrupted-32-no-syscallbuf (Failed)
1750 - check_lost_interrupts-32 (Failed)
1802 - ignored_async_usr1-32 (Failed)
1886 - string_instructions_async_signals-32 (Timeout)
1887 - string_instructions_async_signals-32-no-syscallbuf (Failed)
1888 - string_instructions_async_signals_shared-32 (Timeout)
1899 - syscallbuf_signal_blocking_read-32-no-syscallbuf (Failed)
1937 - watchpoint_before_signal-32-no-syscallbuf (Failed)
1970 - break_time_slice-32 (Failed)
1971 - break_time_slice-32-no-syscallbuf (Failed)
2045 - reverse_alarm-32-no-syscallbuf (Failed)
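To see which failures went away with the newer kernel and which persist, the two lists can be compared mechanically. The sketch below assumes the line format shown in the lists above; the inline samples are short excerpts from the two runs, not the full lists, and the function name `parse_failures` is made up for illustration.

```python
import re

# Matches lines such as "42 - chew_cpu (Failed)" or "656 - async_signal_syscalls (Timeout)".
FAILURE_LINE = re.compile(r"^\s*(\d+) - (\S+) \((Failed|Timeout)\)\s*$")


def parse_failures(text):
    """Map test name -> status for every failure line found in the text."""
    failures = {}
    for line in text.splitlines():
        match = FAILURE_LINE.match(line)
        if match:
            failures[match.group(2)] = match.group(3)
    return failures


# Excerpt from the 4.11.8 run.
before = parse_failures("""
42 - chew_cpu (Failed)
322 - prctl_caps (Failed)
656 - async_signal_syscalls (Timeout)
""")
# Excerpt from the 4.12.8 run.
after = parse_failures("""
13 - alarm-no-syscallbuf (Failed)
42 - chew_cpu (Failed)
""")

still_failing = sorted(before.keys() & after.keys())
no_longer_failing = sorted(before.keys() - after.keys())
newly_failing = sorted(after.keys() - before.keys())
```

Since the tests were reported to pass or fail nondeterministically, a single comparison like this is only a hint; tests that fail in both runs, such as chew_cpu, are the more reliable signal.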
Thanks. It looks like PMU interrupts are sometimes being dropped :-(.
If you're sure you were running with 4.12.8 then that means interrupts are being dropped further down the stack, in Parallels or possibly by MacOS itself.
It looks like the chew_cpu test is reproducing this reliably.
I will check it. BTW, Linux perf subsystem configures APIC to "NMI" delivery mode, but the hypervisor intercepts it and writes "Normal" delivery mode to the real APIC for PMI. It may lead to some inaccuracy in the results.
I didn't see Parallels either way on the list of supported VMs on the wiki page (https://github.com/mozilla/rr/wiki/Building-And-Installing), so I'm not sure if I'm barking up the wrong tree by trying to use rr in Parallels... I'm running Parallels 9. I have PMU virtualization enabled. Guest OS is Ubuntu 14.04. I cloned the HEAD of rr's master branch, currently b0ae677d36dd4fcbbef41be0c01f45ee657e11b8. Built fine, but make test is failing for most of the tests:
If I look in Testing/Temporary/LastTest.log.tmp, I see things like:
The output of perf list is:
Also: