rr-debugger / rr

Record and Replay Framework
http://rr-project.org/
Other
9.08k stars 577 forks source link

Intermittent make check test failures #1674

Open nical opened 8 years ago

nical commented 8 years ago

With a local build of today's master branch, make check fails with this at the end of the log:

The following tests FAILED:
    720 - reverse_alarm-no-syscallbuf (Failed)
    1452 - checkpoint_async_signal_syscalls_1000-32-no-syscallbuf (Failed)
    1501 - reverse_alarm-32 (Failed)

I haven't tried to build rr for a while on this computer so I am not implying that this would be a recent regression.

Some info about the computer: I am running Fedora 23 with gcc 5.3.1 installed (assuming that's what rr's build system uses)

$ cat /proc/version
Linux version 4.4.4-301.fc23.x86_64 (mockbuild@bkernel02.phx2.fedoraproject.org) (gcc version 5.3.1 20151207 (Red Hat 5.3.1-2) (GCC) ) #1 SMP Fri Mar 4 17:42:42 UTC 2016
$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                12
On-line CPU(s) list:   0-11
Thread(s) per core:    2
Core(s) per socket:    6
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Model name:            Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz
Stepping:              2
CPU MHz:               1244.003
CPU max MHz:           3800.0000
CPU min MHz:           1200.0000
BogoMIPS:              6984.35
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              15360K
NUMA node0 CPU(s):     0-11
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm ida arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc

Not sure what other information could be useful, let me know.

rocallahan commented 8 years ago

Are those failures repeatable --- if you run the tests again, do the same tests fail?

If you do something like

grep ^720: make_check_output_file

You can extract the output for each test. Pasting that here would be helpful.

nical commented 8 years ago

Turns out they are intermittent. here is the results of 3 other runs:

The following tests FAILED:
    1094 - set_ptracer-32-no-syscallbuf (Failed)

The following tests FAILED:
    129 - inotify (Failed)
    130 - inotify-no-syscallbuf (Failed)
    911 - inotify-32 (Failed)
    912 - inotify-32-no-syscallbuf (Failed)

The following tests FAILED:
    129 - inotify (Failed)
    130 - inotify-no-syscallbuf (Failed)
    393 - stdout_redirect (Failed)
    504 - clone_vfork-no-syscallbuf (Failed)
    510 - condvar_stress-no-syscallbuf (Failed)
    911 - inotify-32 (Failed)
    912 - inotify-32-no-syscallbuf (Failed)
    1336 - mutex_pi_stress-32-no-syscallbuf (Failed)

I'll come back with some of the failing test outputs.

nical commented 8 years ago

Here are parts of the log around tests that failed in another run:

911: Test 'inotify_32' FAILED: : error during recording:
911: --------------------------------------------------
911: inotify_32-BXStjBdce: /home/nical/dev/mozilla/rr/32/inotify.c:20: main: Assertion `"FAILED: !" && check_cond(desc >= 0)' failed.
911: --------------------------------------------------
911: record.out:
911: --------------------------------------------------
911: FAILED: errno=28 (No space left on device)
911: --------------------------------------------------
911: Test inotify_32 failed, leaving behind /tmp/rr-test-inotify_32-BXStjBdce
911: To replay the failed test, run
911:   _RR_TRACE_DIR=/tmp/rr-test-inotify_32-BXStjBdce rr replay
912: Test 'inotify_32' FAILED: : error during recording:
912: --------------------------------------------------
912: inotify_32-JSDiFyNRS: /home/nical/dev/mozilla/rr/32/inotify.c:20: main: Assertion `"FAILED: !" && check_cond(desc >= 0)' failed.
912: --------------------------------------------------
912: record.out:
912: --------------------------------------------------
912: FAILED: errno=28 (No space left on device)
912: --------------------------------------------------
912: Test inotify_32 failed, leaving behind /tmp/rr-test-inotify_32-JSDiFyNRS
912: To replay the failed test, run
912:   _RR_TRACE_DIR=/tmp/rr-test-inotify_32-JSDiFyNRS rr replay
504: Test 'clone_vfork' PASSED
129: Test 'inotify' FAILED: : error during recording:
129: --------------------------------------------------
129: inotify-pfOq1uA3p: /home/nical/dev/mozilla/rr/src/test/inotify.c:20: main: Assertion `"FAILED: !" && check_cond(desc >= 0)' failed.
129: --------------------------------------------------
129: record.out:
129: --------------------------------------------------
129: FAILED: errno=28 (No space left on device)
129: --------------------------------------------------
129: Test inotify failed, leaving behind /tmp/rr-test-inotify-pfOq1uA3p
129: To replay the failed test, run
129:   _RR_TRACE_DIR=/tmp/rr-test-inotify-pfOq1uA3p rr replay
  1/1564 Test #911: inotify-32 ................................................***Failed  Error regular expression found in output. Regex=[FAILED]  0.31 sec
  2/1564 Test #912: inotify-32-no-syscallbuf ..................................***Failed  Error regular expression found in output. Regex=[FAILED]  0.41 sec
  3/1564 Test #504: clone_vfork-no-syscallbuf .................................   Passed    0.81 sec
  4/1564 Test #129: inotify ...................................................***Failed  Error regular expression found in output. Regex=[FAILED]  0.82 sec
test 675
        Start 675: checkpoint_simple

675: Test command: /usr/bin/bash "/home/nical/dev/mozilla/rr/src/test/checkpoint_simple.run" "-b" "/home/nical/dev/mozilla/rr" "/home/nical/dev/mozilla/rr" "checkpoint_simple"
675: Test timeout computed to be: 9.99988e+06
test 509
        Start 509: condvar_stress

509: Test command: /usr/bin/bash "/home/nical/dev/mozilla/rr/src/test/condvar_stress.run" "-b" "/home/nical/dev/mozilla/rr" "/home/nical/dev/mozilla/rr" "condvar_stress"
509: Test timeout computed to be: 9.99988e+06
test 1455
        Start 1455: checkpoint_prctl_name-32

1455: Test command: /usr/bin/bash "/home/nical/dev/mozilla/rr/src/test/checkpoint_prctl_name.run" "-b" "/home/nical/dev/mozilla/rr" "/home/nical/dev/mozilla/rr" "checkpoint_prctl_name_32"
1455: Test timeout computed to be: 9.99988e+06
test 1300
        Start 1300: execve_loop-32-no-syscallbuf

1300: Test command: /usr/bin/bash "/home/nical/dev/mozilla/rr/src/test/execve_loop.run" "-n" "/home/nical/dev/mozilla/rr" "/home/nical/dev/mozilla/rr" "execve_loop_32"
1300: Test timeout computed to be: 9.99988e+06
130: Test 'inotify' FAILED: : error during recording:
130: --------------------------------------------------
130: inotify-xPlCd2yrq: /home/nical/dev/mozilla/rr/src/test/inotify.c:20: main: Assertion `"FAILED: !" && check_cond(desc >= 0)' failed.
130: --------------------------------------------------
130: record.out:
130: --------------------------------------------------
130: FAILED: errno=28 (No space left on device)
130: --------------------------------------------------
130: Test inotify failed, leaving behind /tmp/rr-test-inotify-xPlCd2yrq
130: To replay the failed test, run
130:   _RR_TRACE_DIR=/tmp/rr-test-inotify-xPlCd2yrq rr replay
393: Test 'stdout_redirect' PASSED
675: Checkpointing at event 1 ...
675: Test 'checkpoint_simple' PASSED
675: Checkpointing at event 4 ...
  5/1564 Test #130: inotify-no-syscallbuf .....................................***Failed  Error regular expression found in output. Regex=[FAILED]  0.92 sec
839: Test 'creat_address_not_truncated_32' FAILED: : error during replay:
458: Test 'writev' PASSED
322: Test 'setuid' PASSED
360: Test 'sigsuspend' PASSED
1142: /home/nical/dev/mozilla/rr/src/test/util.sh: line 219: 22594 Aborted                 (core dumped) _RR_TRACE_DIR="$workdir" rr $GLOBAL_OPTIONS replay -a $replayflags > replay.out 2> replay.err
1142: Test 'sigsuspend_32' FAILED: : error during replay:
1142: --------------------------------------------------
1142: [FATAL /home/nical/dev/mozilla/rr/src/ReplaySession.cc:742:check_ticks_consistency() errno: 0 'Success'] 
1142:  (task 22596 (rec:22514) at time 166)
1142:  -> Assertion `ticks_now == trace_ticks' failed to hold. ticks mismatch for 'SYSCALL: mmap2'; expected 4925, got 4916
1142: [FATAL /home/nical/dev/mozilla/rr/src/log.cc:261:emergency_debug() errno: 0 'Success'] (session doesn't look interactive, aborting emergency debugging)
1142: --------------------------------------------------
1142: replay.out:
1142: --------------------------------------------------
1142: --------------------------------------------------
1142: Test sigsuspend_32 failed, leaving behind /tmp/rr-test-sigsuspend_32-SVMTh1Yb8
1142: To replay the failed test, run
1142:   _RR_TRACE_DIR=/tmp/rr-test-sigsuspend_32-SVMTh1Yb8 rr replay
225: Test 'perf_event' PASSED
1492/1564 Test #1154: splice-32-no-syscallbuf ...................................   Passed    0.56 sec
1493/1564 Test #814: brk-32-no-syscallbuf ......................................   Passed    0.65 sec
1494/1564 Test #184: mmap_short_file-no-syscallbuf .............................   Passed    0.31 sec
1495/1564 Test #458: writev-no-syscallbuf ......................................   Passed    0.43 sec
1496/1564 Test #322: setuid-no-syscallbuf ......................................   Passed    0.42 sec
1497/1564 Test #360: sigsuspend-no-syscallbuf ..................................   Passed    0.43 sec
1498/1564 Test #1142: sigsuspend-32-no-syscallbuf ...............................***Failed  Error regular expression found in output. Regex=[FAILED]  0.43 sec
nical commented 8 years ago

Here are the files "left behind" by the failing tests that the logs refer to (/tmp/rr-test-inotify-pfOq1uA3p, etc.): makecheck-errors.tar.gz

rocallahan commented 8 years ago

I'm seeing some of those intermittent failures as well. It looks like some kind of bug in the latest Fedora 23 kernel :-(.

The inotify failures are odd. Maybe we're leaking some resource...

rocallahan commented 8 years ago

I'm seeing some of those intermittent failures as well. It looks like some kind of bug in the latest Fedora 23 kernel :-(.

They don't seem to interfere with normal rr usage though.

rocallahan commented 8 years ago

ENOSPC means that you hit the per-user inotify watch limit. I should just make the test handle that. Have you got a lot of half-dead inotify test processes lying around, maybe?

rocallahan commented 8 years ago

Pushed 80a05acbededb77d6333a4c1fb592ea7ac64d0cc to fix inotify. I'll leave this open for the intermittents. I'll probably have to bisect the kernel :-/

Keno commented 8 years ago

@rocallahan since you seem to be in the process of clearing out intermittent failures, here are the ones I saw today on latest master:

        1616 - exec_stop-32-no-syscallbuf (Failed)
    1492 - reverse_step_long-32-no-syscallbuf (Failed)
    1661 - reverse_step_signal-32 (Failed)
Keno commented 8 years ago

exec_stop-32-no-syscallbuf happens every minute or so when running 10 copies in parallel repeatedly.

Example output:

exception: Timeout exceeded.
<pexpect.spawn object at 0x7f260b69fa90>
version: 3.1
command: /home/kfischer/rr-build/bin/rr
args: ['/home/kfischer/rr-build/bin/rr', '--suppress-environment-warnings', '--check-cached-mmaps', '--fatal-errors', 'replay', '-x', '/home/kfischer/rr/src/test/test_setup.gdb']
searcher: <pexpect.searcher_re object at 0x7f260b69fbd0>
buffer (last 100 chars): 'dressSpace.cc:1121:assert_segments_match() errno: EIO] cached mmap:\r\nRemote connection closed\r\n(rr) '
before (last 100 chars): 'dressSpace.cc:1121:assert_segments_match() errno: EIO] cached mmap:\r\nRemote connection closed\r\n(rr) '
after: <class 'pexpect.TIMEOUT'>
match: None
match_index: None
exitstatus: None
flag_eof: False
pid: 10282
child_fd: 6
closed: False
timeout: 100
delimiter: <class 'pexpect.EOF'>
logfile: <open file 'gdb_rr.log', mode 'w' at 0x7f260b7459c0>
logfile_read: None
logfile_send: None
maxread: 2000
ignorecase: False
searchwindowsize: None
delaybeforesend: 0.05
delayafterclose: 0.1
delayafterterminate: 0.1
Test 'exec_stop_32' FAILED: debug script failed
Test exec_stop_32 failed, leaving behind /tmp/rr-test-exec_stop_32-TQmpHfdm2
To replay the failed test, run
  _RR_TRACE_DIR=/tmp/rr-test-exec_stop_32-TQmpHfdm2 rr replay

0% tests passed, 1 tests failed out of 1

Total Test time (real) = 101.89 sec

The following tests FAILED:
    1616 - exec_stop-32-no-syscallbuf (Failed)
rocallahan commented 8 years ago

What's in the gdb_rr.log?

I'm not planning to put much work into fixing more intermittents. I'll fix easy ones.

Keno commented 8 years ago

What's in the gdb_rr.log?

Where would I find that file, it's not in pwd.

Keno commented 8 years ago

FWIW, it's complaining that ends differ on the vdso mapping, but it also doesn't print the cached mmaps, so perhaps there is some corruption going on.

Keno commented 8 years ago

Here's the tail of the log (setting logfile to sys.stderr temporarily):

Remote debugging using :12104
warning: limiting remote suggested packet size (17073526 bytes) to 16384
Reading symbols from /lib/ld-linux.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/ld-linux.so.2
0xf77200d0 in ?? () from /lib/ld-linux.so.2
(rr) c
c
Continuing.
exec(./exec_self_32-vb5oQMVzk, 'self') ...

[Thread 12050.12050] #1 stopped.
stepi
0xf771e432 in __kernel_vsyscall ()
(rr) stepi

[Thread 12050.12050] #1 stopped.
0xf771e432 in __kernel_vsyscall ()
(rr) stepi
stepi

[Thread 12050.12050] #1 stopped.
c
0xf771e432 in __kernel_vsyscall ()
(rr) c
Continuing.

[Thread 12050.12050] #1 stopped.
b execve
0xf771e432 in __kernel_vsyscall ()
(rr) b execve
Breakpoint 1 at 0xf75b7f40
(rr) rc
rc
Continuing.
[ERROR /home/kfischer/rr/src/AddressSpace.cc:1121:assert_segments_match() errno: EIO] cached mmap:
Remote connection closed
(rr)
rocallahan commented 8 years ago

Where would I find that file, it's not in pwd.

/tmp/rr-test-exec_stop_32-TQmpHfdm2/gdb_rr.log

Keno commented 8 years ago

Thanks, useful to know.

Keno commented 8 years ago

Dumping the actual vs expected mapping to the log:

[ERROR /home/kfischer/rr/src/AddressSpace.cc:1121:assert_segments_match() errno: EIO] cached mmap:ends differ 0xf7740000-0xf7748000 rw-p 00000000 00:00 0           should be 0xf7740000-0xf7746000 rw-p 00000000 00:00 0          [vdso]
rocallahan commented 8 years ago

I guess this will need extra logging to figure out. I'm not quite motivated enough to try to repro and analyze it myself...

In the past I have seen weirdness involving the VDSO. Specifically, once in while the ELF symbol table is different, arguably corrupt, with some symbols have extra incorrect definitions. Some tweaks to Monkeypatcher.cc reference that issue. This issue could be related.

Keno commented 8 years ago

Hmm, the problem seems to be that the kernel is not merging the maps.

08048000-08049000 r-xp 00000000 00:00 0
08049000-0804b000 rw-p 00000000 00:00 0
68000000-68200000 rwxp 00000000 00:00 0
70000000-70001000 r-xp 00000000 fc:02 9318463                            /home/kfischer/rr-build/bin/rr_page_32_replay
f7767000-f778d000 r-xp 00000000 fc:02 9332444                            /tmp/rr-test-exec_stop_32-Bp56H9fMY/exec_self_32-Bp56H9fMY-0/mmap_hardlink_10_librrpreload_32.so
f778d000-f778f000 rw-p 00025000 fc:02 9332444                            /tmp/rr-test-exec_stop_32-Bp56H9fMY/exec_self_32-Bp56H9fMY-0/mmap_hardlink_11_librrpreload_32.so
f778f000-f7795000 rw-p 00000000 00:00 0
f7795000-f7797000 rw-p 00000000 00:00 0                                  [vdso]
f7797000-f7798000 r-xp 00000000 00:00 0
f7798000-f77b8000 r-xp 00000000 fc:02 25952614                           /lib/i386-linux-gnu/ld-2.19.so
f77b8000-f77ba000 rw-p 0001f000 fc:02 25952614                           /lib/i386-linux-gnu/ld-2.19.so
fff24000-fff46000 rw-p 00000000 00:00 0                                  [stack:20482]
  (heap: (nil)-(nil))
0x8048000-0x8049000 r-xp 00000000 00:00 0
0x8049000-0x804b000 rw-p 00000000 00:00 0
0x68000000-0x68200000 rwxp 00000000 00:00 0
0x70000000-0x70001000 r-xp 00000000 fc:02 9318463    /home/kfischer/rr-build/bin/rr_page_32_replay
0xf7767000-0xf778d000 r-xp 00000000 fc:02 9332444    /tmp/rr-test-exec_stop_32-Bp56H9fMY/exec_self_32-Bp56H9fMY-0/mmap_hardlink_10_librrpreload_32.so
0xf778d000-0xf778f000 rw-p 00025000 fc:02 9332444    /tmp/rr-test-exec_stop_32-Bp56H9fMY/exec_self_32-Bp56H9fMY-0/mmap_hardlink_11_librrpreload_32.so
0xf778f000-0xf7797000 rw-p 00000000 00:00 0
0xf7797000-0xf7798000 r-xp 00000000 00:00 0
0xf7798000-0xf77b8000 r-xp 00000000 fc:02 25952614   /lib/i386-linux-gnu/ld-2.19.so
0xf77b8000-0xf77ba000 rw-p 0001f000 fc:02 25952614   /lib/i386-linux-gnu/ld-2.19.so
0xfff24000-0xfff46000 rw-p 00000000 00:00 0

and we're not merging them in processing, perhaps for similar reasons as the vdso name being incorrect? My kernel version is 3.13, so I don't have the fix for that issue.

rocallahan commented 8 years ago

What do you think should be merged there that isn't?

Keno commented 8 years ago

f778f000-f7795000 and f7795000-f7797000

rocallahan commented 8 years ago

What makes you think f778f000-f7795000 should be the VDSO? That would be surprising to me since normally the VDSO is only 8K.

Keno commented 8 years ago

I'm not saying it is part of the VDSO, I'm just observing that it appears to have been merged in our recording of the mmaps, but not during replay. If I understand correctly, before comparing mappings, we try to merge adjacent ones as much as possible. This would have prevented the failure here had not one of the mappings been marked as VDSO. I'm speculating that that may be for the same reason as the comment a few lines before:

  // Kernels before 3.16 have a bug where any mapping at the original VDSO
    // address is marked [vdso] even if the VDSO was unmapped and replaced by
    // something else, so if the kernel reports [vdso] it may be spurious and
    // we skip this check. See kernel commit
    // a62c34bd2a8a3f159945becd57401e478818d51c.
rocallahan commented 8 years ago

I'm just observing that it appears to have been merged in our recording of the mmaps, but not during replay.

This isn't a recording vs replay issue, since we don't record the maps list. This is a kernel vs rr during replay issue.

But you may be right about that comment being related. Maybe we coalesce because none of those areas are the VDSO, but the pre-3.16 kernel does not coalesce because it (incorrectly) thinks one of those areas is the VDSO. Though it's not clear in the output you showed where rr thinks the VDSO should be. Probably be helpful to include that as an extra line in the dump.

Keno commented 8 years ago

This isn't a recording vs replay issue, since we don't record the maps list. This is a kernel vs rr during replay issue.

Right, sorry.

Adding that logging output, I think this theory is making sense:

08048000-08049000 r-xp 00000000 00:00 0
08049000-0804b000 rw-p 00000000 00:00 0
68000000-68200000 rwxp 00000000 00:00 0
70000000-70001000 r-xp 00000000 fc:02 9318463                            /home/kfischer/rr-build/bin/rr_page_32_replay
f7789000-f77af000 r-xp 00000000 fc:02 9332444                            /tmp/rr-test-exec_stop_32-tPuDqkDVZ/exec_self_32-tPuDqkDVZ-0/mmap_hardlink_10_librrpreload_32.so
f77af000-f77b1000 rw-p 00025000 fc:02 9332444                            /tmp/rr-test-exec_stop_32-tPuDqkDVZ/exec_self_32-tPuDqkDVZ-0/mmap_hardlink_11_librrpreload_32.so
f77b1000-f77b7000 rw-p 00000000 00:00 0                                  [vdso]
f77b7000-f77b9000 rw-p 00000000 00:00 0
f77b9000-f77ba000 r-xp 00000000 00:00 0
f77ba000-f77da000 r-xp 00000000 fc:02 25952614                           /lib/i386-linux-gnu/ld-2.19.so
f77da000-f77dc000 rw-p 0001f000 fc:02 25952614                           /lib/i386-linux-gnu/ld-2.19.so
ffdd2000-ffdf4000 rw-p 00000000 00:00 0                                  [stack:28045]
  (heap: (nil)-(nil))
0x8048000-0x8049000 r-xp 00000000 00:00 0
0x8049000-0x804b000 rw-p 00000000 00:00 0
0x68000000-0x68200000 rwxp 00000000 00:00 0
0x70000000-0x70001000 r-xp 00000000 fc:02 9318463    /home/kfischer/rr-build/bin/rr_page_32_replay
0xf7789000-0xf77af000 r-xp 00000000 fc:02 9332444    /tmp/rr-test-exec_stop_32-tPuDqkDVZ/exec_self_32-tPuDqkDVZ-0/mmap_hardlink_10_librrpreload_32.so
0xf77af000-0xf77b1000 rw-p 00025000 fc:02 9332444    /tmp/rr-test-exec_stop_32-tPuDqkDVZ/exec_self_32-tPuDqkDVZ-0/mmap_hardlink_11_librrpreload_32.so
0xf77b1000-0xf77b9000 rw-p 00000000 00:00 0
0xf77b9000-0xf77ba000 r-xp 00000000 00:00 0
0xf77ba000-0xf77da000 r-xp 00000000 fc:02 25952614   /lib/i386-linux-gnu/ld-2.19.so
0xf77da000-0xf77dc000 rw-p 0001f000 fc:02 25952614   /lib/i386-linux-gnu/ld-2.19.so
0xffdd2000-0xffdf4000 rw-p 00000000 00:00 0
[ERROR /home/kfischer/rr/src/AddressSpace.cc:1123:assert_segments_match() errno: EIO] vdso is at0xf77b9000-0xf77ba000 r-xp 00000000 00:00 0
Keno commented 8 years ago

I'm still not quite sure why the kernel itself is not merging those maps, but I think the point is that we would have merged them in post processing had the kernel not (incorrectly) told us that one of them is vdso.

rocallahan commented 8 years ago

Right, thanks!

I think it's probably not worth trying to do anything about this given that it's fixed in newer kernels. Also, there's a range of kernels where intermittents are inevitable due to some perf-event bug that got fixed recently. So I'm satisfied with declaring that intermittents that only repro on pre-4.7 kernels aren't getting fixed.

Keno commented 8 years ago

Also, there's a range of kernels where intermittents are inevitable due to some perf-event bug that got fixed recently.

Do you know what kernel version that started with, so I can try to get one after 3.16, but before that one?

rocallahan commented 8 years ago

I'm still not quite sure why the kernel itself is not merging those maps,

Because it thinks one is VDSO and the other isn't, right?

but I think the point is that we would have merged them in post processing had the kernel not (incorrectly) told us that one of them is vdso.

Yes.

We could have some special-casing so that during replay, after we've done our exec emulation, we ignore the kernel-reported VDSO. But it's probably not worth it.

Do you know what kernel version that started with, so I can try to get one after 3.16, but before that one?

I think somewhere in 4.x.x so 3.17-3.19 are probably OK. But I'm not sure.

Keno commented 8 years ago

Because it thinks one is VDSO and the other isn't, right?

I'm not sure that's the case. From what I can tell, the misreporting vdso was only a printing issue (to /proc/xxx/maps), while the mappings were actually being tracked correctly in the backend.

dfaure commented 8 years ago

I'm, too, seeing inotify_add_watch returning ENOSPC even though the number of watches in the system (around 200) is far below /proc/sys/fs/inotify/max_user_watches (65536). not in mozilla, in other code (Qt and KF5, both using inotify_add_watch independently). It really looks like a regression in the kernel, did you guys figure out what kernel version introduced this? I have 4.7.2 right now.

rocallahan commented 8 years ago

No, we didn't. Personally I haven't seen any failures in the inotify test for many months.