rr-debugger / rr

Record and Replay Framework
http://rr-project.org/
Other
9.15k stars 584 forks source link

249 tests failed for 5.6.0: detach_sigkill etc #3625

Closed vt-alt closed 1 year ago

vt-alt commented 1 year ago

On bare metal Intel(R) Core(TM) i9-10900 CPU @ 2.80GHz Linux 6.4.4 there are 249 test failures like this (first 10):

 114/1417 Test  #114: detach_sigkill .................................................***Failed  Error regular expression found in output. Regex=[FAILED]129.86 sec
 115/1417 Test  #115: detach_sigkill-no-syscallbuf ...................................***Failed  Error regular expression found in output. Regex=[FAILED]129.87 sec
 196/1417 Test  #196: futex_exit_race_sigsegv ........................................***Failed  Error regular expression found in output. Regex=[FAILED]130.09 sec
 197/1417 Test  #197: futex_exit_race_sigsegv-no-syscallbuf ..........................***Failed  Error regular expression found in output. Regex=[FAILED]130.00 sec
 466/1417 Test  #466: prctl_caps .....................................................***Failed  Error regular expression found in output. Regex=[FAILED]  0.23 sec
 467/1417 Test  #467: prctl_caps-no-syscallbuf .......................................***Failed  Error regular expression found in output. Regex=[FAILED]  0.21 sec
 866/1417 Test  #866: alternate_thread_diversion .....................................***Failed  Error regular expression found in output. Regex=[FAILED]130.10 sec
 867/1417 Test  #867: alternate_thread_diversion-no-syscallbuf .......................***Failed  Error regular expression found in output. Regex=[FAILED]130.12 sec
 892/1417 Test  #892: block_clone_checkpoint .........................................***Failed  Error regular expression found in output. Regex=[FAILED]130.11 sec
 893/1417 Test  #893: block_clone_checkpoint-no-syscallbuf ...........................***Failed  Error regular expression found in output. Regex=[FAILED]130.10 sec
...

And 1168 tests are passed. Tests are run from testsuite after install. Full log LastTest.log (15M).

GitMensch commented 1 year ago

As 5.7.0 is out (since 11 hours before your post) - can you please redo the build based on that, then after build do bin/rr record true and bin/rr replay (to check it works in general), then run the testsuite again?

Note: Works fine here with the slightly older i7-1165G7.

If there is any failure please output uname -a && cat /etc/*release*, too.

vt-alt commented 1 year ago

(Looks like it's released 1 hour after my post), but of course I will retest .

khuey commented 1 year ago

If you run ./obj/bin/rr record ./obj/bin/detach_sigkill manually (edit paths as necessary) does rr produce a trace or does it just hang?

GitMensch commented 1 year ago

Looks like it's released 1 hour after my post

I've though 11 AM to be earlier than 10 PM, but you never know if those timestamps are all correct in the first place. Thanks for re-running the test.

And your issue already led me to recheck on a newer Intel (with Debian) and finding that it works nicely (and I can't reproduce the bug that bugs me most there, even with nearly identical kernel :-/ , so thank you for the issue in any case.

vt-alt commented 1 year ago

I've though 11 AM to be earlier than 10 PM, but you never know if those timestamps are all correct in the first place.

Ah you are right! I compared dates and in my time zone they both PM but 12 PM is actually 11 hours before 11 PM not 1 hour later.

Oct 3, 2023, 12:14 PM GMT+3
Oct 3, 2023, 11:20 PM GMT+3

@khuey @GitMensch

Thank you all for suggestions I will try new tests later this week.

vt-alt commented 1 year ago

If you run ./obj/bin/rr record ./obj/bin/detach_sigkill manually (edit paths as necessary) does rr produce a trace or does it just hang?

It just hangs.

# /usr/lib64/rr/testsuite/obj/bin/detach_sigkill
EXIT-SUCCESS
# rr /usr/lib64/rr/testsuite/obj/bin/detach_sigkill
rr: Saving execution to trace directory `/root/.local/share/rr/detach_sigkill-0'.

Processes:

root     1722954  0.0  0.0 850836 36108 pts/10   Sl+  17:00   0:00  |           \_ rr /usr/lib64/rr/testsuite/obj/bin/detach_sigkill
root     1722966  0.0  0.0   5884  3176 pts/10   S+   17:00   0:00  |               \_ /usr/lib64/rr/testsuite/obj/bin/detach_sigkill
root     1722967  0.0  0.0   8956  7760 pts/10   t+   17:00   0:00  |               |   \_ /usr/lib64/rr/testsuite/obj/bin/detach_sigkill
root     1722968  0.0  0.0   8956  8808 pts/10   S+   17:00   0:00  |               \_ /usr/lib64/rr/testsuite/obj/bin/detach_sigkill

# cat /proc/1722954/stack
[<0>] do_wait+0x160/0x300
[<0>] kernel_wait4+0x9e/0x140
[<0>] do_syscall_64+0x59/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x72/0xdc
# cat /proc/1722966/stack
[<0>] do_wait+0x160/0x300
[<0>] kernel_wait4+0x9e/0x140
[<0>] do_syscall_64+0x59/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x72/0xdc
# cat /proc/1722967/stack
[<0>] ptrace_stop.part.0+0x165/0x2d0
[<0>] ptrace_do_notify+0xa6/0xe0
[<0>] ptrace_notify+0x59/0x90
[<0>] do_exit+0x8c0/0xa80
[<0>] do_group_exit+0x2d/0x80
[<0>] __x64_sys_exit_group+0x14/0x20
[<0>] do_syscall_64+0x59/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x72/0xdc
# cat /proc/1722968/stack
[<0>] __do_sys_pause+0x30/0x60
[<0>] do_syscall_64+0x59/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x72/0xdc
#

Now I'm going to try 5.7.0.

vt-alt commented 1 year ago

With 5.7.0 test detach_sigkill succeeds but other tests still fail.

# rr --version
rr version 5.7.0 
# rr /usr/lib64/rr/testsuite/obj/bin/detach_sigkill
rr: Saving execution to trace directory `/root/.local/share/rr/detach_sigkill-1'.
EXIT-SUCCESS

This stat is for 940/1473 tests: 925 Passed, 15 Failed.

(I stopped testing in the middle because I need to update the system before weekend and failing tests take so long to timeout. But I will rerun.)

Top failures:

 894/1473 Test  #894: alternate_thread_diversion .....................................***Failed  Error regular expression found in output. Regex=[FAILED]129.11 sec
 895/1473 Test  #895: alternate_thread_diversion-no-syscallbuf .......................***Failed  Error regular expression found in output. Regex=[FAILED]129.13 sec
 920/1473 Test  #920: block_clone_checkpoint .........................................***Failed  Error regular expression found in output. Regex=[FAILED]129.11 sec
 921/1473 Test  #921: block_clone_checkpoint-no-syscallbuf ...........................***Failed  Error regular expression found in output. Regex=[FAILED]129.14 sec
 928/1473 Test  #928: blocked_bad_ip .................................................***Failed  Error regular expression found in output. Regex=[FAILED]129.24 sec
 929/1473 Test  #929: blocked_bad_ip-no-syscallbuf ...................................***Failed  Error regular expression found in output. Regex=[FAILED]129.23 sec
 934/1473 Test  #934: breakpoint .....................................................***Failed  Error regular expression found in output. Regex=[FAILED]129.11 sec
 935/1473 Test  #935: breakpoint-no-syscallbuf .......................................***Failed  Error regular expression found in output. Regex=[FAILED]129.12 sec
 936/1473 Test  #936: breakpoint_conditions ..........................................***Failed  Error regular expression found in output. Regex=[FAILED]129.12 sec
 937/1473 Test  #937: breakpoint_conditions-no-syscallbuf ............................***Failed  Error regular expression found in output. Regex=[FAILED]129.12 sec
 938/1473 Test  #938: breakpoint_overlap .............................................***Failed  Error regular expression found in output. Regex=[FAILED]129.14 sec
 939/1473 Test  #939: breakpoint_overlap-no-syscallbuf ...............................***Failed  Error regular expression found in output. Regex=[FAILED]129.14 sec
 940/1473 Test  #940: call_function ..................................................***Failed  Error regular expression found in output. Regex=[FAILED]129.12 sec

Strangely manual run of some tests succeeds:

#  /usr/lib64/rr/testsuite/obj/bin/block_clone_checkpoint 
EXIT-SUCCESS
# rr /usr/lib64/rr/testsuite/obj/bin/block_clone_checkpoint 
rr: Saving execution to trace directory `/root/.local/share/rr/block_clone_checkpoint-0'.
EXIT-SUCCESS
# rr /usr/lib64/rr/testsuite/obj/bin/block_clone_checkpoint 
rr: Saving execution to trace directory `/root/.local/share/rr/block_clone_checkpoint-1'.
EXIT-SUCCESS
# /usr/lib64/rr/testsuite/obj/bin/blocked_bad_ip 
EXIT-SUCCESS
Segmentation fault (core dumped)
# rr /usr/lib64/rr/testsuite/obj/bin/blocked_bad_ip 
rr: Saving execution to trace directory `/root/.local/share/rr/blocked_bad_ip-0'.
EXIT-SUCCESS
Segmentation fault
# /usr/lib64/rr/testsuite/obj/bin/breakpoint_conditions 
EXIT-SUCCESS
# rr /usr/lib64/rr/testsuite/obj/bin/breakpoint_conditions 
rr: Saving execution to trace directory `/root/.local/share/rr/breakpoint_conditions-0'.
EXIT-SUCCESS

I will rerun with --verbose (also will upgrade kernel to 6.5.7).

vt-alt commented 1 year ago

Log from the last testing: LastTest-2023-10-14.log 1222 tests passed, 251 tests failed.

 852/1473 Test  #852: vfork_done .....................................................***Failed  Error regular expression found in output. Regex=[FAILED]129.85 sec
 894/1473 Test  #894: alternate_thread_diversion .....................................***Failed  Error regular expression found in output. Regex=[FAILED]128.09 sec
 895/1473 Test  #895: alternate_thread_diversion-no-syscallbuf .......................***Failed  Error regular expression found in output. Regex=[FAILED]128.11 sec
 920/1473 Test  #920: block_clone_checkpoint .........................................***Failed  Error regular expression found in output. Regex=[FAILED]128.10 sec
 921/1473 Test  #921: block_clone_checkpoint-no-syscallbuf ...........................***Failed  Error regular expression found in output. Regex=[FAILED]128.11 sec
 928/1473 Test  #928: blocked_bad_ip .................................................***Failed  Error regular expression found in output. Regex=[FAILED]128.16 sec
 929/1473 Test  #929: blocked_bad_ip-no-syscallbuf ...................................***Failed  Error regular expression found in output. Regex=[FAILED]128.19 sec
...

Manual run of the first one (not sure if I am correctly running it):

/usr/lib64/rr/testsuite/obj# ./bin/vfork_done
EXIT-SUCCESS
/usr/lib64/rr/testsuite/obj# rr ./bin/vfork_done
rr: Saving execution to trace directory `/root/.local/share/rr/vfork_done-0'.
EXIT-SUCCESS
/usr/lib64/rr/testsuite/obj# /bin/bash "source_dir/src/test/basic_test.run" "vfork_done" "" "bin_dir" "120"
Test 'vfork_done' PASSED
/usr/lib64/rr/testsuite/obj# ctest -R vfork_done
Test project /usr/lib64/rr/testsuite/obj
    Start  852: vfork_done
1/4 Test  #852: vfork_done .......................   Passed    1.47 sec
    Start  853: vfork_done-no-syscallbuf
2/4 Test  #853: vfork_done-no-syscallbuf .........   Passed    1.47 sec
    Start 1458: vfork_done_clone
3/4 Test #1458: vfork_done_clone .................   Passed    1.47 sec
    Start 1459: vfork_done_clone-no-syscallbuf
4/4 Test #1459: vfork_done_clone-no-syscallbuf ...   Passed    1.46 sec

100% tests passed, 0 tests failed out of 4

Total Test time (real) =   5.89 sec

So this fails only when run in full test suite?

vt-alt commented 1 year ago

So this fails only when run in full test suite?

Ok next test failed even on standalone run:

/usr/lib64/rr/testsuite/obj# ctest -R alternate_thread_diversion
Test project /usr/lib64/rr/testsuite/obj
    Start 894: alternate_thread_diversion
1/2 Test #894: alternate_thread_diversion .................***Failed  Error regular expression found in output. Regex=[FAILED]128.10 sec
    Start 895: alternate_thread_diversion-no-syscallbuf
2/2 Test #895: alternate_thread_diversion-no-syscallbuf ...***Failed  Error regular expression found in output. Regex=[FAILED]128.10 sec

0% tests passed, 2 tests failed out of 2

Total Test time (real) = 256.22 sec

The following tests FAILED:
        894 - alternate_thread_diversion (Failed)
        895 - alternate_thread_diversion-no-syscallbuf (Failed)
Errors while running CTest
GitMensch commented 1 year ago

Now is the time for... ctest -R alternate_thread_diversion -vv

vt-alt commented 1 year ago

I don't see the difference from ctest -R alternate_thread_diversion -VV output, but here it is. alternate_thread_diversion.log

GitMensch commented 1 year ago

The error show in the log is that it cannot find any debug symbols:

Reading symbols from /tmp/rr-test-alternate_thread_diversion-VoXtkWN6a/alternate_thread_diversion-VoXtkWN6a-0/mmap_hardlink_4_alternate_thread_diversion-VoXtkWN6a... Missing separate debuginfo for /tmp/rr-test-alternate_thread_diversion-VoXtkWN6a/alternate_thread_diversion-VoXtkWN6a-0/mmap_hardlink_4_alternate_thread_diversion-VoXtkWN6a. Try to install the hash file /usr/lib/debug/.build-id/21/2cb1258b216d5f5bc8b4bf6f64c2ffc6e8d371.debug (No debugging symbols found in /tmp/rr-test-alternate_thread_diversion-VoXtkWN6a/alternate_thread_diversion-VoXtkWN6a-0/mmap_hardlink_4_alternate_thread_diversion-VoXtkWN6a)

What does file say to this binary? Here's my output:

$ file bin/alternate_thread_diversion
bin/alternate_thread_diversion: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=4a08cbffd1e146bd76373cf27e869db3e615fd7e, for GNU/Linux 3.2.0, with debug_info, not stripped
$ gdb -q bin/alternate_thread_diversion
Reading symbols from bin/alternate_thread_diversion...
(gdb) quit
vt-alt commented 1 year ago

Thanks! After installing rr-testsuite-debuginfo alternate_thread_diversion tests passed:

/usr/lib64/rr/testsuite/obj# ctest -R alternate_thread_diversion -VV | tee log6
UpdateCTestConfiguration  from :/usr/lib64/rr/testsuite/obj/DartConfiguration.tcl
UpdateCTestConfiguration  from :/usr/lib64/rr/testsuite/obj/DartConfiguration.tcl
Test project /usr/lib64/rr/testsuite/obj
Constructing a list of tests
Done constructing a list of tests
Updating test list for fixtures
Added 0 tests to meet fixture requirements
Checking test dependency graph...
Checking test dependency graph end
test 894
    Start 894: alternate_thread_diversion

894: Test command: /bin/bash "source_dir/src/test/alternate_thread_diversion.run" "alternate_thread_diversion" "" "bin_dir" "120"
894: Working Directory: /usr/lib64/rr/testsuite/obj
894: Test timeout computed to be: 10000000
894: Test 'alternate_thread_diversion' PASSED
1/2 Test #894: alternate_thread_diversion .................   Passed    0.52 sec
test 895
    Start 895: alternate_thread_diversion-no-syscallbuf

895: Test command: /bin/bash "source_dir/src/test/alternate_thread_diversion.run" "alternate_thread_diversion" "-n" "bin_dir" "120"
895: Working Directory: /usr/lib64/rr/testsuite/obj
895: Test timeout computed to be: 10000000
895: Test 'alternate_thread_diversion' PASSED
2/2 Test #895: alternate_thread_diversion-no-syscallbuf ...   Passed    0.50 sec

The following tests passed:
        alternate_thread_diversion
        alternate_thread_diversion-no-syscallbuf

100% tests passed, 0 tests failed out of 2

Total Test time (real) =   1.04 sec

After debuginfo install:

/usr/lib64/rr/testsuite/obj# file bin/alternate_thread_diversion
bin/alternate_thread_diversion: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=212cb1258b216d5f5bc8b4bf6f64c2ffc6e8d371, for GNU/Linux 3.2.0, stripped
/usr/lib64/rr/testsuite/obj# gdb -q bin/alternate_thread_diversion
Reading symbols from bin/alternate_thread_diversion...
Reading symbols from /usr/lib/debug/usr/lib64/rr/testsuite/obj/bin/alternate_thread_diversion.debug...
(gdb) quit

I will re-run all tests again today.

Does having debug symbols important for these tests?

I was running tests after build in virtual machine but because some tests fail I split testsuite into separate package (like in OpenSUSE) but I didn't realize installing debuginfo is also important to run them.

GitMensch commented 1 year ago

The debuginfo for the tests that are built in this package are needed to run these tests... You normally would not "install" the test programs so you won't have their debuginfo either.

vt-alt commented 1 year ago

Thanks for clarification! I will not separate debuginfo from testsuite then.

But (with debuginfo installed) there are still 16 test failures:

1242/1473 Test #1242: watchpoint_at_sched ............................................***Failed  Error regular expression found in output. Regex=[FAILED]128.09 sec
1243/1473 Test #1243: watchpoint_at_sched-no-syscallbuf ..............................***Failed  Error regular expression found in output. Regex=[FAILED]128.11 sec
1244/1473 Test #1244: watchpoint_before_signal .......................................***Failed  Error regular expression found in output. Regex=[FAILED]128.11 sec
1245/1473 Test #1245: watchpoint_before_signal-no-syscallbuf .........................***Failed  Error regular expression found in output. Regex=[FAILED]128.11 sec
1370/1473 Test #1370: replay_serve_files .............................................***Failed  Error regular expression found in output. Regex=[FAILED]130.11 sec
1371/1473 Test #1371: replay_serve_files-no-syscallbuf ...............................***Failed  Error regular expression found in output. Regex=[FAILED]130.13 sec
1392/1473 Test #1392: reverse_step_signal ............................................***Failed  Error regular expression found in output. Regex=[FAILED]128.12 sec
1393/1473 Test #1393: reverse_step_signal-no-syscallbuf ..............................***Failed  Error regular expression found in output. Regex=[FAILED]128.12 sec
1396/1473 Test #1396: reverse_watchpoint .............................................***Failed  Error regular expression found in output. Regex=[FAILED]128.11 sec
1397/1473 Test #1397: reverse_watchpoint-no-syscallbuf ...............................***Failed  Error regular expression found in output. Regex=[FAILED]128.10 sec
1442/1473 Test #1442: tick0 ..........................................................***Failed  Error regular expression found in output. Regex=[FAILED]  0.26 sec
...

Full log LastTest.7.log

Manually poking the first one:

/usr/lib64/rr/testsuite/obj# ./bin/watchpoint_at_sched
EXIT-SUCCESS
/usr/lib64/rr/testsuite/obj# rr ./bin/watchpoint_at_sched
rr: Saving execution to trace directory `/root/.local/share/rr/watchpoint_at_sched-0'.
EXIT-SUCCESS
/usr/lib64/rr/testsuite/obj# ctest -R watchpoint_at_sched
Test project /usr/lib64/rr/testsuite/obj
    Start 1242: watchpoint_at_sched
1/2 Test #1242: watchpoint_at_sched .................***Failed  Error regular expression found in output. Regex=[FAILED]128.09 sec
    Start 1243: watchpoint_at_sched-no-syscallbuf
2/2 Test #1243: watchpoint_at_sched-no-syscallbuf ...***Failed  Error regular expression found in output. Regex=[FAILED]128.09 sec

0% tests passed, 2 tests failed out of 2

I also noticed that hardlink_mmapped_files (which is Passed) and replay_serve_files (which is Failed) have error cp: cannot stat '/usr/lib64/rr/testsuite/obj/lib/rr/libtest_lib.so': No such file or directory while that library is installed into /usr/lib64/rr/libtest_lib.so. Maybe this is not important since first test is Passed.

GitMensch commented 1 year ago

ctest --rerun-failed -vv

Will rerun only the failed tests and give all details for those.

vt-alt commented 1 year ago
/usr/lib64/rr/testsuite/obj# ctest --rerun-failed -vv |& tee log8
Test project /usr/lib64/rr/testsuite/obj
    Start 1242: watchpoint_at_sched
1/2 Test #1242: watchpoint_at_sched .................***Failed  Error regular expression found in output. Regex=[FAILED]128.10 sec
    Start 1243: watchpoint_at_sched-no-syscallbuf
2/2 Test #1243: watchpoint_at_sched-no-syscallbuf ...***Failed  Error regular expression found in output. Regex=[FAILED]128.10 sec

0% tests passed, 2 tests failed out of 2

Total Test time (real) = 256.24 sec

The following tests FAILED:
        1242 - watchpoint_at_sched (Failed)
        1243 - watchpoint_at_sched-no-syscallbuf (Failed)
Errors while running CTest
Output from these tests are in: /usr/lib64/rr/testsuite/obj/Testing/Temporary/LastTest.log
Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.

LastTest.log: LastTest.8.log

ps. If you need other tests I should rerun full tests again. But doesn't previous logs include that? Also, are you sure you need -vv and not -VV ? ...Will rerun with both.

GitMensch commented 1 year ago

Oops - https://github.com/rr-debugger/rr/wiki/Building-And-Installing#tests so yes, I've meant -VV. Maybe it is time to close this issue with a note that this was mainly a problem of stripping debug info from test modules and create one issue for each "group" of failing tests that don't have an issue yet (using search function first), each with uname -a && cat/proc/cpuinfo along with verbose logs for that test.

vt-alt commented 1 year ago

OK lets close this one. ~BTW, I don't see a difference between normal ctest and ctest -VV in LastTest.log. They seems only differ on the stdout that I didn't attach because it's same or less than LastTest.log.~

~ps. Oh yes formatting is different so I cannot really grep on -VV log easily.~ Never mind this please.

vt-alt commented 5 months ago

JFYI In 5.8.0 in the same conditions these tests failed:

98% tests passed, 24 tests failed out of 1515

Total Test time (real) = 3939.12 sec

The following tests FAILED:
         88 - clone_vfork_pidfd (Failed)
         89 - clone_vfork_pidfd-no-syscallbuf (Failed)
        768 - sioc (Failed)
        769 - sioc-no-syscallbuf (Failed)
        878 - vfork_done (Failed)
        879 - vfork_done-no-syscallbuf (Failed)
        1196 - step_into_lib (Failed)
        1197 - step_into_lib-no-syscallbuf (Failed)
        1274 - vsyscall_reverse_next (Failed)
        1275 - vsyscall_reverse_next-no-syscallbuf (Failed)
        1410 - replay_serve_files (Failed)
        1411 - replay_serve_files-no-syscallbuf (Failed)
        1432 - reverse_step_signal (Failed)
        1433 - reverse_step_signal-no-syscallbuf (Failed)
        1436 - reverse_watchpoint (Failed)
        1437 - reverse_watchpoint-no-syscallbuf (Failed)
        1470 - step_signal (Failed)
        1471 - step_signal-no-syscallbuf (Failed)
        1482 - tick0 (Failed)
        1483 - tick0-no-syscallbuf (Failed)
        1484 - tick0_less (Failed)
        1485 - tick0_less-no-syscallbuf (Failed)
        1512 - watchpoint_unaligned2 (Failed)
        1513 - watchpoint_unaligned2-no-syscallbuf (Failed)
Errors while running CTest