shadow / shadow

Shadow is a discrete-event network simulator that directly executes real application code, enabling you to simulate distributed systems with thousands of network-connected processes in realistic and scalable private network experiments using your laptop, desktop, or server running Linux.
https://shadow.github.io
Other
1.45k stars 241 forks source link

flaky tests (timeouts) #824

Closed sporksmith closed 3 years ago

sporksmith commented 4 years ago

Describe the issue

The shadow-preload tests are a bit flaky. This is masked a bit in our CI because we automatically retry failed tests once, and allow the workflow to succeed if they pass on the second try.

To Reproduce

I've seen these failures occasionally in the CI, but don't have a pointer handy to one. I was able to reproduce locally with:

CC=gcc ./setup build --clean --debug --test -j8 && ./setup install
cd build
ctest -R shadow-preload -j8 --repeat-until-fail 1000 --output-on-failure --timeout 1

Output:

25% tests passed, 6 tests failed out of 8

Total Test time (real) = 106.21 sec

The following tests FAILED:
      7 - bind-shadow-preload (Timeout)
     10 - cpp-shadow-preload (Timeout)
     15 - file-shadow-preload (Timeout)
     24 - sleep-shadow-preload (Timeout)
     30 - udp-shadow-preload (Timeout)
     33 - unistd-shadow-preload (Timeout)

Operating System (please complete the following information):

  • OS and version: post the output of lsb_release -d

Description: Ubuntu 18.04.4 LTS

  • Kernel version: post the output of uname -a

Linux vusiri 5.3.0-53-generic #47~18.04.1-Ubuntu SMP Thu May 7 13:10:50 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Shadow (please complete the following information):

  • Version: post the output of shadow --version

Shadow v1.13.2-883-ged221d07-dirty 2020-06-05 (built 2020-06-05) running GLib v2.56.4 and IGraph v0.7.1

Additional context Add any other context about the problem here.

sporksmith commented 4 years ago

The full test output is over a gig. Filtered to just the failing output with:

tac LastTest.log | sed -n '/Test Failed./,/Testing: /p' | tac > failing_tests.txt

failing_tests.txt

sporksmith commented 4 years ago

From failing_tests.txt, above, only two of the tests had output.

Tail of file-shadow-preload:

00:00:00.014087 [thread-0] 00:00:01.000000000 [debug] [testnode~11.0.0.1] [syscall_handler.c:118] [_syscallhandler_post_syscall] SYSCALL_HANDLER_POST(testfile,pid=1000): syscall 228 clock_gettime result: state=DONE code=0
00:00:00.014234 [thread-0] 00:00:01.000000000 [debug] [testnode~11.0.0.1] [thread_preload.c:239] [_threadpreload_waitForNextEvent] received shim_event 3
00:00:00.014246 [thread-0] 00:00:01.000000000 [debug] [testnode~11.0.0.1] [syscall_handler.c:103] [_syscallhandler_pre_syscall] SYSCALL_HANDLER_PRE(testfile,pid=1000): handling syscall 228 clock_gettime
00:00:00.014252 [thread-0] 00:00:01.000000000 [debug] [testnode~11.0.0.1] [time.c:84] [syscallhandler_clock_gettime] syscallhandler_clock_gettime with 1 0x7ffd09ebc140
00:00:00.014258 [thread-0] 00:00:01.000000000 [debug] [testnode~11.0.0.1] [syscall_handler.c:118] [_syscallhandler_post_syscall] SYSCALL_HANDLER_POST(testfile,pid=1000): syscall 228 clock_gettime result: state=DONE code=0

Tail of cpp-shadow-preload:

00:00:00.011583 [thread-0] 00:00:02.000000000 [debug] [testnode~11.0.0.1] [thread_preload.c:239] [_threadpreload_waitForNextEvent] received shim_event 3
00:00:00.011594 [thread-0] 00:00:02.000000000 [debug] [testnode~11.0.0.1] [syscall_handler.c:103] [_syscallhandler_pre_syscall] SYSCALL_HANDLER_PRE(testcpp,pid=1000): handling syscall 228 clock_gettime
00:00:00.011607 [thread-0] 00:00:02.000000000 [debug] [testnode~11.0.0.1] [time.c:84] [syscallhandler_clock_gettime] syscallhandler_clock_gettime with 0 0x7ffe86f62e30
00:00:00.011614 [thread-0] 00:00:02.000000000 [debug] [testnode~11.0.0.1] [syscall_handler.c:118] [_syscallhandler_post_syscall] SYSCALL_HANDLER_POST(testcpp,pid=1000): syscall 228 clock_gettime result: state=DONE code=0
sporksmith commented 4 years ago

Tail of file.shadow-preload.dastnode/testnode.testfile.1000.stderr:

00:00:00.014303 [00:00:01.000000000] [shd-shim] [debug] [/home/jnewsome/projects/shadow/src/shim/preload_syscalls.c:182] [recv] Making real syscall recvfrom
00:00:00.014337 [00:00:01.000000000] [shd-shim] [debug] [/home/jnewsome/projects/shadow/src/shim/preload_syscalls.c:53] [_shadow_syscall_event] got response of type 6 on 7
00:00:00.014347 [00:00:01.000000000] [shd-shim] [debug] [/home/jnewsome/projects/shadow/src/shim/preload_syscalls.c:50] [_shadow_syscall_event] waiting for event on 7
00:00:00.014354 [00:00:01.000000000] [shd-shim] [debug] [/home/jnewsome/projects/shadow/src/shim/preload_syscalls.c:182] [recv] Making real syscall recvfrom
00:00:00.014363 [00:00:01.000000000] [shd-shim] [debug] [/home/jnewsome/projects/shadow/src/shim/preload_syscalls.c:53] [_shadow_syscall_event] got response of type 4 on 7

file.shadow-preload.data/hosts/testnode/testnode.testfile.1000.stdout:

/file/newfile: OK
/file/open_close: OK
/file/write: OK
/file/read: OK
/file/fwrite: OK
/file/fread: OK
/file/iov:
sporksmith commented 4 years ago

The cpp test looks like the test program finished successfully, but then shadow hung.

$ cat cpp.shadow-preload.data/hosts/testnode/testnode.testcpp.1000.stderr 
00:00:00.010395 [00:00:00.000000000] [shd-shim] [debug] [/home/jnewsome/projects/shadow/src/shim/shim.c:119] [_shim_load] waiting for event on 7
00:00:00.010446 [00:00:00.000000000] [shd-shim] [debug] [/home/jnewsome/projects/shadow/src/shim/preload_syscalls.c:182] [recv] Making real syscall recvfrom
00:00:00.010455 [00:00:01.000000000] [shd-shim] [debug] [/home/jnewsome/projects/shadow/src/shim/shim.c:122] [_shim_load] starting main
00:00:00.010558 [00:00:01.000000000] [shd-shim] [debug] [/home/jnewsome/projects/shadow/src/shim/preload_syscalls.c:159] [clock_gettime] Making interposed syscall clock_gettime
00:00:00.010565 [00:00:01.000000000] [shd-shim] [debug] [/home/jnewsome/projects/shadow/src/shim/preload_syscalls.c:45] [_shadow_syscall_event] sending syscall event 228 on 7
00:00:00.010570 [00:00:01.000000000] [shd-shim] [debug] [/home/jnewsome/projects/shadow/src/shim/preload_syscalls.c:184] [send] Making real syscall sendto
00:00:00.010581 [00:00:01.000000000] [shd-shim] [debug] [/home/jnewsome/projects/shadow/src/shim/preload_syscalls.c:50] [_shadow_syscall_event] waiting for event on 7
00:00:00.010586 [00:00:01.000000000] [shd-shim] [debug] [/home/jnewsome/projects/shadow/src/shim/preload_syscalls.c:182] [recv] Making real syscall recvfrom
00:00:00.010775 [00:00:01.000000000] [shd-shim] [debug] [/home/jnewsome/projects/shadow/src/shim/preload_syscalls.c:53] [_shadow_syscall_event] got response of type 7 on 7
00:00:00.010801 [00:00:01.000000000] [shd-shim] [debug] [/home/jnewsome/projects/shadow/src/shim/preload_syscalls.c:176] [open_explicit] Making real syscall open
00:00:00.010820 [00:00:01.000000000] [shd-shim] [debug] [/home/jnewsome/projects/shadow/src/shim/preload_syscalls.c:160] [close] Making real syscall close
00:00:00.010835 [00:00:01.000000000] [shd-shim] [debug] [/home/jnewsome/projects/shadow/src/shim/preload_syscalls.c:184] [send] Making real syscall sendto
00:00:00.010845 [00:00:01.000000000] [shd-shim] [debug] [/home/jnewsome/projects/shadow/src/shim/preload_syscalls.c:50] [_shadow_syscall_event] waiting for event on 7
00:00:00.010850 [00:00:01.000000000] [shd-shim] [debug] [/home/jnewsome/projects/shadow/src/shim/preload_syscalls.c:182] [recv] Making real syscall recvfrom
00:00:00.010856 [00:00:01.000000000] [shd-shim] [debug] [/home/jnewsome/projects/shadow/src/shim/preload_syscalls.c:53] [_shadow_syscall_event] got response of type 6 on 7
00:00:00.010860 [00:00:01.000000000] [shd-shim] [debug] [/home/jnewsome/projects/shadow/src/shim/preload_syscalls.c:50] [_shadow_syscall_event] waiting for event on 7
00:00:00.010864 [00:00:01.000000000] [shd-shim] [debug] [/home/jnewsome/projects/shadow/src/shim/preload_syscalls.c:182] [recv] Making real syscall recvfrom
00:00:00.010869 [00:00:01.000000000] [shd-shim] [debug] [/home/jnewsome/projects/shadow/src/shim/preload_syscalls.c:53] [_shadow_syscall_event] got response of type 4 on 7
00:00:00.010874 [00:00:01.000000000] [shd-shim] [debug] [/home/jnewsome/projects/shadow/src/shim/preload_syscalls.c:174] [nanosleep] Making interposed syscall nanosleep
00:00:00.010894 [00:00:01.000000000] [shd-shim] [debug] [/home/jnewsome/projects/shadow/src/shim/preload_syscalls.c:45] [_shadow_syscall_event] sending syscall event 35 on 7
00:00:00.011160 [00:00:01.000000000] [shd-shim] [debug] [/home/jnewsome/projects/shadow/src/shim/preload_syscalls.c:184] [send] Making real syscall sendto
00:00:00.011174 [00:00:01.000000000] [shd-shim] [debug] [/home/jnewsome/projects/shadow/src/shim/preload_syscalls.c:50] [_shadow_syscall_event] waiting for event on 7
00:00:00.011182 [00:00:01.000000000] [shd-shim] [debug] [/home/jnewsome/projects/shadow/src/shim/preload_syscalls.c:182] [recv] Making real syscall recvfrom
00:00:00.011208 [00:00:01.000000000] [shd-shim] [debug] [/home/jnewsome/projects/shadow/src/shim/preload_syscalls.c:53] [_shadow_syscall_event] got response of type 5 on 7
00:00:00.011218 [00:00:01.000000000] [shd-shim] [debug] [/home/jnewsome/projects/shadow/src/shim/preload_syscalls.c:184] [send] Making real syscall sendto
00:00:00.011230 [00:00:01.000000000] [shd-shim] [debug] [/home/jnewsome/projects/shadow/src/shim/preload_syscalls.c:50] [_shadow_syscall_event] waiting for event on 7
00:00:00.011241 [00:00:01.000000000] [shd-shim] [debug] [/home/jnewsome/projects/shadow/src/shim/preload_syscalls.c:182] [recv] Making real syscall recvfrom
00:00:00.011369 [00:00:01.000000000] [shd-shim] [debug] [/home/jnewsome/projects/shadow/src/shim/preload_syscalls.c:53] [_shadow_syscall_event] got response of type 5 on 7
00:00:00.011380 [00:00:01.000000000] [shd-shim] [debug] [/home/jnewsome/projects/shadow/src/shim/preload_syscalls.c:184] [send] Making real syscall sendto
00:00:00.011394 [00:00:01.000000000] [shd-shim] [debug] [/home/jnewsome/projects/shadow/src/shim/preload_syscalls.c:50] [_shadow_syscall_event] waiting for event on 7
00:00:00.011402 [00:00:01.000000000] [shd-shim] [debug] [/home/jnewsome/projects/shadow/src/shim/preload_syscalls.c:182] [recv] Making real syscall recvfrom
00:00:00.011516 [00:00:01.000000000] [shd-shim] [debug] [/home/jnewsome/projects/shadow/src/shim/preload_syscalls.c:53] [_shadow_syscall_event] got response of type 6 on 7
00:00:00.011526 [00:00:01.000000000] [shd-shim] [debug] [/home/jnewsome/projects/shadow/src/shim/preload_syscalls.c:50] [_shadow_syscall_event] waiting for event on 7
00:00:00.011533 [00:00:01.000000000] [shd-shim] [debug] [/home/jnewsome/projects/shadow/src/shim/preload_syscalls.c:182] [recv] Making real syscall recvfrom
00:00:00.011541 [00:00:01.000000000] [shd-shim] [debug] [/home/jnewsome/projects/shadow/src/shim/preload_syscalls.c:53] [_shadow_syscall_event] got response of type 4 on 7
00:00:00.011548 [00:00:02.000000000] [shd-shim] [debug] [/home/jnewsome/projects/shadow/src/shim/preload_syscalls.c:159] [clock_gettime] Making interposed syscall clock_gettime
00:00:00.011555 [00:00:02.000000000] [shd-shim] [debug] [/home/jnewsome/projects/shadow/src/shim/preload_syscalls.c:45] [_shadow_syscall_event] sending syscall event 228 on 7
00:00:00.011561 [00:00:02.000000000] [shd-shim] [debug] [/home/jnewsome/projects/shadow/src/shim/preload_syscalls.c:184] [send] Making real syscall sendto
00:00:00.011576 [00:00:02.000000000] [shd-shim] [debug] [/home/jnewsome/projects/shadow/src/shim/preload_syscalls.c:50] [_shadow_syscall_event] waiting for event on 7
00:00:00.011583 [00:00:02.000000000] [shd-shim] [debug] [/home/jnewsome/projects/shadow/src/shim/preload_syscalls.c:182] [recv] Making real syscall recvfrom
00:00:00.011637 [00:00:02.000000000] [shd-shim] [debug] [/home/jnewsome/projects/shadow/src/shim/preload_syscalls.c:53] [_shadow_syscall_event] got response of type 7 on 7
00:00:00.011649 [00:00:02.000000000] [shd-shim] [debug] [/home/jnewsome/projects/shadow/src/shim/preload_syscalls.c:184] [send] Making real syscall sendto
00:00:00.011659 [00:00:02.000000000] [shd-shim] [debug] [/home/jnewsome/projects/shadow/src/shim/preload_syscalls.c:50] [_shadow_syscall_event] waiting for event on 7
00:00:00.011666 [00:00:02.000000000] [shd-shim] [debug] [/home/jnewsome/projects/shadow/src/shim/preload_syscalls.c:182] [recv] Making real syscall recvfrom
00:00:00.011675 [00:00:02.000000000] [shd-shim] [debug] [/home/jnewsome/projects/shadow/src/shim/preload_syscalls.c:53] [_shadow_syscall_event] got response of type 6 on 7
00:00:00.011680 [00:00:02.000000000] [shd-shim] [debug] [/home/jnewsome/projects/shadow/src/shim/preload_syscalls.c:50] [_shadow_syscall_event] waiting for event on 7
00:00:00.011684 [00:00:02.000000000] [shd-shim] [debug] [/home/jnewsome/projects/shadow/src/shim/preload_syscalls.c:182] [recv] Making real syscall recvfrom
00:00:00.011700 [00:00:02.000000000] [shd-shim] [debug] [/home/jnewsome/projects/shadow/src/shim/preload_syscalls.c:53] [_shadow_syscall_event] got response of type 4 on 7
00:00:04.973441 [00:00:02.000000000] [shd-shim] [debug] [/home/jnewsome/projects/shadow/src/shim/shim.c:135] [_shim_unload] sending stop event on 7
00:00:04.973462 [00:00:02.000000000] [shd-shim] [debug] [/home/jnewsome/projects/shadow/src/shim/preload_syscalls.c:184] [send] Making real syscall sendto
$ cat cpp.shadow-preload.data/hosts/testnode/testnode.testcpp.1000.stdout 
########## cpp test starting ##########
########## running test: _test_init()
hey 5
hey 0x562b626e2020 0x562b626e2020
test 2 4
test 0x562b626e2068 0x562b626e2068
local test 2 4
local test 0x7ffe86f62e80 0x7ffe86f62e80
########## running test: _test_endl()
Line 1
Line 2
Line 3
Line 1
Line 2
Line 3
Line 1
Line 2
Line 3

########## running test: _test_time()
########## cpp test passed! ##########
sporksmith commented 4 years ago

Unfortunately it seems whatever ctest uses to kill timed out tests doesn't generate core files, (despite running with ulimit -c unlimited).

One thing I noticed when trying to reproduce is that I could only do so while running multiple tests in parallel. e.g. I was able to run file-shadow-preload by itself 1000x successfully.

sporksmith commented 4 years ago

Hmm, actually I was able to reproduce in ptrace tests as well. (A previous attempt running all shadow tests didn't repro).

ctest -R shadow-ptrace -j8 --repeat-until-fail 1000 --output-on-failure --timeout 1
56% tests passed, 4 tests failed out of 9

Total Test time (real) = 151.35 sec

The following tests FAILED:
     14 - file-shadow-ptrace (Timeout)
     20 - shutdown-shadow-ptrace (Timeout)
     29 - udp-shadow-ptrace (Timeout)
     32 - unistd-shadow-ptrace (Timeout)
Errors while running CTest
sporksmith commented 4 years ago

ptrace test log looks similar:

failing_tests-ptrace.txt

file-shadow-ptrace:

00:00:00.017771 [thread-0] 00:00:01.000000000 [debug] [testnode~11.0.0.1] [syscall_handler.c:118] [_syscallhandler_post_syscall] SYSCALL_HANDLER_POST(testfile,pid=1000): syscall 3 close result: state=NATIVE code=0
00:00:00.017780 [thread-0] 00:00:01.000000000 [debug] [testnode~11.0.0.1] [thread_ptrace.c:462] [threadptrace_resume] THREAD_PTRACE_CHILD_STATE_SYSCALL_POST
00:00:00.017790 [thread-0] 00:00:01.000000000 [debug] [testnode~11.0.0.1] [thread_ptrace.c:438] [threadptrace_resume] THREAD_PTRACE_CHILD_STATE_SYSCALL_PRE
00:00:00.017793 [thread-0] 00:00:01.000000000 [debug] [testnode~11.0.0.1] [syscall_handler.c:183] [syscallhandler_make_syscall] native syscall 257 openat
<end of output>
Test time =   1.01 sec

udp-shadow-ptrace

00:00:00.030622 [thread-0] 00:00:01.000000000 [message] [localnode~11.0.0.1] [process.c:249] [_process_start] process 'localnode.testudp.1000' started in 0.017668 seconds
00:00:00.030630 [thread-0] 00:00:01.000000000 [message] [localnode~11.0.0.1] [process.c:175] [_process_check] process 'localnode.testudp.1000' has completed or is otherwise no longer running
00:00:00.030636 [thread-0] 00:00:01.000000000 [message] [localnode~11.0.0.1] [process.c:147] [_process_logReturnCode] main success code '0' for process 'localnode.testudp.1000'
00:00:00.030644 [thread-0] 00:00:01.000000000 [message] [localnode~11.0.0.1] [process.c:183] [_process_check] total runtime for process 'localnode.testudp.1000' was 0.017668 seconds
<end of output>
Test time =   4.60 sec
sporksmith commented 3 years ago

This is a bit stale; I think some of the related issues have been fixed. Closing in favor of #1252