gramineproject / graphene

Graphene / Graphene-SGX - a library OS for Linux multi-process applications, with Intel SGX support
https://grapheneproject.io
GNU Lesser General Public License v3.0

LTP tests flaky (timeouts) #1297

Closed HW42 closed 3 years ago

HW42 commented 4 years ago

With current master (d379bba600e18dbdcfb9544916a223ebc35ea5bd) LTP is flaky for me. And it seems I'm much more (un)lucky than Jenkins.

Besides the timeouts I describe here, I have seen some other failures at a much lower rate (FAIL of epoll_wait02 and select04, and timeout + Python exception for LTP.kill12), which I will ignore here.

Typical output:

./runltp_xml.py -c ltp-bug-1248.cfg -c ltp.cfg /home/user/tmp/work1/LibOS/shim/test/apps/ltp//opt/ltp/runtest/syscalls > ltp.xml
2020-02-04 15:33:32,597 LTP.splice02: -> SKIP (invalid shell command)
2020-02-04 15:33:53,502 LTP.clock_nanosleep02: -> SKIP (all subtests skipped (FAILED=[] NOTFOUND=[] passed=[] dontcare=[] skipped=[1, 2, 3, 4, 5, 6, 7] returncode=1))
2020-02-04 15:34:27,790 LTP.writev01: cannot extract stdio on python < 3.7
2020-02-04 15:34:27,790 LTP.writev01: -> ERROR (Timed out after 30.0 s.)
2020-02-04 15:34:57,200 LTP.fcntl04_64: cannot extract stdio on python < 3.7
2020-02-04 15:34:57,200 LTP.fcntl04_64: -> ERROR (Timed out after 30.0 s.)
2020-02-04 15:34:57,740 LTP.accept01: cannot extract stdio on python < 3.7
2020-02-04 15:34:57,740 LTP.accept01: -> ERROR (Timed out after 30.0 s.)
2020-02-04 15:35:12,529 LTP.setrlimit02: cannot extract stdio on python < 3.7
2020-02-04 15:35:12,529 LTP.setrlimit02: -> ERROR (Timed out after 30.0 s.)
2020-02-04 15:35:40,763 LTP.getcwd01: cannot extract stdio on python < 3.7
2020-02-04 15:35:40,764 LTP.getcwd01: -> ERROR (Timed out after 30.0 s.)
2020-02-04 15:35:40,928 LTP.getcwd02: cannot extract stdio on python < 3.7
2020-02-04 15:35:40,928 LTP.getcwd02: -> ERROR (Timed out after 30.0 s.)
2020-02-04 15:35:40,943 LTP: LTP finished tests=972 failures=0 errors=6 skipped=904 returncode=6
Makefile:48: recipe for target 'ltp.xml' failed
make: *** [ltp.xml] Error 6

(The SKIPs are normal)

It does not seem to be test-specific (statistics over 27 runs):

      9 LTP.setrlimit02: -> ERROR (Timed out after 30.0 s.)
      8 LTP.write05: -> ERROR (Timed out after 40.0 s.)
      8 LTP.setsockopt03: -> ERROR (Timed out after 30.0 s.)
      8 LTP.fcntl03: -> ERROR (Timed out after 30.0 s.)
      7 LTP.open08: -> ERROR (Timed out after 30.0 s.)
      7 LTP.fchmod01: -> ERROR (Timed out after 30.0 s.)
      6 LTP.uname04: -> ERROR (Timed out after 30.0 s.)
      6 LTP.pipe01: -> ERROR (Timed out after 30.0 s.)
      6 LTP.getcwd02: -> ERROR (Timed out after 30.0 s.)
      6 LTP.clock_gettime01: -> ERROR (Timed out after 30.0 s.)
      5 LTP.writev01: -> ERROR (Timed out after 30.0 s.)
      5 LTP.setrlimit03: -> ERROR (Timed out after 30.0 s.)
      5 LTP.rmdir01: -> ERROR (Timed out after 30.0 s.)
      5 LTP.pwrite02: -> ERROR (Timed out after 30.0 s.)
      5 LTP.prctl02: -> ERROR (Timed out after 30.0 s.)
      5 LTP.poll01: -> ERROR (Timed out after 30.0 s.)
      5 LTP.fcntl04_64: -> ERROR (Timed out after 30.0 s.)
      5 LTP.fcntl02_64: -> ERROR (Timed out after 40.0 s.)
      5 LTP.epoll_ctl02: -> ERROR (Timed out after 30.0 s.)
      5 LTP.epoll_create1_01: -> ERROR (Timed out after 30.0 s.)
      5 LTP.clock_nanosleep02: -> ERROR (Timed out after 30.0 s.)
      5 LTP.clock_getres01: -> ERROR (Timed out after 30.0 s.)
      5 LTP.brk01: -> ERROR (Timed out after 40.0 s.)
      5 LTP.bind03: -> ERROR (Timed out after 30.0 s.)
      4 LTP.socketpair02: -> ERROR (Timed out after 30.0 s.)
      4 LTP.socket02: -> ERROR (Timed out after 40.0 s.)
      4 LTP.socket01: -> ERROR (Timed out after 40.0 s.)
      4 LTP.flock06: -> ERROR (Timed out after 30.0 s.)
      4 LTP.fcntl02: -> ERROR (Timed out after 30.0 s.)
      4 LTP.clock_gettime02: -> ERROR (Timed out after 30.0 s.)
      4 LTP.alarm02: -> ERROR (Timed out after 40.0 s.)
      4 LTP.accept01: -> ERROR (Timed out after 30.0 s.)
      3 LTP.socketpair01: -> ERROR (Timed out after 30.0 s.)
      3 LTP.pselect03_64: -> ERROR (Timed out after 30.0 s.)
      3 LTP.lseek01: -> ERROR (Timed out after 30.0 s.)
      3 LTP.getcwd01: -> ERROR (Timed out after 30.0 s.)
      3 LTP.fcntl04: -> ERROR (Timed out after 30.0 s.)
      3 LTP.creat03: -> ERROR (Timed out after 30.0 s.)
      2 LTP.syscall01: -> ERROR (Timed out after 80.0 s.)
      2 LTP.setuid01: -> ERROR (Timed out after 40.0 s.)
      2 LTP.sendto02: -> ERROR (Timed out after 30.0 s.)
      2 LTP.sched_getaffinity01: -> ERROR (Timed out after 30.0 s.)
      2 LTP.read01: -> ERROR (Timed out after 40.0 s.)
      2 LTP.pwrite02_64: -> ERROR (Timed out after 30.0 s.)
      2 LTP.kill12: -> ERROR (Timed out after 30.0 s.)
      2 LTP.fcntl03_64: -> ERROR (Timed out after 30.0 s.)
      2 LTP.accept4_01: -> ERROR (Timed out after 30.0 s.)
      1 LTP.write03: -> ERROR (Timed out after 40.0 s.)
      1 LTP.stat01: -> ERROR (Timed out after 30.0 s.)
      1 LTP.lseek07: -> ERROR (Timed out after 30.0 s.)
      1 LTP.epoll_ctl01: -> ERROR (Timed out after 30.0 s.)
      1 LTP.creat01: -> ERROR (Timed out after 30.0 s.)

Reproducer (at least on my test machine):

It will either exit cleanly or print `signal queue is full (TID = 1, SIG = 11)` as fast as it can. My current understanding is that there are two parts here (this is pieced together from logs + gdb + reading the source; there are likely some errors or missing details):

In LTP's test case handling (LibOS/shim/test/apps/ltp/src/lib/tst_test.c):

LibOS handling:

GDB backtrace

Since GDB didn't like it when running the binary directly, this is after attaching to the already looping process:

```
(gdb) bt
#0  char_write (handle=, offset=, size=41, buffer=0x7f32e5736444 ) at db_devices.c:213
#1  0x00007f32e57397fb in _DkStreamWrite (handle=handle@entry=0x7f32e59568d0 , offset=, count=, buf=, addr=, addrlen=) at db_streams.c:327
#2  0x00007f32e5739883 in DkStreamWrite (handle=0x7f32e59568d0 , offset=, count=, buffer=, dest=) at db_streams.c:345
#3  0x00007f32e5438bdd in sys_fputs (f=0x7f32e59568d0 , str=0x7f32e5736444 "signal queue is full (TID = 1, SIG = 11)\n", len=41) at utils/printf.c:135
#4  0x00007f32e5438c40 in sys_fputch (f=0x7f32e59568d0 , ch=10, b=0x0) at utils/printf.c:144
#5  0x00007f32e54c79f1 in vfprintfmt (_fputch=0x7f32e5438be0 , f=, putdat=, fmt=0x7f32e54caafa "", ap=0x54f3cc336178) at stdlib/printfmt.c:93
#6  0x00007f32e5438cea in sys_vfprintf (hdl=0x7f32e59568d0 , fmt=0x7f32e54caad0 "signal queue is full (TID = %u, SIG = %d)\n", ap=0x54f3cc336178) at utils/printf.c:156
#7  0x00007f32e5438d93 in handle_printf (hdl=0x7f32e59568d0 , fmt=0x7f32e54caad0 "signal queue is full (TID = %u, SIG = %d)\n") at utils/printf.c:162
#8  0x00007f32e54358ae in deliver_signal (info=0x54f3cc336690, context=0x54f3cc3367b0) at bookkeep/shim_signal.c:183
#9  0x00007f32e5435eb9 in memfault_upcall (event=0x0, arg=93406033698832, context=0x54f3cc3367b0) at bookkeep/shim_signal.c:304
#10 0x00007f32e5746a69 in _DkGenericEventTrigger (uc=0x54f3cc3368c0, arg=93406033698832, upcall=0x7f32e5435c0a ) at db_exception.c:190
#11 _DkGenericSignalHandle (event_num=, info=, uc=0x54f3cc3368c0) at db_exception.c:209
#12 0x00007f32e5746c73 in _DkGenericSighandler (signum=, info=, uc=) at db_exception.c:253
#13
#14 0x00007f32e52069ab in heartbeat_handler (sig=10) at tst_test.c:1047
#15 0x00007f32e5436d34 in __handle_one_signal (tcb=0x7f32e5954010 , sig=10, signal=0x7f32e51f4f70) at bookkeep/shim_signal.c:686
#16 0x00007f32e5436e98 in __handle_signal (tcb=0x7f32e5954010 , sig=10) at bookkeep/shim_signal.c:720
#17 0x00007f32e54367d4 in resume_upcall (event=0x0, arg=0, context=0x0) at bookkeep/shim_signal.c:575
#18 0x00007f32e5746a19 in _DkGenericEventTrigger (uc=0x0, arg=0, upcall=0x7f32e5436768 ) at db_exception.c:183
#19 _DkGenericSignalHandle (event_num=, info=info@entry=0x0, uc=uc@entry=0x0) at db_exception.c:209
#20 0x00007f32e5746ee4 in __check_pending_event () at db_exception.c:319
#21 0x00007f32e573a5d4 in DkThreadDelayExecution (duration=18446744073709551606) at db_threading.c:61
#22 0x00007f32e545552c in terminate_ipc_helper () at ipc/shim_ipc_helper.c:927
#23 0x00007f32e54a8ec7 in thread_or_process_exit (error_code=0, term_signal=0) at sys/shim_exit.c:161
#24 0x00007f32e54a9012 in shim_do_exit_group (error_code=0) at sys/shim_exit.c:211
#25 0x00007f32e548bbfb in __shim_exit_group (__arg1=0) at shim_syscalls.c:802
#26 0x00007f32e549fe9c in syscalldb () at syscallas.S:86
#27 0x000054f3c981c383 in execve () at ../sysdeps/unix/syscall-template.S:78
#28 0x000054f3c97d0ad9 in _IO_file_setbuf_mmap (fp=0xa, p=0x7f32e51f4f70 "\n", len=139856569126904) at fileops.c:408
#29 0x000054f3c97d4813 in _IO_unbuffer_all () at genops.c:887
#30 _IO_cleanup () at genops.c:930
#31 0x000054f3c9790ea2 in __on_exit (func=0xa, arg=0x7f32e51f4f70) at on_exit.c:28
#32 0x0000000000000000 in ?? ()
(gdb)
```

So there are 2 questions:

yamahata commented 4 years ago

I guess that on native Linux, SIGUSR1 is already handled before fork_testrun() issues waitpid() and before do_exit()->do_cleanup(), so it isn't an issue there. The child process simply calls kill(SIGUSR1) and then exits.

With LibOS, waitpid() succeeds first and SIGUSR1 is handled later in the parent process. I haven't tracked down yet why SIGUSR1 is delayed until after wait4(); I need to dig into the IPC helper.
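
To make the race concrete, here is a minimal, self-contained sketch of the pattern being discussed. The names `results` and `heartbeat_handler` are borrowed from the comments above; the real tst_test.c is more involved, so treat this purely as an illustration, not as LTP's actual code:

```c
#include <signal.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Shared "results" area, normally set up by the test library. */
static struct { int passed; } *results;

/* Simplified stand-in for LTP's heartbeat handler: it touches `results`. */
static void heartbeat_handler(int sig) {
    (void)sig;
    results->passed++;  /* faults if `results` has already been unmapped */
}

int main(void) {
    results = mmap(NULL, sizeof(*results), PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);

    struct sigaction sa = { .sa_handler = heartbeat_handler, .sa_flags = SA_RESTART };
    sigaction(SIGUSR1, &sa, NULL);

    pid_t child = fork();
    if (child == 0) {
        kill(getppid(), SIGUSR1);   /* "heartbeat" to the parent ...      */
        _exit(0);                   /* ... immediately followed by exit   */
    }

    waitpid(child, NULL, 0);            /* parent sees the child exit ...        */
    munmap(results, sizeof(*results));  /* ... and tears down the results area   */

    /* If SIGUSR1 is only delivered at this point (as observed under Graphene's
     * LibOS), heartbeat_handler() dereferences an unmapped pointer and faults. */
    return 0;
}
```

On native Linux the signal is normally delivered while the parent is still blocked in waitpid(), so the handler runs before the teardown; the problematic case is when delivery slips past the munmap().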

dimakuv commented 4 years ago

I didn't get this part:

  • heartbeat_handler tries to dereference results and receives SIGSEGV.

What happens next? Does this LTP test fail? This is native, without any Graphene, right?

dimakuv commented 4 years ago

Generally, it looks like a bad idea to have any signal processing after thread_or_process_exit(). If we reached the point where we perform terminate_ipc_helper(), we are clearly "almost dead" and no signals should be allowed. This would solve this particular bug.
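
One possible shape of that guard, as a minimal sketch only: the exit path raises a flag before terminate_ipc_helper(), and the signal-delivery path refuses to run user handlers once the flag is set. The names below (`process_exiting`, `mark_process_exiting`, `may_deliver_signal`) are hypothetical and not existing Graphene identifiers:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical guard; not actual Graphene code. */
static atomic_bool process_exiting = false;

/* To be called at the start of the exit path, before terminate_ipc_helper(). */
void mark_process_exiting(void) {
    atomic_store(&process_exiting, true);
}

/* To be checked on the signal-delivery path: once we are "almost dead",
 * stop invoking user signal handlers during teardown. */
bool may_deliver_signal(void) {
    return !atomic_load(&process_exiting);
}
```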

I'm not sure I follow Isaku's comment about wait4(); that seems to be some other bug which contributes to this one.

yamahata commented 4 years ago

The channels for sending a signal and for notifying about a child's exit are different, and there is no synchronization between them.

So reordering can happen: the child sends IPC_PID_KILL for kill(SIGUSR1) and then IPC_CLD_EXIT for exit(), but the parent can process them in the reverse order.

I'm not sure how POSIX defines such a corner case. I guess delayed signal delivery is allowed, but in practice such a delay isn't expected.
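
The reordering is easy to demonstrate outside of Graphene: with two independent channels and nothing enforcing an order, what the receiver observes first depends only on which channel it happens to service first. A minimal sketch with ordinary pipes standing in for the two IPC channels (the message names are taken from the comment above; this is not Graphene's IPC code):

```c
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int sig_chan[2], exit_chan[2];   /* stand-ins for the two IPC channels */
    pipe(sig_chan);
    pipe(exit_chan);

    if (fork() == 0) {
        /* Child: send the "signal" message first, then the "exit" message. */
        write(sig_chan[1], "IPC_PID_KILL", 12);
        write(exit_chan[1], "IPC_CLD_EXIT", 12);
        _exit(0);
    }

    /* Parent: nothing forces it to service sig_chan first. If it reads
     * exit_chan before sig_chan, it observes the exit notification before
     * the signal, i.e. the messages appear reordered. */
    char buf[16];
    ssize_t n = read(exit_chan[0], buf, sizeof(buf) - 1);
    buf[n > 0 ? n : 0] = '\0';
    printf("first observed: %s\n", buf);

    n = read(sig_chan[0], buf, sizeof(buf) - 1);
    buf[n > 0 ? n : 0] = '\0';
    printf("then observed:  %s\n", buf);

    wait(NULL);
    return 0;
}
```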

dimakuv commented 4 years ago

I see. Thanks for the explanation!

woju commented 4 years ago

Apart from any particular bug there, do you have a recommendation on how to detect subtle bugs that won't immediately cause the testsuite to fail but do introduce flakiness?

We could run the testsuite multiple times, but currently this is not done, because this would take too much time (both real time and Jenkins' CPU time). Instead, maintainers sometimes feel like they should do it manually and just write "[J]enkins retest please" 5 times or so.

dimakuv commented 4 years ago

@woju Can we have a Jenkins job that runs every night (by night I mean some US time) for ~8 hours and reports how many tests failed? It should also show which PRs were merged since the previous run (the previous night). It won't be proactive, but at least retroactively we'll have a good understanding of whether some PRs introduced flakiness.

HW42 commented 4 years ago

@dimakuv

I didn't get this part:

  • heartbeat_handler tries to dereference results and receives SIGSEGV.

What happens next? Does this LTP test fail? This is native, without any Graphene, right?

No, I'm only describing what happens when running on Graphene's 'Linux' host. The first part is what happens in the LTP code and the second part what happens in the LibOS code; of course, in reality they are interleaved, as can be seen in the GDB backtrace.

If LibOS didn't loop (I think trying to continue execution at the same instruction after a segfault is clearly a bug), then yes, that LTP test case would crash. I don't know whether this is a bug in LTP or whether LibOS should never allow this ordering of waiting for the child and delivering SIGUSR1. I didn't test it, but since LTP was created to stress-test Linux, I assume this bug isn't triggered on native Linux.
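
For context on why the process ends up spinning: on Linux, returning from a SIGSEGV handler normally restarts the faulting instruction, so a handler that only logs and returns (which appears to be what happens here once the signal queue is full) faults again immediately and loops forever. A minimal native demonstration, unrelated to Graphene itself:

```c
#include <signal.h>
#include <unistd.h>

static void segv_handler(int sig) {
    (void)sig;
    /* Only log and return: the kernel then re-executes the faulting
     * instruction, which faults again, so this handler runs endlessly. */
    static const char msg[] = "SIGSEGV, retrying the same instruction\n";
    write(STDERR_FILENO, msg, sizeof(msg) - 1);
    sleep(1);  /* slow the loop down so the output stays readable */
}

int main(void) {
    struct sigaction sa = { .sa_handler = segv_handler };
    sigaction(SIGSEGV, &sa, NULL);

    volatile int *p = NULL;
    return *p;  /* fault -> handler -> retry -> fault -> ... */
}
```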

HW42 commented 4 years ago

@woju:

Apart from any particular bug there, do you have recommendation on how to detect subtle bugs, which won't immediately cause the testsuite to fail, but introduce flakiness?

As @dimakuv already proposed, I think a nightly job (even a weekly one would help) that loops as often as reasonably possible is a good solution. This not only allows spotting regressions but also helps to assess flaky results on PRs (i.e., if test X failed for PR123 but succeeded on retry, you can look at the recent nightlies to assess whether the failure is likely to be related to the PR).

woju commented 4 years ago

@dimakuv We could, but depending on how long the run takes, we would have trouble scheduling it so that it happens during the night in both our time zones.

We have 4 important pipelines ({16.04, 18.04} * {no sgx, sgx}, picking one of {no debug, debug}) and 5 testsuite loops of 30 min each. Say it takes 5 hours on two concurrent Jenkins runners. If we'd like this to be finished by, say, 09:00 in either time zone, the worst case is 09:00 CEST (UTC+2), so it would have to start at 02:00 UTC, which is, still in the worst case, 18:00 PST. Now, this worst case happens only for one week in autumn and two or three weeks in spring, but it is nonetheless a real problem. We could push it a few hours forward so it would finish by, say, 11:00 CEST, but that would still be 20:00 PST, and I think I've seen you working at that time more than once.

My time zone cheat sheet
NA standard -> summer: 2nd Sunday in March
EU standard -> summer: last Sunday in March
EU summer -> standard: last Sunday in October
NA summer -> standard: 1st Sunday in November
PST     PDT     AST     ADT     UTC     CET     CEST
16:00   17:00   20:00   21:00   00:00   01:00   02:00
17:00   18:00   21:00   22:00   01:00   02:00   03:00
18:00   19:00   22:00   23:00   02:00   03:00   04:00
19:00   20:00   23:00   00:00   03:00   04:00   05:00
20:00   21:00   00:00   01:00   04:00   05:00   06:00
21:00   22:00   01:00   02:00   05:00   06:00   07:00
22:00   23:00   02:00   03:00   06:00   07:00   08:00
23:00   00:00   03:00   04:00   07:00   08:00   09:00
00:00   01:00   04:00   05:00   08:00   09:00   10:00
01:00   02:00   05:00   06:00   09:00   10:00   11:00
02:00   03:00   06:00   07:00   10:00   11:00   12:00
03:00   04:00   07:00   08:00   11:00   12:00   13:00
04:00   05:00   08:00   09:00   12:00   13:00   14:00
05:00   06:00   09:00   10:00   13:00   14:00   15:00
06:00   07:00   10:00   11:00   14:00   15:00   16:00
07:00   08:00   11:00   12:00   15:00   16:00   17:00
08:00   09:00   12:00   13:00   16:00   17:00   18:00
09:00   10:00   13:00   14:00   17:00   18:00   19:00
10:00   11:00   14:00   15:00   18:00   19:00   20:00
11:00   12:00   15:00   16:00   19:00   20:00   21:00
12:00   13:00   16:00   17:00   20:00   21:00   22:00
13:00   14:00   17:00   18:00   21:00   22:00   23:00
14:00   15:00   18:00   19:00   22:00   23:00   00:00
15:00   16:00   19:00   20:00   23:00   00:00   01:00
HW42 commented 4 years ago

Let's move the testing discussion into #1298.

yamahata commented 4 years ago

#1218 addresses non-delivery of signals after exit,

and in the next respin it will also address avoiding nesting of fatal signals.

yamahata commented 4 years ago

tst_test.c uses SA_RESTART in its sigaction for SIGUSR1. If waitpid returns EINTR, SAFE_WAITPID causes a test failure.

On the other hand, LibOS doesn't emulate SA_RESTART but ignores the flag.
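
To make the SA_RESTART difference concrete: with the flag honoured, a waitpid() interrupted by the SIGUSR1 handler is transparently restarted by the kernel; if the flag is ignored, waitpid() instead returns -1 with errno == EINTR, which LTP's SAFE_WAITPID treats as a test failure. A small native illustration (not Graphene or LTP code; toggle sa_flags to see both behaviours):

```c
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static void noop_handler(int sig) { (void)sig; }

int main(void) {
    /* With SA_RESTART the interrupted waitpid() is restarted by the kernel;
     * with sa_flags = 0 it returns -1 with errno == EINTR instead. */
    struct sigaction sa = { .sa_handler = noop_handler, .sa_flags = SA_RESTART };
    sigaction(SIGUSR1, &sa, NULL);

    pid_t child = fork();
    if (child == 0) {
        sleep(2);                 /* keep the parent blocked in waitpid() */
        _exit(0);
    }

    pid_t self = getpid();
    if (fork() == 0) {            /* helper: interrupt the parent's waitpid() */
        sleep(1);
        kill(self, SIGUSR1);
        _exit(0);
    }

    pid_t ret = waitpid(child, NULL, 0);
    if (ret == -1 && errno == EINTR)
        printf("waitpid() interrupted (EINTR) -- SA_RESTART not honoured\n");
    else
        printf("waitpid() returned %d -- call was restarted transparently\n", (int)ret);

    while (wait(NULL) > 0) {}     /* reap remaining children */
    return 0;
}
```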

HW42 commented 4 years ago

@yamahata:

#1218 addresses non-delivery of signals after exit,

and in the next respin it will also address avoiding nesting of fatal signals.

Did you push your updated version somewhere? AFAICS you haven't pushed anything new to #1218.

HW42 commented 4 years ago

Hmm, given that #1218 is already quite big, how about putting this into a separate PR? (If you need your changes from #1218, make that PR depend on #1218)

yamahata commented 4 years ago

I haven't updated #1218 yet. +1 for new PR.

stefanberger commented 4 years ago

I would hope that the situation has improved over the last few months; however, I still see a few errors when I run LTP inside a Xen VM:

./runltp_xml.py -c ltp-bug-1248.cfg -c ltp.cfg /root/graphene.upstream/LibOS/shim/test/ltp//opt/ltp/runtest/syscalls > ltp.xml
2020-06-23 18:04:32,414 LTP.splice02: -> SKIP (invalid shell command)
2020-06-23 18:04:37,804 LTP.dup07: -> ERROR (must-pass is unneeded, remove it from config (FAILED=[] NOTFOUND=[] passed=[1, 2, 3] dontcare=[] skipped=[] returncode=0))
2020-06-23 18:04:38,923 LTP.dup202: -> ERROR (must-pass is unneeded, remove it from config (FAILED=[] NOTFOUND=[] passed=[1, 2, 3] dontcare=[] skipped=[] returncode=0))
2020-06-23 18:05:06,375 LTP.fcntl28: -> ERROR (must-pass is unneeded, remove it from config (FAILED=[] NOTFOUND=[] passed=[1] dontcare=[] skipped=[] returncode=0))
2020-06-23 18:05:06,416 LTP.fcntl28_64: -> ERROR (must-pass is unneeded, remove it from config (FAILED=[] NOTFOUND=[] passed=[1] dontcare=[] skipped=[] returncode=0))
2020-06-23 18:05:55,935 LTP.clock_nanosleep02: -> SKIP (all subtests skipped (FAILED=[] NOTFOUND=[] passed=[] dontcare=[1, 2, 3, 4, 5, 6, 7] skipped=[] returncode=0))
2020-06-23 18:05:55,940 LTP: LTP finished tests=952 failures=0 errors=4 skipped=895 returncode=4
Makefile:50: recipe for target 'ltp.xml' failed
make: *** [ltp.xml] Error 4

This report is quite consistent between runs. Each of the failing test cases, when run individually, easily passes, though.

dimakuv commented 4 years ago

It would be good to finally debug and fix LTP tests, but no time for this...

stefanberger commented 4 years ago

This Xen VM may be bad... With #1617 I now get this here on Fedora 31:

/contrib/conf_lint.py < ltp.cfg
./runltp_xml.py -c ltp-bug-1248.cfg -c ltp.cfg /home/stefanb/tmp/graphene.upstream/LibOS/shim/test/ltp//opt/ltp/runtest/syscalls > ltp.xml
2020-06-24 13:46:59,443 LTP.splice02: -> SKIP (invalid shell command)
2020-06-24 13:47:08,928 LTP.clock_nanosleep02: -> SKIP (all subtests skipped (FAILED=[] NOTFOUND=[] passed=[] dontcare=[1, 2, 3, 4, 5, 6, 7] skipped=[] returncode=0))
2020-06-24 13:47:35,913 LTP: LTP finished tests=951 failures=0 errors=0 skipped=894 returncode=0

All clean!

mkow commented 3 years ago

We haven't seen any of these timeouts in CI since merging #1949 almost two weeks ago, so it seems we can finally close this issue :partying_face: