I guess that on native Linux, by the time `fork_testrun()`'s `waitpid()` returns and before `do_exit()->do_cleanup()` runs, SIGUSR1 has already been handled, so this isn't an issue there: the child surely does `kill(SIGUSR1)` and then exits.
With LibOS, `waitpid()` succeeds first and SIGUSR1 is handled in the parent process only later. I haven't tracked down yet why SIGUSR1 is delayed until after `wait4()`; I need to dig into the IPC helper.
I didn't get this part:

> `heartbeat_handler` tries to deref `results` and receives `SIGSEGV`.

What happens next? Does this LTP test fail? This is native, without any Graphene, right?
Generally, it looks like a bad idea to have any signal processing after `thread_or_process_exit()`. If we reached the point where we perform `terminate_ipc_helper()`, we are clearly "almost dead" and no signals should be allowed. This would solve this particular bug.
I'm not sure I follow Isaku's comment about `wait4()`; this seems to be some other bug which contributes to this one.
The channels for sending signals and for notifying about child exit are different, and there is no synchronization between them, so reordering can happen: the child sends `IPC_PID_KILL` for `kill(SIGUSR1)` and then `IPC_CLD_EXIT` for `exit()`, but the parent can process them in the reverse order.
I'm not sure how POSIX defines this corner case. I guess signal delivery is allowed to be delayed, but in practice such a delayed delivery isn't expected, I guess.
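To illustrate the reordering (hypothetical helper names like `ipc_send`, `pid_channel`, and `cld_channel`; not Graphene's actual IPC API):

```c
/* Hypothetical channel/message types for illustration only. */
typedef int ipc_channel_t;
enum ipc_msg { IPC_PID_KILL, IPC_CLD_EXIT };

extern ipc_channel_t pid_channel, cld_channel;
extern void ipc_send(ipc_channel_t ch, enum ipc_msg msg);

/* Child side: two causally ordered events travel over two different
 * channels, so the parent gets no ordering guarantee between them. */
void child_last_steps(void) {
    ipc_send(pid_channel, IPC_PID_KILL);  /* carries kill(parent, SIGUSR1) */
    ipc_send(cld_channel, IPC_CLD_EXIT);  /* carries the exit notification */
}

/* Parent side: the IPC helper polls both channels independently and may
 * read IPC_CLD_EXIT first; then wait4() returns and do_cleanup() runs
 * before the "late" IPC_PID_KILL finally delivers SIGUSR1. */
```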
I see. Thanks for the explanation!
Apart from any particular bug here, do you have a recommendation on how to detect subtle bugs which won't immediately cause the testsuite to fail but introduce flakiness?
We could run the testsuite multiple times, but currently this is not done, because it would take too much time (both real time and Jenkins' CPU time). Instead, maintainers sometimes feel like they should do it manually and just write "[J]enkins retest please" 5 times or so.
@woju Can we have a Jenkins job that runs every night (by night I mean some US time) for ~8 hours and reports how many tests failed? It should also show which PRs were merged since the previous run (the previous night). It won't be proactive, but at least retroactively we'll have a good understanding of whether some PRs introduced flakiness.
@dimakuv

> I didn't get this part:
>
> > `heartbeat_handler` tries to deref `results` and receives `SIGSEGV`.
>
> What happens next? Does this LTP test fail? This is native, without any Graphene, right?
No, I was only describing what happens when running on Graphene's 'Linux' host. The first part is what happens in LTP code and the second part what happens in LibOS code; of course, in reality they are interleaved, as can be seen in the GDB backtrace.
If LibOS didn't loop (I think trying to continue execution at the same instruction after a segfault is clearly a bug), then yes, that LTP testcase would crash. I don't know if this is a bug in LTP or if LibOS should never allow this ordering of waiting for the child and delivering SIGUSR1. I didn't test it, but since LTP was created to stress-test Linux, I assumed this bug isn't triggered on native Linux.
@woju:

> Apart from any particular bug here, do you have a recommendation on how to detect subtle bugs which won't immediately cause the testsuite to fail but introduce flakiness?
As @dimakuv already proposed, I think a nightly job (even weekly would help) that loops as often as reasonably possible is a good solution. This not only allows spotting regressions but also helps to assess flaky results on PRs (i.e., if test X failed for PR123 but succeeded on retry, you can look at the recent nightlies to assess whether this is likely related to the PR).
@dimakuv We could, but depending on how long the run takes, we would have trouble scheduling it so that it happens during the night in both our time zones.
We have 4 important pipelines ({16.04, 18.04} * {no sgx, sgx}, and pick one of {no debug, debug}) and 5 testsuite loops, each 30 min. Say it takes 5 hours on two concurrent Jenkins runners. So if we'd like this to be finished by, say, 09:00 in any time zone, the worst case is 09:00 CEST (UTC+2), so it would have to start at 02:00 UTC, which is, still in the worst case, 18:00 PST. Now, this worst case happens only for one week in autumn and two or three weeks in spring, but it is nonetheless a real problem. We could push this a few hours forward, so it would finish by, say, 11:00 CEST, but that would still be 20:00 PST, and I think I've seen you working at that time more than once.
NA standard -> summer: 2nd Sunday in March
EU standard -> summer: last Sunday in March
EU summer -> standard: last Sunday in October
NA summer -> standard: 1st Sunday in November
```
PST   PDT   AST   ADT   UTC   CET   CEST
16:00 17:00 20:00 21:00 00:00 01:00 02:00
17:00 18:00 21:00 22:00 01:00 02:00 03:00
18:00 19:00 22:00 23:00 02:00 03:00 04:00
19:00 20:00 23:00 00:00 03:00 04:00 05:00
20:00 21:00 00:00 01:00 04:00 05:00 06:00
21:00 22:00 01:00 02:00 05:00 06:00 07:00
22:00 23:00 02:00 03:00 06:00 07:00 08:00
23:00 00:00 03:00 04:00 07:00 08:00 09:00
00:00 01:00 04:00 05:00 08:00 09:00 10:00
01:00 02:00 05:00 06:00 09:00 10:00 11:00
02:00 03:00 06:00 07:00 10:00 11:00 12:00
03:00 04:00 07:00 08:00 11:00 12:00 13:00
04:00 05:00 08:00 09:00 12:00 13:00 14:00
05:00 06:00 09:00 10:00 13:00 14:00 15:00
06:00 07:00 10:00 11:00 14:00 15:00 16:00
07:00 08:00 11:00 12:00 15:00 16:00 17:00
08:00 09:00 12:00 13:00 16:00 17:00 18:00
09:00 10:00 13:00 14:00 17:00 18:00 19:00
10:00 11:00 14:00 15:00 18:00 19:00 20:00
11:00 12:00 15:00 16:00 19:00 20:00 21:00
12:00 13:00 16:00 17:00 20:00 21:00 22:00
13:00 14:00 17:00 18:00 21:00 22:00 23:00
14:00 15:00 18:00 19:00 22:00 23:00 00:00
15:00 16:00 19:00 20:00 23:00 00:00 01:00
```
Let's move the testing discussion into #1298.
#1218 addresses non-delivery of signals after exit, and the next respin will address avoiding nesting of fatal signals.
`tst_test.c` uses `SA_RESTART` in its `sigaction()` for `SIGUSR1`; if `waitpid()` returns `EINTR`, `SAFE_WAITPID` causes a test failure.
On the other hand, LibOS doesn't emulate `SA_RESTART` but simply ignores the flag.
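For reference, the pattern in question looks roughly like this (a simplified sketch, not LTP's exact code):

```c
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

static void heartbeat_handler(int sig) {
    (void)sig;  /* in LTP this updates the shared `results` page */
}

int main(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = heartbeat_handler;
    sa.sa_flags = SA_RESTART;  /* ask the kernel to restart slow syscalls */
    sigaction(SIGUSR1, &sa, NULL);

    if (fork() == 0) {
        kill(getppid(), SIGUSR1);
        _exit(0);
    }

    int status;
    /* With SA_RESTART honored (native Linux), a SIGUSR1 arriving during
     * waitpid() transparently restarts the call. If the flag is ignored,
     * waitpid() can fail with EINTR, which LTP's SAFE_WAITPID turns into
     * a test failure. */
    if (waitpid(-1, &status, 0) < 0 && errno == EINTR)
        fprintf(stderr, "waitpid() interrupted: SA_RESTART not honored\n");
    return 0;
}
```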
@yamahata:

> #1218 addresses non-delivery of signals after exit, and the next respin will address avoiding nesting of fatal signals.

Did you push your updated version somewhere? AFAICS you haven't pushed anything new to #1218.
Hmm, given that #1218 is already quite big, how about putting this into a separate PR? (If you need your changes from #1218, make that PR depend on #1218.)
I haven't updated #1218 yet. +1 for a new PR.
I would hope that the situation has improved over the last few months. However, I also see a few errors when I run LTP inside a Xen VM:
```
./runltp_xml.py -c ltp-bug-1248.cfg -c ltp.cfg /root/graphene.upstream/LibOS/shim/test/ltp//opt/ltp/runtest/syscalls > ltp.xml
2020-06-23 18:04:32,414 LTP.splice02: -> SKIP (invalid shell command)
2020-06-23 18:04:37,804 LTP.dup07: -> ERROR (must-pass is unneeded, remove it from config (FAILED=[] NOTFOUND=[] passed=[1, 2, 3] dontcare=[] skipped=[] returncode=0))
2020-06-23 18:04:38,923 LTP.dup202: -> ERROR (must-pass is unneeded, remove it from config (FAILED=[] NOTFOUND=[] passed=[1, 2, 3] dontcare=[] skipped=[] returncode=0))
2020-06-23 18:05:06,375 LTP.fcntl28: -> ERROR (must-pass is unneeded, remove it from config (FAILED=[] NOTFOUND=[] passed=[1] dontcare=[] skipped=[] returncode=0))
2020-06-23 18:05:06,416 LTP.fcntl28_64: -> ERROR (must-pass is unneeded, remove it from config (FAILED=[] NOTFOUND=[] passed=[1] dontcare=[] skipped=[] returncode=0))
2020-06-23 18:05:55,935 LTP.clock_nanosleep02: -> SKIP (all subtests skipped (FAILED=[] NOTFOUND=[] passed=[] dontcare=[1, 2, 3, 4, 5, 6, 7] skipped=[] returncode=0))
2020-06-23 18:05:55,940 LTP: LTP finished tests=952 failures=0 errors=4 skipped=895 returncode=4
Makefile:50: recipe for target 'ltp.xml' failed
make: *** [ltp.xml] Error 4
```
This report is quite consistent between runs. Each of the failing test cases easily passes when run individually, though.
It would be good to finally debug and fix LTP tests, but no time for this...
This Xen VM may be bad... With #1617 I now get this here on Fedora 31:
```
/contrib/conf_lint.py < ltp.cfg
./runltp_xml.py -c ltp-bug-1248.cfg -c ltp.cfg /home/stefanb/tmp/graphene.upstream/LibOS/shim/test/ltp//opt/ltp/runtest/syscalls > ltp.xml
2020-06-24 13:46:59,443 LTP.splice02: -> SKIP (invalid shell command)
2020-06-24 13:47:08,928 LTP.clock_nanosleep02: -> SKIP (all subtests skipped (FAILED=[] NOTFOUND=[] passed=[] dontcare=[1, 2, 3, 4, 5, 6, 7] skipped=[] returncode=0))
2020-06-24 13:47:35,913 LTP: LTP finished tests=951 failures=0 errors=0 skipped=894 returncode=0
```
All clean!
We haven't seen any of these timeouts in CI since merging #1949 almost two weeks ago, so it seems we can finally close this issue :partying_face:
With current master (d379bba600e18dbdcfb9544916a223ebc35ea5bd) LTP is flaky for me. And it seems I'm much more (un)lucky than Jenkins.
Besides the timeouts I describe here, I have seen some other failures at a much lower rate (`FAIL` of `epoll_wait02` and `select04`, and a timeout + Python exception for `LTP.kill12`), which I will ignore here. Typical output:
(The `SKIP`s are normal.) It seems not to be test-specific (statistics over 27 runs):
Reproducer (at least on my test machine): run a `make -j8` kernel build (the machine has 2 cores / 4 threads) and run `../../../../pal_loader setrlimit02` a few times. It will either exit cleanly or print `signal queue is full (TID = 1, SIG = 11)` as fast as it can.

My current understanding is that there are two parts here (this is puzzled together from log + gdb + reading source; there are likely some errors or missing details).

In LTP's test case handling (`LibOS/shim/test/apps/ltp/src/lib/tst_test.c`; see the sketch after this list):

- The child calls `heartbeat` at the end of the test run.
- `heartbeat` sends `SIGUSR1` to the parent.
- The parent runs `do_cleanup` (called from `do_exit`), which unmaps `results`.
- The parent then handles `SIGUSR1` and runs `heartbeat_handler`.
- `heartbeat_handler` tries to deref `results` and receives `SIGSEGV`.
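A minimal standalone sketch of that pattern (heavily simplified; this is not the actual `tst_test.c` code, just the shape of the bug):

```c
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

struct results { int heartbeats; };
static struct results* results;  /* stand-in for LTP's shared results page */

static void heartbeat_handler(int sig) {
    (void)sig;
    results->heartbeats++;  /* SIGSEGV if `results` was already unmapped */
}

int main(void) {
    results = mmap(NULL, sizeof(*results), PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    signal(SIGUSR1, heartbeat_handler);

    if (fork() == 0) {
        kill(getppid(), SIGUSR1);  /* child's "heartbeat" just before exiting */
        _exit(0);
    }

    waitpid(-1, NULL, 0);              /* parent reaps the child... */
    munmap(results, sizeof(*results)); /* ...and cleanup unmaps `results` */

    /* If SIGUSR1 is delivered only here (as observed under LibOS), the
     * handler dereferences the unmapped page and the parent segfaults. */
    return 0;
}
```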
LibOS handling (see the schematic after this list):

- In `terminate_ipc_helper` it calls `DkThreadDelayExecution`.
- In `DkThreadDelayExecution`, `__check_pending_event` sees the `SIGUSR1`.
- `resume_upcall` gets called.
- `resume_upcall` increases `preempt` to 1 and calls `__handle_signal`.
- `handle_one_signal` calls `heartbeat_handler`, which segfaults (see above).
- The `SIGSEGV` ends up in `deliver_signal`.
- `deliver_signal` increases `preempt` and, since it is `> 1` now, it queues the `SIGSEGV` and returns.
- `heartbeat_handler` segfaults again, and we repeat this loop forever; after a few rounds we start printing that the queue is full.
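In pseudo-C, my understanding of the loop is roughly this (a schematic pieced together from the function names above, not the actual LibOS code):

```c
/* Schematic only: `preempt` counts nested signal deliveries. */
static int preempt;

extern void __handle_signal(void);  /* ends up running heartbeat_handler */
extern void queue_signal(int sig);

void resume_upcall(void) {
    preempt++;          /* preempt == 1 */
    __handle_signal();  /* heartbeat_handler derefs unmapped memory -> SIGSEGV */
}

void deliver_signal(int sig) {  /* entered when heartbeat_handler segfaults */
    preempt++;
    if (preempt > 1) {          /* nested delivery */
        queue_signal(sig);      /* SIGSEGV is only queued, never acted upon */
        return;                 /* resume at the same faulting instruction */
    }
    /* ... normal delivery path ... */
}

/* heartbeat_handler faults again, deliver_signal queues another SIGSEGV,
 * and so on forever; once the queue fills up we print
 * "signal queue is full (TID = 1, SIG = 11)" in a tight loop. */
```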
GDB backtrace (since GDB didn't like it when running the binary directly, this is after attaching to the already-looping process):

```
(gdb) bt
#0  char_write (handle=
```

So there are 2 questions: