test-run left running subprocesses alive after exiting by timeout

ligurio commented 1 year ago

Steps to reproduce:

create a test in luatest format (I suspect the same will happen with other supported tests)
our new test requires 25 sec for running
run test with luatest, everything is ok
run test with test-run and set timeout "--no-output-timeout" to 15 sec

Expected result

test-run exits by timeout, because "--no-output-timeout" is less than test duration, and kills all subprocesses

Actual result

test-run exits by timeout and left luatest process alive

See log:

[1] ~/sources/MRG/tarantool$ cat test/box-luatest/sample_test.lua 
local t = require('luatest')
local g = t.group('test-run')

g.test_example_1 = function() os.execute("sleep 10") end

[1] ~/sources/MRG/tarantool$ luatest -c test/box-luatest/sample_test.lua 
Tarantool version is 2.11.0-rc2-0-g2ae0c94a2
Running with --shuffle group:6896
.
Ran 1 tests in 10.003 seconds, 1 succeeded, 0 failed

[0] ~/sources/MRG/tarantool$ ./test/test-run.py --debug --verbose --no-output-timeout=15 box-luatest/sample_test.lua                        

<snipped>

WORKR TEST                                            PARAMS          RESULT
---------------------------------------------------------------------------------
No output during 10 seconds. Will abort after 15 seconds without output. List of workers not reporting the status:
- 001_box-luatest [box-luatest/sample_test.lua, None] at /tmp/t/001_box-luatest/sample.result:0
No output during 20 seconds. Will abort after 15 seconds without output. List of workers not reporting the status:
- 001_box-luatest [box-luatest/sample_test.lua, None] at /tmp/t/001_box-luatest/sample.result:0
Test hung! Result content mismatch:
[File does not exist: box-luatest/sample.result]

[Main process] No output from workers. It seems that we hang. Send SIGKILL to workers; exiting...
---------------------------------------------------------------------------------
Top 10 tests by occupied memory (RSS, Mb):
*   19.8 box-luatest/sample_test.lua                                  

(Tests quicker than 0.1 seconds may be missed.)

---------------------------------------------------------------------------------
Top 10 longest tests (seconds):
---------------------------------------------------------------------------------
* undone: 1
[0] ~/sources/MRG/tarantool$ ps ax | grep tarantool
3238276 pts/6    Sl     0:00 tarantool /home/sergeyb/sources/MRG/tarantool/test-run/lib/luatest/bin/luatest -c -v test/box-luatest/sample_test.lua -o tap --shuffle none
3238301 pts/6    S+     0:00 grep tarantool

Bug was found on debugging etcd-client tests in https://github.com/tarantool/tarantool-ee/pull/444.

Totktonada commented 1 year ago

I looked over the luatest_server.py code and it seems the root of the problem is that it doesn't follow --test-timeout. We should add a timeout to the .communicate() call, when test-run is launched under Python 3 (compare: py2, py3).

(Another option is to use threading.Timer like in app_server.py.)

The same change has sense for unittest_server.py too.

How test-run handles different timeouts

test-run has a carcade of increasing timeouts. * Timeouts for activities within a test. * Server stop timeout (hardcoded to 5 seconds after SIGTERM, then SIGKILL and wait for termination). Applicable for `tarantool` and `app` tests[^1], which start utility servers. * `--server-start-timeout` (90 seconds). Applicable for `tarantool` and `app` tests. * `--test-timeout` (110 seconds). Applicable for `tarantool`, `app` tests. * `--no-output-timeout` (120 seconds). It 'stops the world', so we can say that it is applicable for all test types. The idea here is to catch a problem as close as possible to its root. `--no-output-timeout` is a kind of heuristical -- if nothing occurs during two minutes, something is likely going wrong. This is the last resort timeout. It means that test-run is constructed with an idea that it should never occur. Hitting `--no-output-timeout` likely means a problem in test-run itself. In the given case the problem is lack of `--test-timeout` handling in `luatest_server.py`. Since the `--no-output-timeout` is the 'never occur' thing we just do what is possible -- kill the workers and exit with a report. It seems move valuable to fix the cause of hitting `--no-output-timeout` rather than working on improving its actions. However I have an idea here, which shouldn't be sophisticated: we can move each worker into its own process group/session and kill the entire group at hitting the timeout. However, it has its own consequences: shell's job control will not see the workers, so, for example, Ctrl+Z will not pause them. (It can be handled by re-sending a signal from the main process to the workers, though.) [^1]: There is a description of test-run's test types in https://github.com/tarantool/luatest/issues/191.

Totktonada commented 1 year ago

I proposed a fix in PR #384. Sergey, please, take a look.

Remember to add something like --test-timeout=10 to your reproducer (and change sleep 10 to, say, sleep 25).

tarantool / test-run