tarantool / test-run

Tarantool functional testing framework
14 stars 15 forks source link

test-run left running subprocesses alive after exiting by timeout #382

Closed ligurio closed 1 year ago

ligurio commented 1 year ago

Steps to reproduce:

Expected result

test-run exits by timeout, because "--no-output-timeout" is less than test duration, and kills all subprocesses

Actual result

test-run exits by timeout and left luatest process alive

See log:

[1] ~/sources/MRG/tarantool$ cat test/box-luatest/sample_test.lua 
local t = require('luatest')
local g = t.group('test-run')

g.test_example_1 = function() os.execute("sleep 10") end

[1] ~/sources/MRG/tarantool$ luatest -c test/box-luatest/sample_test.lua 
Tarantool version is 2.11.0-rc2-0-g2ae0c94a2
Running with --shuffle group:6896
.
Ran 1 tests in 10.003 seconds, 1 succeeded, 0 failed

[0] ~/sources/MRG/tarantool$ ./test/test-run.py --debug --verbose --no-output-timeout=15 box-luatest/sample_test.lua                        

<snipped>

WORKR TEST                                            PARAMS          RESULT
---------------------------------------------------------------------------------
No output during 10 seconds. Will abort after 15 seconds without output. List of workers not reporting the status:
- 001_box-luatest [box-luatest/sample_test.lua, None] at /tmp/t/001_box-luatest/sample.result:0
No output during 20 seconds. Will abort after 15 seconds without output. List of workers not reporting the status:
- 001_box-luatest [box-luatest/sample_test.lua, None] at /tmp/t/001_box-luatest/sample.result:0
Test hung! Result content mismatch:
[File does not exist: box-luatest/sample.result]

[Main process] No output from workers. It seems that we hang. Send SIGKILL to workers; exiting...
---------------------------------------------------------------------------------
Top 10 tests by occupied memory (RSS, Mb):
*   19.8 box-luatest/sample_test.lua                                  

(Tests quicker than 0.1 seconds may be missed.)

---------------------------------------------------------------------------------
Top 10 longest tests (seconds):
---------------------------------------------------------------------------------
* undone: 1
[0] ~/sources/MRG/tarantool$ ps ax | grep tarantool
3238276 pts/6    Sl     0:00 tarantool /home/sergeyb/sources/MRG/tarantool/test-run/lib/luatest/bin/luatest -c -v test/box-luatest/sample_test.lua -o tap --shuffle none
3238301 pts/6    S+     0:00 grep tarantool

Bug was found on debugging etcd-client tests in https://github.com/tarantool/tarantool-ee/pull/444.

Totktonada commented 1 year ago

I looked over the luatest_server.py code and it seems the root of the problem is that it doesn't follow --test-timeout. We should add a timeout to the .communicate() call, when test-run is launched under Python 3 (compare: py2, py3).

(Another option is to use threading.Timer like in app_server.py.)

The same change has sense for unittest_server.py too.

How test-run handles different timeouts test-run has a carcade of increasing timeouts. * Timeouts for activities within a test. * Server stop timeout (hardcoded to 5 seconds after SIGTERM, then SIGKILL and wait for termination). Applicable for `tarantool` and `app` tests[^1], which start utility servers. * `--server-start-timeout` (90 seconds). Applicable for `tarantool` and `app` tests. * `--test-timeout` (110 seconds). Applicable for `tarantool`, `app` tests. * `--no-output-timeout` (120 seconds). It 'stops the world', so we can say that it is applicable for all test types. The idea here is to catch a problem as close as possible to its root. `--no-output-timeout` is a kind of heuristical -- if nothing occurs during two minutes, something is likely going wrong. This is the last resort timeout. It means that test-run is constructed with an idea that it should never occur. Hitting `--no-output-timeout` likely means a problem in test-run itself. In the given case the problem is lack of `--test-timeout` handling in `luatest_server.py`. Since the `--no-output-timeout` is the 'never occur' thing we just do what is possible -- kill the workers and exit with a report. It seems move valuable to fix the cause of hitting `--no-output-timeout` rather than working on improving its actions. However I have an idea here, which shouldn't be sophisticated: we can move each worker into its own process group/session and kill the entire group at hitting the timeout. However, it has its own consequences: shell's job control will not see the workers, so, for example, Ctrl+Z will not pause them. (It can be handled by re-sending a signal from the main process to the workers, though.) [^1]: There is a description of test-run's test types in https://github.com/tarantool/luatest/issues/191.
Totktonada commented 1 year ago

I proposed a fix in PR #384. Sergey, please, take a look.

Remember to add something like --test-timeout=10 to your reproducer (and change sleep 10 to, say, sleep 25).