pytest hangs while running tests

telles-simbiose commented 7 years ago

Hello everyone, Last night I let my test suite running until this morning, but I noticed that it hasn't finished running all tests, looking at htop, I noticed some strange processes running for a really long time, as shown in this screenshot:

Looking at the tests output, I saw that the last ran tests were all ran by the same worker gw2 (there were 4 workers running), as there were 3 processes import sys;exec(eval(sys.stdin.readline())) running for 13+ hours, I think that those 3 workers were just stuck somehow.

RonnyPfannschmidt commented 7 years ago

those processes are indeed xdist slaves, however with the amount of information you provided we are entirely unable to help

Daenyth commented 7 years ago

I don't know if it's related but I've seen parallel runs with xdist hang for a long time on my Jenkins with the most recent output being scheduling tests via LoadScheduling

telles-simbiose commented 7 years ago

@RonnyPfannschmidt , what kind of information should I provide? It happened only once, I was not able to reproduce it a second time

RonnyPfannschmidt commented 7 years ago

@telles-simbiose all involved packages and versions, also it would help if we could take a look at the testsuite

deadlocks triggered by rare race conditions are not uncommon in distributed systems, and xdist running slaves is a distributed system

pseudotensor commented 6 years ago

We commonly (about 1/20 runs) get this issue on jenkins and local machines.

15:56:10 ============================= test session starts ==============================
15:56:10 platform linux -- Python 3.6.4, pytest-3.3.2, py-1.5.2, pluggy-0.6.0 -- /****_env/bin/python3.6
15:56:10 cachedir: .cache
15:56:10 rootdir: /opt/****, inifile: setup.cfg
15:56:10 plugins: xdist-1.22.0, forked-0.2, cov-2.4.0
15:56:10 gw0 I / gw1 I / gw2 I / gw3 I / gw4 I / gw5 I / gw6 I / gw7 I / gw8 I / gw9 I / gw10 I / gw11 I / gw12 I / gw13 I / gw14 I / gw15 I / gw16 I / gw17 I / gw18 I / gw19 I / gw20 I / gw21 I / gw22 I / gw23 I / gw24 I / gw25 I / gw26 I / gw27 I / gw28 I / gw29 I / gw30 I / gw31 I / gw32 I / gw33 I / gw34 I / gw35 I / gw36 I / gw37 I / gw38 I / gw39 I
15:56:10 
[gw0] linux Python 3.6.4 cwd: /opt/****
15:56:11 
[gw1] linux Python 3.6.4 cwd: /opt/****
15:56:11 
[gw2] linux Python 3.6.4 cwd: /opt/****
15:56:11 
[gw3] linux Python 3.6.4 cwd: /opt/****
15:56:11 
[gw4] linux Python 3.6.4 cwd: /opt/****
15:56:11 
[gw5] linux Python 3.6.4 cwd: /opt/****
15:56:11 
[gw6] linux Python 3.6.4 cwd: /opt/****
15:56:11 
[gw7] linux Python 3.6.4 cwd: /opt/****
15:56:11 
[gw8] linux Python 3.6.4 cwd: /opt/****
15:56:11 
[gw9] linux Python 3.6.4 cwd: /opt/****
15:56:11 
[gw10] linux Python 3.6.4 cwd: /opt/****
15:56:11 
[gw11] linux Python 3.6.4 cwd: /opt/****
15:56:11 
[gw12] linux Python 3.6.4 cwd: /opt/****
15:56:12 
[gw13] linux Python 3.6.4 cwd: /opt/****
15:56:12 
[gw14] linux Python 3.6.4 cwd: /opt/****
15:56:12 
[gw15] linux Python 3.6.4 cwd: /opt/****
15:56:12 
[gw16] linux Python 3.6.4 cwd: /opt/****
15:56:12 
[gw17] linux Python 3.6.4 cwd: /opt/****
15:56:12 
[gw18] linux Python 3.6.4 cwd: /opt/****
15:56:12 
[gw19] linux Python 3.6.4 cwd: /opt/****
15:56:13 
[gw20] linux Python 3.6.4 cwd: /opt/****
15:56:13 
[gw21] linux Python 3.6.4 cwd: /opt/****
15:56:13 
[gw22] linux Python 3.6.4 cwd: /opt/****
15:56:13 
[gw23] linux Python 3.6.4 cwd: /opt/****
15:56:14 
[gw24] linux Python 3.6.4 cwd: /opt/****
15:56:14 
[gw25] linux Python 3.6.4 cwd: /opt/****
15:56:14 
[gw26] linux Python 3.6.4 cwd: /opt/****
15:56:14 
[gw27] linux Python 3.6.4 cwd: /opt/****
15:56:15 
[gw28] linux Python 3.6.4 cwd: /opt/****
15:56:15 
[gw29] linux Python 3.6.4 cwd: /opt/****
15:56:15 
[gw30] linux Python 3.6.4 cwd: /opt/****
15:56:15 
[gw31] linux Python 3.6.4 cwd: /opt/****
15:56:15 
[gw32] linux Python 3.6.4 cwd: /opt/****
15:56:16 
[gw33] linux Python 3.6.4 cwd: /opt/****
15:56:17 
[gw34] linux Python 3.6.4 cwd: /opt/****
15:56:17 
[gw35] linux Python 3.6.4 cwd: /opt/****
15:56:17 
[gw36] linux Python 3.6.4 cwd: /opt/****
15:56:18 
[gw37] linux Python 3.6.4 cwd: /opt/****
15:56:18 
[gw38] linux Python 3.6.4 cwd: /opt/****
15:56:19 
[gw39] linux Python 3.6.4 cwd: /opt/****
15:56:19 
[gw0] Python 3.6.4 (default, Mar  9 2018, 23:26:36)  -- [GCC 5.4.0 20160609]
15:56:19 
[gw1] Python 3.6.4 (default, Mar  9 2018, 23:26:36)  -- [GCC 5.4.0 20160609]
15:56:19 
[gw2] Python 3.6.4 (default, Mar  9 2018, 23:26:36)  -- [GCC 5.4.0 20160609]
15:56:19 
[gw3] Python 3.6.4 (default, Mar  9 2018, 23:26:36)  -- [GCC 5.4.0 20160609]
15:56:19 
[gw4] Python 3.6.4 (default, Mar  9 2018, 23:26:36)  -- [GCC 5.4.0 20160609]
15:56:19 
[gw5] Python 3.6.4 (default, Mar  9 2018, 23:26:36)  -- [GCC 5.4.0 20160609]
15:56:19 
[gw6] Python 3.6.4 (default, Mar  9 2018, 23:26:36)  -- [GCC 5.4.0 20160609]
15:56:19 
[gw8] Python 3.6.4 (default, Mar  9 2018, 23:26:36)  -- [GCC 5.4.0 20160609]
15:56:19 
[gw7] Python 3.6.4 (default, Mar  9 2018, 23:26:36)  -- [GCC 5.4.0 20160609]
15:56:19 
[gw9] Python 3.6.4 (default, Mar  9 2018, 23:26:36)  -- [GCC 5.4.0 20160609]
15:56:19 
[gw10] Python 3.6.4 (default, Mar  9 2018, 23:26:36)  -- [GCC 5.4.0 20160609]
15:56:19 
[gw11] Python 3.6.4 (default, Mar  9 2018, 23:26:36)  -- [GCC 5.4.0 20160609]
15:56:19 
[gw12] Python 3.6.4 (default, Mar  9 2018, 23:26:36)  -- [GCC 5.4.0 20160609]
15:56:19 
[gw13] Python 3.6.4 (default, Mar  9 2018, 23:26:36)  -- [GCC 5.4.0 20160609]
15:56:19 
[gw14] Python 3.6.4 (default, Mar  9 2018, 23:26:36)  -- [GCC 5.4.0 20160609]
15:56:19 
[gw15] Python 3.6.4 (default, Mar  9 2018, 23:26:36)  -- [GCC 5.4.0 20160609]
15:56:19 
[gw17] Python 3.6.4 (default, Mar  9 2018, 23:26:36)  -- [GCC 5.4.0 20160609]
15:56:19 
[gw16] Python 3.6.4 (default, Mar  9 2018, 23:26:36)  -- [GCC 5.4.0 20160609]
15:56:19 
[gw18] Python 3.6.4 (default, Mar  9 2018, 23:26:36)  -- [GCC 5.4.0 20160609]
15:56:19 
[gw19] Python 3.6.4 (default, Mar  9 2018, 23:26:36)  -- [GCC 5.4.0 20160609]
15:56:19 
[gw20] Python 3.6.4 (default, Mar  9 2018, 23:26:36)  -- [GCC 5.4.0 20160609]
15:56:19 
[gw21] Python 3.6.4 (default, Mar  9 2018, 23:26:36)  -- [GCC 5.4.0 20160609]
15:56:19 
[gw23] Python 3.6.4 (default, Mar  9 2018, 23:26:36)  -- [GCC 5.4.0 20160609]
15:56:19 
[gw22] Python 3.6.4 (default, Mar  9 2018, 23:26:36)  -- [GCC 5.4.0 20160609]
15:56:19 
[gw24] Python 3.6.4 (default, Mar  9 2018, 23:26:36)  -- [GCC 5.4.0 20160609]
15:56:19 
[gw25] Python 3.6.4 (default, Mar  9 2018, 23:26:36)  -- [GCC 5.4.0 20160609]
15:56:19 
[gw27] Python 3.6.4 (default, Mar  9 2018, 23:26:36)  -- [GCC 5.4.0 20160609]
15:56:19 
[gw26] Python 3.6.4 (default, Mar  9 2018, 23:26:36)  -- [GCC 5.4.0 20160609]
15:56:19 
[gw28] Python 3.6.4 (default, Mar  9 2018, 23:26:36)  -- [GCC 5.4.0 20160609]
15:56:19 
[gw29] Python 3.6.4 (default, Mar  9 2018, 23:26:36)  -- [GCC 5.4.0 20160609]
15:56:19 
[gw30] Python 3.6.4 (default, Mar  9 2018, 23:26:36)  -- [GCC 5.4.0 20160609]
15:56:19 
[gw31] Python 3.6.4 (default, Mar  9 2018, 23:26:36)  -- [GCC 5.4.0 20160609]
15:56:19 
[gw32] Python 3.6.4 (default, Mar  9 2018, 23:26:36)  -- [GCC 5.4.0 20160609]
15:56:19 
[gw33] Python 3.6.4 (default, Mar  9 2018, 23:26:36)  -- [GCC 5.4.0 20160609]
15:56:19 
[gw34] Python 3.6.4 (default, Mar  9 2018, 23:26:36)  -- [GCC 5.4.0 20160609]
15:56:19 
[gw35] Python 3.6.4 (default, Mar  9 2018, 23:26:36)  -- [GCC 5.4.0 20160609]
15:56:19 
[gw36] Python 3.6.4 (default, Mar  9 2018, 23:26:36)  -- [GCC 5.4.0 20160609]
15:56:19 
[gw37] Python 3.6.4 (default, Mar  9 2018, 23:26:36)  -- [GCC 5.4.0 20160609]
15:56:19 
[gw38] Python 3.6.4 (default, Mar  9 2018, 23:26:36)  -- [GCC 5.4.0 20160609]
15:56:20 
[gw39] Python 3.6.4 (default, Mar  9 2018, 23:26:36)  -- [GCC 5.4.0 20160609]
17:09:46 Cancelling nested steps due to timeout
17:09:46 Sending interrupt signal to process
17:09:49 Makefile:240: recipe for target 'test_integration2' failed
17:09:49 make: *** [test_integration2] Terminated
17:09:49 Terminated
[Pipeline] }
[Pipeline] // timeout
[Pipeline] sh
17:09:49 [dai-tests-2gpu_dev-NUMSKB2Q3XX56UBUAOGHD33FZR7UOQYGSSWMHKM3RXLMS6ITB5GQ] Running shell script
17:09:49 + echo Jenkins job ABORTED due to exception, non-zero errorcode, or Jenkins timeout: script returned exit code 143
[Pipeline] sh
17:09:49 [tests-2gpu_dev-NUMSKB2Q3XX56UBUAOGHD33FZR7UOQYGSSWMHKM3RXLMS6ITB5GQ] Running shell script
17:09:49 + mkdir -p build/test-reports
17:09:49 + echo <?xml version="1.0" encoding="utf-8"?>
17:09:49 <testsuite errors="0" failures="1" name="pytest" skips="0" tests="0" time="0">
17:09:49 <testcase classname="tests" file="tests" name="tests" time="0"><failure message="JOB WAS KILLED BECAUSE Exception">script returned exit code 143</failure></testcase>
17:09:49 </testsuite>

enkins  22406  0.0  0.0   1148     4 ?        Ss   01:30   0:00      |   \_ /dev/init -- ./run.sh /opt/h2oai make test
jenkins   2536  0.0  0.0  19716  3156 ?        S    01:32   0:00      |           \_ /bin/bash -c pytest --color=yes --durations=10 -s -v --fulltrace --full-trace --junit-xml=build/test-reports/unit-test_test_models.xml -n auto tests/test_models 2> ./tmp/test_models.29966_2018.03.05-09:32:12
jenkins   2537  0.0  0.0 3063628 29844 ?       Sl   01:32   0:01      |               \_ /h2oai_env/bin/python3.6 /env/bin/pytest --color=yes --durations=10 -s -v --fulltrace --full-trace --junit-xml=build/test-reports/unit-test_test_models.xml -n auto tests/test_models
jenkins   2541  0.2  0.1 3606224 161340 ?      Sl   01:32   0:09      |                   \_ /h2oai_env/bin/python3.6 -u -c import sys;exec(eval(sys.stdin.readline()))
jenkins   2544  0.2  0.1 3604924 159784 ?      Sl   01:32   0:10      |                   \_ /h2oai_env/bin/python3.6 -u -c import sys;exec(eval(sys.stdin.readline()))
jenkins   2547  0.1  0.1 3604840 159948 ?      Sl   01:32   0:08      |                   \_ /h2oai_env/bin/python3.6 -u -c import sys;exec(eval(sys.stdin.readline()))
jenkins   2551  0.1  0.1 3604924 159984 ?      Sl   01:32   0:09      |                   \_ /h2oai_env/bin/python3.6 -u -c import sys;exec(eval(sys.stdin.readline()))
jenkins   2555  0.1  0.1 3604864 160076 ?      Sl   01:32   0:08      |                   \_ /h2oai_env/bin/python3.6 -u -c import sys;exec(eval(sys.stdin.readline()))
jenkins   2559  0.1  0.1 3604928 159904 ?      Sl   01:32   0:09      |                   \_ /h2oai_env/bin/python3.6 -u -c import sys;exec(eval(sys.stdin.readline()))
jenkins   2567  0.1  0.1 3604928 159876 ?      Sl   01:32   0:09      |                   \_ /h2oai_env/bin/python3.6 -u -c import sys;exec(eval(sys.stdin.readline()))
jenkins   2574  0.1  0.1 3604928 160152 ?      Sl   01:32   0:09      |                   \_ /h2oai_env/bin/python3.6 -u -c import sys;exec(eval(sys.stdin.readline()))
jenkins   2580  0.2  0.1 3604852 160100 ?      Sl   01:32   0:10      |                   \_ /h2oai_env/bin/python3.6 -u -c import sys;exec(eval(sys.stdin.readline()))
jenkins   2636  0.1  0.1 3604928 159868 ?      Sl   01:32   0:08      |                   \_ /h2oai_env/bin/python3.6 -u -c import sys;exec(eval(sys.stdin.readline()))
jenkins   2669  0.1  0.1 3604832 160048 ?      Sl   01:32   0:07      |                   \_ /h2oai_env/bin/python3.6 -u -c import sys;exec(eval(sys.stdin.readline()))
jenkins   2735  0.1  0.1 3604844 159968 ?      Sl   01:32   0:07      |                   \_ /h2oai_env/bin/python3.6 -u -c import sys;exec(eval(sys.stdin.readline()))
jenkins   2780  0.1  0.1 3651120 160128 ?      Sl   01:32   0:06      |                   \_ /h2oai_env/bin/python3.6 -u -c import sys;exec(eval(sys.stdin.readline()))
jenkins   2825  0.1  0.1 3604872 159976 ?      Sl   01:32   0:06      |                   \_ /h2oai_env/bin/python3.6 -u -c import sys;exec(eval(sys.stdin.readline()))
jenkins   2871  0.1  0.1 3604832 160176 ?      Sl   01:32   0:07      |                   \_ /h2oai_env/bin/python3.6 -u -c import sys;exec(eval(sys.stdin.readline()))
jenkins   2914  0.1  0.1 3604840 159736 ?      Sl   01:32   0:08      |                   \_ /h2oai_env/bin/python3.6 -u -c import sys;exec(eval(sys.stdin.readline()))
jenkins   2959  0.1  0.1 3604848 159996 ?      Sl   01:32   0:07      |                   \_ /h2oai_env/bin/python3.6 -u -c import sys;exec(eval(sys.stdin.readline()))
jenkins   3006  0.1  0.1 3604920 159820 ?      Sl   01:32   0:06      |                   \_ /h2oai_env/bin/python3.6 -u -c import sys;exec(eval(sys.stdin.readline()))
jenkins   3012  0.1  0.1 3604852 159868 ?      Sl   01:32   0:08      |                   \_ /h2oai_env/bin/python3.6 -u -c import sys;exec(eval(sys.stdin.readline()))
jenkins   3059  0.1  0.1 3604852 160052 ?      Sl   01:32   0:08      |                   \_ /h2oai_env/bin/python3.6 -u -c import sys;exec(eval(sys.stdin.readline()))
jenkins   3104  0.1  0.1 3604940 160652 ?      Sl   01:32   0:08      |                   \_ /h2oai_env/bin/python3.6 -u -c import sys;exec(eval(sys.stdin.readline()))
jenkins   3186  0.1  0.1 3604836 160056 ?      Sl   01:32   0:07      |                   \_ /h2oai_env/bin/python3.6 -u -c import sys;exec(eval(sys.stdin.readline()))
jenkins   3284  0.1  0.1 3604824 160316 ?      Sl   01:32   0:06      |                   \_ /h2oai_env/bin/python3.6 -u -c import sys;exec(eval(sys.stdin.readline()))
jenkins   3472  0.1  0.1 3604876 160136 ?      Sl   01:32   0:07      |                   \_ /h2oai_env/bin/python3.6 -u -c import sys;exec(eval(sys.stdin.readline()))
jenkins   3494  0.1  0.1 3604852 159680 ?      Sl   01:32   0:05      |                   \_ /h2oai_env/bin/python3.6 -u -c import sys;exec(eval(sys.stdin.readline()))
jenkins   3623  0.1  0.0 3382212 122060 ?      Sl   01:32   0:07      |                   \_ /h2oai_env/bin/python3.6 -u -c import sys;exec(eval(sys.stdin.readline()))
jenkins   7939  0.0  0.0 3382232 75236 ?       Sl   01:32   0:00      |                   |   \_ /h2oai_env/bin/python3.6 -u -c import sys;exec(eval(sys.stdin.readline()))
jenkins   7962  0.0  0.0 3388608 76856 ?       S    01:32   0:00      |                   |       \_ /h2oai_env/bin/python3.6 -u -c import sys;exec(eval(sys.stdin.readline()))
jenkins   3890  0.1  0.1 3604884 160320 ?      Sl   01:32   0:08      |                   \_ /h2oai_env/bin/python3.6 -u -c import sys;exec(eval(sys.stdin.readline()))
jenkins   4064  0.1  0.1 3604936 160388 ?      Sl   01:32   0:06      |                   \_ /h2oai_env/bin/python3.6 -u -c import sys;exec(eval(sys.stdin.readline()))
jenkins   4162  0.1  0.1 3604884 160260 ?      Sl   01:32   0:07      |                   \_ /h2oai_env/bin/python3.6 -u -c import sys;exec(eval(sys.stdin.readline()))
jenkins   4253  0.1  0.1 3604840 160472 ?      Sl   01:32   0:07      |                   \_ /h2oai_env/bin/python3.6 -u -c import sys;exec(eval(sys.stdin.readline()))
jenkins   4485  0.1  0.1 3604840 159932 ?      Sl   01:32   0:06      |                   \_ /h2oai_env/bin/python3.6 -u -c import sys;exec(eval(sys.stdin.readline()))
jenkins   4879  0.1  0.1 3604868 160344 ?      Sl   01:32   0:06      |                   \_ /h2oai_env/bin/python3.6 -u -c import sys;exec(eval(sys.stdin.readline()))
jenkins   4914  0.1  0.1 3604836 160148 ?      Sl   01:32   0:08      |                   \_ /h2oai_env/bin/python3.6 -u -c import sys;exec(eval(sys.stdin.readline()))
jenkins   5151  0.1  0.1 3604836 160108 ?      Sl   01:32   0:09      |                   \_ /h2oai_env/bin/python3.6 -u -c import sys;exec(eval(sys.stdin.readline()))
jenkins   5505  0.1  0.1 3604928 160212 ?      Sl   01:32   0:07      |                   \_ /h2oai_env/bin/python3.6 -u -c import sys;exec(eval(sys.stdin.readline()))
jenkins   5728  0.1  0.1 3604924 160408 ?      Sl   01:32   0:08      |                   \_ /h2oai_env/bin/python3.6 -u -c import sys;exec(eval(sys.stdin.readline()))
jenkins   6014  0.1  0.1 3604964 160328 ?      Sl   01:32   0:05      |                   \_ /h2oai_env/bin/python3.6 -u -c import sys;exec(eval(sys.stdin.readline()))
jenkins   6205  0.1  0.1 3604836 160404 ?      Sl   01:32   0:07      |                   \_ /h2oai_env/bin/python3.6 -u -c import sys;exec(eval(sys.stdin.readline()))
jenkins   6330  0.1  0.1 3604928 160224 ?      Sl   01:32   0:07      |                   \_ /h2oai_env/bin/python3.6 -u -c import sys;exec(eval(sys.stdin.readline()))
jenkins   6497  0.1  0.1 3604848 159980 ?      Sl   01:32   0:09      |                   \_ /env/bin/python3.6 -u -c import sys;exec(eval(sys.stdin.readline()))

s1113950 commented 6 years ago

We see a similar issue here:

py2-e2e runtests: commands[0] | py.test -vv -n 8 tests/e2e
============================= test session starts ==============================
platform linux2 -- Python 2.7.12, pytest-3.7.0, py-1.5.4, pluggy-0.7.1 -- /usr/share/mlt/.venv/bin/python2
cachedir: .pytest_cache
rootdir: /usr/share/mlt, inifile: tox.ini
plugins: xdist-1.22.5, forked-0.2, cov-2.5.1
[gw0] linux2 Python 2.7.12 cwd: /usr/share/mlt               
[gw1] linux2 Python 2.7.12 cwd: /usr/share/mlt               
[gw2] linux2 Python 2.7.12 cwd: /usr/share/mlt               
[gw3] linux2 Python 2.7.12 cwd: /usr/share/mlt               
[gw4] linux2 Python 2.7.12 cwd: /usr/share/mlt               
[gw5] linux2 Python 2.7.12 cwd: /usr/share/mlt               
[gw6] linux2 Python 2.7.12 cwd: /usr/share/mlt               
[gw7] linux2 Python 2.7.12 cwd: /usr/share/mlt               
[gw0] Python 2.7.12 (default, Dec  4 2017, 14:50:18)  -- [GCC 5.4.0 20160609]
[gw2] Python 2.7.12 (default, Dec  4 2017, 14:50:18)  -- [GCC 5.4.0 20160609]
[gw1] Python 2.7.12 (default, Dec  4 2017, 14:50:18)  -- [GCC 5.4.0 20160609]
[gw4] Python 2.7.12 (default, Dec  4 2017, 14:50:18)  -- [GCC 5.4.0 20160609]
[gw3] Python 2.7.12 (default, Dec  4 2017, 14:50:18)  -- [GCC 5.4.0 20160609]
[gw7] Python 2.7.12 (default, Dec  4 2017, 14:50:18)  -- [GCC 5.4.0 20160609]
[gw5] Python 2.7.12 (default, Dec  4 2017, 14:50:18)  -- [GCC 5.4.0 20160609]
[gw6] Python 2.7.12 (default, Dec  4 2017, 14:50:18)  -- [GCC 5.4.0 20160609]   
gw0 [20] / gw1 [20] / gw2 [20] / gw3 [20] / gw4 [20] / gw5 [20] / gw6 [20] / gw7 [20]
scheduling tests via LoadScheduling

tests/e2e/test_config_updates.py::TestConfig::test_add_remove_config 
tests/e2e/test_config_updates.py::TestConfig::test_update_config 
tests/e2e/test_deploy_flow.py::TestDeployFlow::test_deploying_templates[pytorch] 
tests/e2e/test_deploy_flow.py::TestDeployFlow::test_deploying_templates[hello-world] 
tests/e2e/test_deploy_flow.py::TestDeployFlow::test_deploying_templates[experiments] 
tests/e2e/test_deploy_flow.py::TestDeployFlow::test_deploying_templates[pytorch-distributed] 
tests/e2e/test_deploy_flow.py::TestDeployFlow::test_deploying_templates[tf-distributed] 
tests/e2e/test_config_updates.py::TestConfig::test_config_list 
[gw6] [  5%] PASSED tests/e2e/test_config_updates.py::TestConfig::test_config_list 
tests/e2e/test_deploy_flow.py::TestDeployFlow::test_deploying_templates[tf-dist-mnist] 
[gw4] [ 10%] PASSED tests/e2e/test_config_updates.py::TestConfig::test_add_remove_config 
tests/e2e/test_deploy_flow.py::TestDeployFlow::test_deploying_templates[horovod] 
[gw3] [ 15%] PASSED tests/e2e/test_config_updates.py::TestConfig::test_update_config 
tests/e2e/test_deploy_flow.py::TestDeployFlow::test_deploying_templates[tensorboard] 
[gw1] [ 20%] PASSED tests/e2e/test_deploy_flow.py::TestDeployFlow::test_deploying_templates[hello-world] 
[gw5] [ 25%] PASSED tests/e2e/test_deploy_flow.py::TestDeployFlow::test_deploying_templates[tf-distributed] 
tests/e2e/test_deploy_flow.py::TestDeployFlow::test_deploy_enable_sync 
[gw3] [ 30%] PASSED tests/e2e/test_deploy_flow.py::TestDeployFlow::test_deploying_templates[tensorboard] 
[gw5] [ 35%] PASSED tests/e2e/test_deploy_flow.py::TestDeployFlow::test_deploy_enable_sync 
[gw7] [ 40%] PASSED tests/e2e/test_deploy_flow.py::TestDeployFlow::test_deploying_templates[pytorch] 
[gw2] [ 45%] PASSED tests/e2e/test_deploy_flow.py::TestDeployFlow::test_deploying_templates[experiments] 
[gw0] [ 50%] PASSED tests/e2e/test_deploy_flow.py::TestDeployFlow::test_deploying_templates[pytorch-distributed] 
[gw4] [ 55%] PASSED tests/e2e/test_deploy_flow.py::TestDeployFlow::test_deploying_templates[horovod] 
[gw6] [ 60%] PASSED tests/e2e/test_deploy_flow.py::TestDeployFlow::test_deploying_templates[tf-dist-mnist] 
tests/e2e/test_deploy_flow.py::TestDeployFlow::test_debug_wrapper 
tests/e2e/test_templates.py::TestTemplates::test_templates 
tests/e2e/test_deploy_flow.py::TestDeployFlow::test_interactive_deploy[hello-world] 
tests/e2e/test_templates.py::TestTemplates::test_local_templates 
tests/e2e/test_deploy_flow.py::TestDeployFlow::test_interactive_deploy[tf-distributed] 
tests/e2e/test_deploy_flow.py::TestDeployFlow::test_deploy_check_logs 
tests/e2e/test_deploy_flow.py::TestDeployFlow::test_no_push_deploy 
tests/e2e/test_deploy_flow.py::TestDeployFlow::test_watch_build_and_deploy_no_push 
[gw3] [ 65%] PASSED tests/e2e/test_templates.py::TestTemplates::test_templates 
[gw0] [ 70%] PASSED tests/e2e/test_deploy_flow.py::TestDeployFlow::test_interactive_deploy[tf-distributed] 
[gw7] [ 75%] PASSED tests/e2e/test_deploy_flow.py::TestDeployFlow::test_deploy_check_logs 
[gw5] [ 80%] PASSED tests/e2e/test_templates.py::TestTemplates::test_local_templates 
[gw2] [ 85%] PASSED tests/e2e/test_deploy_flow.py::TestDeployFlow::test_interactive_deploy[hello-world] 
[gw6] [ 90%] PASSED tests/e2e/test_deploy_flow.py::TestDeployFlow::test_watch_build_and_deploy_no_push 
[gw1] [ 95%] PASSED tests/e2e/test_deploy_flow.py::TestDeployFlow::test_no_push_deploy 
[gw4] [100%] PASSED tests/e2e/test_deploy_flow.py::TestDeployFlow::test_debug_wrapper Makefile:177: recipe for target 'test-e2e-all-circleci' failed
make: *** [test-e2e-all-circleci] Terminated
Too long with no output (exceeded 20m0s)

Where tests seem to finish but then still hang. In this specific case, python2 tests seemed to finish but hung and then python3 tests never started (we use tox to trigger tests).

BlackHobbiT commented 4 years ago

Yeah, suffered from this issue sometimes. --fulltrace provide lock here platform linux -- Python 3.6.1, pytest-5.0.1, py-1.5.3, pluggy-0.12.0 -- /usr/bin/python3.6 cachedir: .pytest_cache rootdir: /home/BlackHobbiT/path/to/test inifile: pytest.ini plugins: forked-1.0.2, xdist-1.29.0, allure-pytest-2.7.0 [gw0] linux Python 3.6.1 cwd: /home/BlackHobbiT/path/to/test [gw0] Python 3.6.1 (default, Sep 7 2017, 16:36:03) -- [GCC 6.3.0 20170406]


config = <_pytest.config.Config object at 0x7f1cac7e4da0>, doit = <function _main at 0x7f1caccb0f28>

    def wrap_session(config, doit):
        """Skeleton command line program"""
        session = Session(config)
        session.exitstatus = ExitCode.OK
        initstate = 0
        try:
            try:
                config._do_configure()
                initstate = 1
                config.hook.pytest_sessionstart(session=session)
                initstate = 2
>               session.exitstatus = doit(config, session) or 0

/usr/local/lib/python3.6/dist-packages/_pytest/main.py:213: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

config = <_pytest.config.Config object at 0x7f1cac7e4da0>, session = <Session session exitstatus=<ExitCode.OK: 0> testsfailed=0 testscollected=16>

    def _main(config, session):
        """ default command line protocol for initialization, session,
        running tests and reporting. """
        config.hook.pytest_collection(session=session)
>       config.hook.pytest_runtestloop(session=session)

/usr/local/lib/python3.6/dist-packages/_pytest/main.py:257: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <_HookCaller 'pytest_runtestloop'>, args = (), kwargs = {'session': <Session session exitstatus=<ExitCode.OK: 0> testsfailed=0 testscollected=16>}
notincall = set()

    def __call__(self, *args, **kwargs):
        if args:
            raise TypeError("hook calling supports only keyword arguments")
        assert not self.is_historic()
        if self.spec and self.spec.argnames:
            notincall = (
                set(self.spec.argnames) - set(["__multicall__"]) - set(kwargs.keys())
            )
            if notincall:
                warnings.warn(
                    "Argument(s) {} which are declared in the hookspec "
                    "can not be found in this hook call".format(tuple(notincall)),
                    stacklevel=2,
                )
>       return self._hookexec(self, self.get_hookimpls(), kwargs)

/usr/local/lib/python3.6/dist-packages/pluggy/hooks.py:289: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <_pytest.config.PytestPluginManager object at 0x7f1caee2ab00>, hook = <_HookCaller 'pytest_runtestloop'>
methods = [<HookImpl plugin_name='main', plugin=<module '_pytest.main' from '/usr/local/lib/python3.6/dist-packages/_pytest/main...1cab5f7f98>>, <HookImpl plugin_name='logging-plugin', plugin=<_pytest.logging.LoggingPlugin object at 0x7f1cab5ce2b0>>]
kwargs = {'session': <Session session exitstatus=<ExitCode.OK: 0> testsfailed=0 testscollected=16>}

    def _hookexec(self, hook, methods, kwargs):
        # called from all hookcaller instances.
        # enable_tracing will set its own wrapping function at self._inner_hookexec
>       return self._inner_hookexec(hook, methods, kwargs)

/usr/local/lib/python3.6/dist-packages/pluggy/manager.py:87: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

hook = <_HookCaller 'pytest_runtestloop'>
methods = [<HookImpl plugin_name='main', plugin=<module '_pytest.main' from '/usr/local/lib/python3.6/dist-packages/_pytest/main...1cab5f7f98>>, <HookImpl plugin_name='logging-plugin', plugin=<_pytest.logging.LoggingPlugin object at 0x7f1cab5ce2b0>>]
kwargs = {'session': <Session session exitstatus=<ExitCode.OK: 0> testsfailed=0 testscollected=16>}

    self._inner_hookexec = lambda hook, methods, kwargs: hook.multicall(
        methods,
        kwargs,
>       firstresult=hook.spec.opts.get("firstresult") if hook.spec else False,
    )

/usr/local/lib/python3.6/dist-packages/pluggy/manager.py:81: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <xdist.dsession.DSession object at 0x7f1cab5f7f98>

    def pytest_runtestloop(self):
        self.sched = self.config.hook.pytest_xdist_make_scheduler(
            config=self.config, log=self.log
        )
        assert self.sched is not None

        self.shouldstop = False
        while not self.session_finished:
>           self.loop_once()

/usr/local/lib/python3.6/dist-packages/xdist/dsession.py:115: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <xdist.dsession.DSession object at 0x7f1cab5f7f98>

    def loop_once(self):
        """Process one callback from one of the workers."""
        while 1:
            if not self._active_nodes:
                # If everything has died stop looping
                self.triggershutdown()
                raise RuntimeError("Unexpectedly no active workers available")
            try:
>               eventcall = self.queue.get(timeout=2.0)

/usr/local/lib/python3.6/dist-packages/xdist/dsession.py:129: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <queue.Queue object at 0x7f1cab5f76d8>, block = True, timeout = 2.0

    def get(self, block=True, timeout=None):
        '''Remove and return an item from the queue.

        If optional args 'block' is true and 'timeout' is None (the default),
        block if necessary until an item is available. If 'timeout' is
        a non-negative number, it blocks at most 'timeout' seconds and raises
        the Empty exception if no item was available within that time.
        Otherwise ('block' is false), return an item if one is immediately
        available, else raise the Empty exception ('timeout' is ignored
        in that case).
        '''
        with self.not_empty:
            if not block:
                if not self._qsize():
                    raise Empty
            elif timeout is None:
                while not self._qsize():
                    self.not_empty.wait()
            elif timeout < 0:
                raise ValueError("'timeout' must be a non-negative number")
            else:
                endtime = time() + timeout
                while not self._qsize():
                    remaining = endtime - time()
                    if remaining <= 0.0:
                        raise Empty
>                   self.not_empty.wait(remaining)

/usr/lib/python3.6/queue.py:173: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <Condition(<unlocked _thread.lock object at 0x7f1cab4f2120>, 0)>, timeout = 1.9999943980947137

    def wait(self, timeout=None):
        """Wait until notified or until a timeout occurs.

        If the calling thread has not acquired the lock when this method is
        called, a RuntimeError is raised.

        This method releases the underlying lock, and then blocks until it is
        awakened by a notify() or notify_all() call for the same condition
        variable in another thread, or until the optional timeout occurs. Once
        awakened or timed out, it re-acquires the lock and returns.

        When the timeout argument is present and not None, it should be a
        floating point number specifying a timeout for the operation in seconds
        (or fractions thereof).

        When the underlying lock is an RLock, it is not released using its
        release() method, since this may not actually unlock the lock when it
        was acquired multiple times recursively. Instead, an internal interface
        of the RLock class is used, which really unlocks it even when it has
        been recursively acquired several times. Another internal interface is
        then used to restore the recursion level when the lock is reacquired.

        """
        if not self._is_owned():
            raise RuntimeError("cannot wait on un-acquired lock")
        waiter = _allocate_lock()
        waiter.acquire()
        self._waiters.append(waiter)
        saved_state = self._release_save()
        gotit = False
        try:    # restore state no matter what (e.g., KeyboardInterrupt)
            if timeout is None:
                waiter.acquire()
                gotit = True
            else:
                if timeout > 0:
>                   gotit = waiter.acquire(True, timeout)
E                   KeyboardInterrupt

/usr/lib/python3.6/threading.py:299: KeyboardInterrupt

mavo123 commented 4 years ago

Try to use pytest-timeout, pytest --timeout=, this will kill existing hang thread and move the execution for you.

tamaskakuszi commented 3 years ago

Hey @telles-simbiose and @BlackHobbiT,

Did you manage to make it work? We suffer from the same issue. The test run hangs on 93%. All the workers are busy, only killing the process in task manager solves it, meaning let the run continue. When that worker is crushed, the report is also lost from that specific test.

Thanks

vishakha-vonage commented 3 years ago

I'm facing a very weird issue. When I run my test in parallel- Only 1 browser opens even when I execute pytest -n 2 After first test is successful, pytest just hangs and it does not even timeout Has anyone faced this issue?

py.test -vv -n 2 =================================================================================================== test session starts =================================================================================================== platform darwin -- Python 3.7.3, pytest-5.4.1, py-1.10.0, pluggy-0.13.1 -- ///* cachedir: .pytest_cache rootdir: //*, inifile: pytest.ini plugins: xdist-1.31.0, allure-pytest-2.8.40, repeat-0.9.1, forked-1.1.3, timeout-1.4.2 [gw0] darwin Python 3.7.3 cwd: /*/ [gw1] darwin Python 3.7.3 cwd: //* collected 8 items
[gw0] Python 3.7.3 (default, Apr 24 2020, 18:51:23) -- [Clang 11.0.3 (clang-1103.0.32.62)] [gw1] Python 3.7.3 (default, Apr 24 2020, 18:51:23) -- [Clang 11.0.3 (clang-1103.0.32.62)] gw0 [8] / gw1 [8] scheduling tests via MarkBasedScheduler

test1 [gw1] [ 12%] PASSED test1

It will get stuck after this.

BlackHobbiT commented 2 years ago

@tamaskakuszi as far I remember, wiping pycache dirs sometimes helps.

JacobCallahan commented 1 year ago

@RonnyPfannschmidt we're also seeing this intermittently. If you'd like access to one of our environments, I can make that happen. Thanks!

RonnyPfannschmidt commented 1 year ago

@JacobCallahan shot me more details at the work channel

Bruniz commented 1 year ago

Any updated about this topic @RonnyPfannschmidt @JacobCallahan?

We are facing this in the CI of the company I work. We use xdist to run the tests in parallel and it seems this happens from time to time when fail fast is enabled and the test session is aborted. We then have a zombie process that is stuck running this command https://github.com/pytest-dev/execnet/blame/d7ca9815734a4efb168c3ef997858e38c040fc70/execnet/gateway_io.py#L58 as far as I can tell. It would make sense as we are using xdist.

I don't really understand what this line is supposed to do but it looks like some old workaround possibly?

I could also create an issue in execnet as well if that is of any use.

RonnyPfannschmidt commented 1 year ago

this line bootstraps execnet, the rest is fed as commend in stdio

Bruniz commented 1 year ago

Okay, any idea why processes can be left hanging on that command? There is this command seen in pstree and then it's waiting to read a file descriptor, I guess stdin, but nothing is being written there by any process.

RonnyPfannschmidt commented 1 year ago

That seems like the control process died and the worker is waiting for the shutdown command, fetching a stack trace with gdb is only partially helpful as the io is being handled multithreaded, and the state of the worker is unclear

Bruniz commented 1 year ago

Thanks for your time Ronny, I tried digging around with gdb but I only basically found a reference back to the code I mentioned. The reset of the trace was in C so it went a bit over my head. Is the interrupt from pytest then possibly not handled correctly. We only use the built in fail fast in pytest and don't kill anything extra.

RonnyPfannschmidt commented 1 year ago

@Bruniz thats unclear, its entirely possible our suite is hanging somewhere in c and the shut-down isn't reaching it with the information provided i'm unable to make a educated guess

Bruniz commented 1 year ago

Thanks for your time Ronny, I tried digging around with gdb but I only basically found a reference back to the code I mentioned. The reset of the trace was in C so it went a bit over my head. Is the interrupt from pytest then possibly not handled correctly. We only use the built in fail fast in pytest and don't kill anything extra.

RonnyPfannschmidt commented 1 year ago

@Bruniz by fail fast, do you mean -x ?

having a paste of command + command output would be a big help

webbnh commented 1 year ago

I'm seeing this problem in our CI now, too.

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
jenkins        1  0.0  0.0   1000     4 ?        Ss   02:32   0:00 /run/podman-init -- ./build.sh
jenkins        7  0.0  0.0   4072  3272 ?        S    02:32   0:00 /bin/bash -e ./build.sh
jenkins      227  1.6  0.0 850404 36284 ?        Sl   02:34   0:31 /usr/bin/python3 -P /usr/bin/tox
jenkins      387  0.3  0.0  86652 79016 ?        S    02:35   0:05 /var/tmp/jenkins/tox/.pkg/bin/python /usr/lib/python3.11/site-packages/pyproject_api/_backend.py True setuptools.build_meta
jenkins      450  0.0  0.0   4204  3320 ?        S    02:35   0:00 /bin/bash /home/jenkins/pbench/exec-tests /var/tmp/jenkins/tox/py39
jenkins      560  0.0  0.0   2332   916 ?        S    02:35   0:00 /usr/bin/time --format=\n\nCommand: '%C'\nExit status: %x\nTimings: user %Us, system %Ss, elapsed %es (%E, %P)\nMemory: max RSS %Mk, minor pf: %R, major pf: %F, swaps %W\nContext switches: inv %c, vol %w, signals %k\nI/O: fs in %I, fs out %O, socket in %r, socket out %s\n pytest --tb=native -n auto --basetemp=/var/tmp/jenkins/tox/py39/tmp --cov=/var/tmp/jenkins/tox/py39/lib/python3.9/site-packages/pbench --cov-report xml:cov/report.xml -rs --pyargs pbench.test.unit.common pbench.test.unit.agent pbench.test.functional.agent pbench.test.unit.client pbench.test.unit.server
jenkins      561  2.2  0.0 2587628 241432 ?      Sl   02:35   0:41 /var/tmp/jenkins/tox/py39/bin/python /var/tmp/jenkins/tox/py39/bin/pytest --tb=native -n auto --basetemp=/var/tmp/jenkins/tox/py39/tmp --cov=/var/tmp/jenkins/tox/py39/lib/python3.9/site-packages/pbench --cov-report xml:cov/report.xml -rs --pyargs pbench.test.unit.common pbench.test.unit.agent pbench.test.functional.agent pbench.test.unit.client pbench.test.unit.server
jenkins      567  0.5  0.0 307108 149232 ?       Sl   02:35   0:09 /var/tmp/jenkins/tox/py39/bin/python -u -c import sys;exec(eval(sys.stdin.readline()))
jenkins      570  0.4  0.0 299348 140964 ?       Sl   02:35   0:07 /var/tmp/jenkins/tox/py39/bin/python -u -c import sys;exec(eval(sys.stdin.readline()))
jenkins      573  0.5  0.0 308988 150952 ?       Sl   02:35   0:10 /var/tmp/jenkins/tox/py39/bin/python -u -c import sys;exec(eval(sys.stdin.readline()))
jenkins      576  0.4  0.0 302416 144392 ?       Sl   02:35   0:07 /var/tmp/jenkins/tox/py39/bin/python -u -c import sys;exec(eval(sys.stdin.readline()))
jenkins      579  0.5  0.0 311952 153348 ?       Sl   02:35   0:10 /var/tmp/jenkins/tox/py39/bin/python -u -c import sys;exec(eval(sys.stdin.readline()))
jenkins      582  0.5  0.0 305180 149068 ?       Sl   02:35   0:09 /var/tmp/jenkins/tox/py39/bin/python -u -c import sys;exec(eval(sys.stdin.readline()))
jenkins      585  0.6  0.0 313972 155868 ?       Sl   02:35   0:11 /var/tmp/jenkins/tox/py39/bin/python -u -c import sys;exec(eval(sys.stdin.readline()))
 [and 25 more worker processes in similar states]

What other information can I provide?

$ /var/tmp/jenkins/tox/py39/bin/python --version
Python 3.9.16
$ /var/tmp/jenkins/tox/py39/bin/pytest --version
pytest 7.4.0

Here's the log leading up to the hang:

py39: install_deps> pip install --cache-dir=/var/tmp/jenkins/tox/cache --progress-bar off --prefix=/var/tmp/jenkins/tox/py39 -r /home/jenkins/pbench/agent/requirements.txt -r /home/jenkins/pbench/agent/test-requirements.txt -r /home/jenkins/pbench/client/requirements.txt -r /home/jenkins/pbench/server/requirements.txt -r /home/jenkins/pbench/server/test-requirements.txt
.pkg: install_requires> python -I -m pip install 'setuptools>=46.1.3' 'wheel>=0.34.2'
.pkg: _optional_hooks> python /usr/lib/python3.11/site-packages/pyproject_api/_backend.py True setuptools.build_meta
.pkg: get_requires_for_build_sdist> python /usr/lib/python3.11/site-packages/pyproject_api/_backend.py True setuptools.build_meta
.pkg: install_requires_for_build_sdist> python -I -m pip install pbr
.pkg: prepare_metadata_for_build_wheel> python /usr/lib/python3.11/site-packages/pyproject_api/_backend.py True setuptools.build_meta
.pkg: build_sdist> python /usr/lib/python3.11/site-packages/pyproject_api/_backend.py True setuptools.build_meta
py39: install_package> pip install --cache-dir=/var/tmp/jenkins/tox/cache --progress-bar off --prefix=/var/tmp/jenkins/tox/py39 --force-reinstall --no-deps /var/tmp/jenkins/tox/.tmp/package/1/pbench-0.0.1.dev2674.tar.gz
py39: commands[0]> bash -c '/home/jenkins/pbench/exec-tests /var/tmp/jenkins/tox/py39 '

Running agent,client,server python3-based unit tests via pytest

Package                  Version
------------------------ -------------
alembic                  1.11.1
aniso8601                9.0.1
appier                   1.21.7
bcrypt                   4.0.1
Bcrypt-Flask             1.0.2
beautifulsoup4           4.9.3
blinker                  1.6.2
boto3                    1.17.97
botocore                 1.20.97
bottle                   0.12.25
bs4                      0.0.1
cachetools               4.2.2
certifi                  2021.5.30
cffi                     1.15.1
chardet                  4.0.0
click                    8.1.6
coverage                 7.2.7
cryptography             41.0.2
docutils                 0.20.1
elasticsearch            7.13.4
exceptiongroup           1.1.2
execnet                  2.0.2
filelock                 3.12.2
Flask                    2.3.2
Flask-Cors               4.0.0
Flask-HTTPAuth           4.8.0
Flask-JWT-Extended       4.5.2
Flask-Migrate            4.0.4
Flask-RESTful            0.3.10
Flask-SQLAlchemy         3.0.5
freezegun                1.2.2
gitdb                    4.0.10
GitPython                3.1.32
google-api               0.1.12
google-api-core          1.30.0
google-api-python-client 2.9.0
google-auth              1.31.0
google-auth-httplib2     0.1.0
google-auth-oauthlib     0.4.4
googleapis-common-protos 1.53.0
greenlet                 2.0.2
gunicorn                 21.2.0
hiredis                  2.2.3
httplib2                 0.19.1
humanize                 4.7.0
idna                     2.10
ifaddr                   0.2.0
importlib-metadata       6.8.0
iniconfig                2.0.0
itsdangerous             2.1.2
Jinja2                   3.1.2
jmespath                 0.10.0
lockfile                 0.12.2
Mako                     1.2.4
MarkupSafe               2.1.3
mock                     5.1.0
oauthlib                 3.1.1
packaging                20.9
pbench                   0.0.1.dev2674
pip                      22.3.1
pluggy                   1.2.0
pquisby                  0.0.12
protobuf                 3.17.3
psutil                   5.9.5
psycopg2                 2.9.6
pyasn1                   0.4.8
pyasn1-modules           0.2.8
pycparser                2.21
pyesbulk                 2.1.1
PyJWT                    2.8.0
pyparsing                2.4.7
pytest                   7.4.0
pytest-cov               4.1.0
pytest-dependency        0.5.1
pytest-freezegun         0.4.2
pytest-helpers-namespace 2019.1.8
pytest-mock              3.11.1
pytest-xdist             3.3.1
python-daemon            3.0.1
python-dateutil          2.8.1
python-pidfile           3.1.1
pytz                     2019.1
PyYAML                   6.0.1
redis                    3.5.3
requests                 2.25.1
requests-mock            1.11.0
requests-oauthlib        1.3.0
responses                0.23.1
rsa                      4.7.2
s3transfer               0.4.2
sdnotify                 0.3.2
setuptools               65.5.1
sh                       2.0.4
six                      1.16.0
smmap                    5.0.0
soupsieve                2.2.1
SQLAlchemy               2.0.19
SQLAlchemy-Utils         0.41.1
state-signals            1.0.1
tomli                    2.0.1
types-PyYAML             6.0.12.11
typing_extensions        4.7.1
uritemplate              3.0.1
urllib3                  1.26.5
Werkzeug                 2.3.6
wheel                    0.38.4
zipp                     3.16.2
============================= test session starts ==============================
platform linux -- Python 3.9.16, pytest-7.4.0, pluggy-1.2.0
cachedir: /var/tmp/jenkins/tox/py39/.pytest_cache
rootdir: /home/jenkins/pbench
configfile: pytest.ini
plugins: cov-4.1.0, freezegun-0.4.2, helpers-namespace-2019.1.8, requests-mock-1.11.0, mock-3.11.1, xdist-3.3.1, dependency-0.5.1
created: 32/32 workers
32 workers [1521 items]

[and nothing more...]

webbnh commented 1 year ago

Trying the tests in my development environment, they hang too. Since I'm running interactively, there, I get slightly more output from Pytest:

=============== test session starts ================
platform linux -- Python 3.9.16, pytest-7.4.0, pluggy-1.2.0
cachedir: /var/tmp/wscales/tox/py39/.pytest_cache
rootdir: /home/wscales/pbench
configfile: pytest.ini
plugins: requests-mock-1.11.0, xdist-3.3.1, mock-3.11.1, helpers-namespace-2019.1.8, freezegun-0.4.2, dependency-0.5.1, cov-4.1.0
8 workers [1521 items]  
..................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................ss....s.s.....s.........s..........................s..s...s.....s...s.s............................................................................................................................s..s..s...s......ss..s..s..s......s.s......ss.sss....s......s.....................................................

The ps output looks pretty much the same (only there are 8 workers instead of 32):

$ ps -auxww
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
wscales        1  0.0  0.0   1084     4 pts/0    Ss   04:32   0:00 /run/podman-init -- tox
wscales        2  2.4  0.0 776700 38664 pts/0    Sl+  04:32   0:04 /usr/bin/python3 -P /usr/bin/tox
wscales      162  4.4  0.1  88460 81804 pts/0    S+   04:32   0:06 /var/tmp/wscales/tox/.pkg/bin/python /usr/lib/python3.11/site-packages/pyproject_api/_backend.py True setuptools.build_meta
wscales      226  0.0  0.0   4204  3244 pts/0    S+   04:32   0:00 /bin/bash /home/wscales/pbench/exec-tests /var/tmp/wscales/tox/py39
wscales      270  0.0  0.0   2332   912 pts/0    S+   04:32   0:00 /usr/bin/time --format=\n\nCommand: '%C'\nExit status: %x\nTimings: user %Us, system %Ss, elapsed %es (%E, %P)\nMemory: max RSS %Mk, minor pf: %R, major pf: %F, swaps %W\nContext switches: inv %c, vol %w, signals %k\nI/O: fs in %I, fs out %O, socket in %r, socket out %s\n pytest --tb=native -n auto --basetemp=/var/tmp/wscales/tox/py39/tmp --cov=/var/tmp/wscales/tox/py39/lib/python3.9/site-packages/pbench --cov-report html:/var/tmp/wscales/tox/py39/cov/html -rs --pyargs pbench.test.unit.common pbench.test.unit.agent pbench.test.functional.agent pbench.test.unit.client pbench.test.unit.server
wscales      271  7.8  0.1 698088 107188 pts/0   Sl+  04:32   0:09 /var/tmp/wscales/tox/py39/bin/python /var/tmp/wscales/tox/py39/bin/pytest --tb=native -n auto --basetemp=/var/tmp/wscales/tox/py39/tmp --cov=/var/tmp/wscales/tox/py39/lib/python3.9/site-packages/pbench --cov-report html:/var/tmp/wscales/tox/py39/cov/html -rs --pyargs pbench.test.unit.common pbench.test.unit.agent pbench.test.functional.agent pbench.test.unit.client pbench.test.unit.server
wscales      276 15.3  0.2 322884 164796 pts/0   Sl+  04:32   0:18 /var/tmp/wscales/tox/py39/bin/python -u -c import sys;exec(eval(sys.stdin.readline()))
wscales      280  5.6  0.1 215148 126652 pts/0   Sl+  04:32   0:06 /var/tmp/wscales/tox/py39/bin/python -u -c import sys;exec(eval(sys.stdin.readline()))
wscales      284 15.6  0.3 355472 196948 pts/0   Sl+  04:32   0:18 /var/tmp/wscales/tox/py39/bin/python -u -c import sys;exec(eval(sys.stdin.readline()))
wscales      288 14.8  0.3 359424 197008 pts/0   Sl+  04:32   0:17 /var/tmp/wscales/tox/py39/bin/python -u -c import sys;exec(eval(sys.stdin.readline()))
wscales      292 13.5  0.2 324888 166400 pts/0   Sl+  04:32   0:16 /var/tmp/wscales/tox/py39/bin/python -u -c import sys;exec(eval(sys.stdin.readline()))
wscales      296 13.3  0.2 326292 167844 pts/0   Sl+  04:32   0:16 /var/tmp/wscales/tox/py39/bin/python -u -c import sys;exec(eval(sys.stdin.readline()))
wscales      300 13.5  0.2 319792 161728 pts/0   Sl+  04:32   0:16 /var/tmp/wscales/tox/py39/bin/python -u -c import sys;exec(eval(sys.stdin.readline()))
wscales      304 11.6  0.2 341464 180100 pts/0   Sl+  04:32   0:14 /var/tmp/wscales/tox/py39/bin/python -u -c import sys;exec(eval(sys.stdin.readline()))

webbnh commented 1 year ago

Nevermind!...

It turned out that one parametrize'd scenario of one test was sitting in an infinite loop, and that caused the test run to hang a few tests short of the end. Once I found and addressed the problem, the tests no longer hang.

That's great, but was there some way that I should have been able to find this more easily?

RonnyPfannschmidt commented 1 year ago

The timeout Plugin tends to be a great Help for hangups

nicoddemus commented 1 year ago

Anybody wants to contribute a change to the docs mentioning pytest-timeout? That would be excellent.

webbnh commented 1 year ago

Anybody wants to contribute a change to the docs mentioning pytest-timeout? That would be excellent.

😁 Indeed, especially if it points out the fact that it is hard to figure out which test is hanging without it!

RonnyPfannschmidt commented 1 year ago

@nicoddemus I'm wondering if xdist should ensure to identify all currently running tests and their phases whenever a node exceeds a predetermined timeframe

An even further expansion of this might be printing stacktraces

nicoddemus commented 1 year ago

Sounds good @RonnyPfannschmidt, indeed it makes sense for a new option to at least warn the user if a test has been running for X seconds (configurable, perhaps with a reasonable default of say 120s). However I would leave the job of cancelling long running test to pytest-timeout.

RonnyPfannschmidt commented 1 year ago

Indeed, debugging Print is fine but the terminate gun ought to be opt in

pytest-dev / pytest-xdist

pytest hangs while running tests #110