pytest-dev / pytest-xdist

pytest plugin for distributed testing and loop-on-failures testing modes.
https://pytest-xdist.readthedocs.io
MIT License

pytest-xdist server side timeout #57

Open limaoscarjuliet opened 8 years ago

limaoscarjuliet commented 8 years ago

NOTE: This is not about a timeout for the test code itself (pytest-timeout works well for that); it is about the need for a timeout in pytest-xdist itself.

First, let me say a big thank you for pytest and pytest-xdist. We use it to run ~400 Docker containers on ~10 servers on AWS. It works wonders!

There are scenarios where pytest-xdist does not detect a remote session crash or disconnect, and so it waits for results forever.

Today's xdist code detects a session crash via EOF on the SSH session. When the network connection is torn down, the server marks the worker as dead and re-adds it. All good.

But... consider a scenario where the SSH is not torn down:

  1. Run N tests on multiple remote machines with pytest-xdist.
  2. The tests spawn a Python process on the remote machine via SSH.
  3. We run in boxed mode, so this process forks to run the actual test code.
  4. The process from step 2 gets killed or crashes.
  5. The SSH session stays up because the forked process from step 3 inherited at least one of stdin/stdout/stderr from the step 2 process (standard SSH behavior).

In this case, the server-side xdist thinks the session is still up and waits for the results for a really, really long time ;-)
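
The inherited-descriptor effect described above can be reproduced locally without SSH. The following is only a minimal sketch under stated assumptions: it runs on a POSIX system, a local pipe stands in for the SSH channel, and the names WORKER and eof_seen_after_kill are hypothetical helpers, not part of pytest-xdist or execnet. A stand-in worker forks a child that keeps the inherited stdout open, the worker is killed with SIGKILL (roughly what an OOM kill does), and the reader then checks whether EOF ever shows up on the pipe.

```python
import os
import select
import signal
import subprocess
import sys
import textwrap

# Stand-in for the step 2 process: it forks a child (the step 3 process)
# that inherits stdout, then idles until it is killed.
WORKER = textwrap.dedent("""
    import os, sys, time
    if os.fork() == 0:
        time.sleep(3)      # child: does nothing, but keeps stdout open
        os._exit(0)
    sys.stdout.write("ready\\n")
    sys.stdout.flush()
    time.sleep(60)         # parent: waits to be killed
""")


def eof_seen_after_kill(wait_seconds=1.0):
    """Kill the worker and report whether the reader sees EOF in time."""
    proc = subprocess.Popen([sys.executable, "-c", WORKER],
                            stdout=subprocess.PIPE)
    proc.stdout.readline()            # "ready" means the fork already happened
    proc.send_signal(signal.SIGKILL)  # simulate the OOM kill
    proc.wait()                       # the worker itself is now gone
    # At EOF the pipe becomes readable and a read returns b"".
    readable, _, _ = select.select([proc.stdout], [], [], wait_seconds)
    eof = bool(readable) and os.read(proc.stdout.fileno(), 1) == b""
    proc.stdout.close()
    return eof


if __name__ == "__main__":
    print("EOF seen after kill:", eof_seen_after_kill())
```

As long as the forked child is alive, the write end of the pipe stays open, so the reader gets no EOF even though the process it started is dead. This is the same reason the SSH session in the scenario above never looks closed to the server side.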

And yes, the process from step 2 does not normally crash. In our case it was OOM-killed quite persistently. All it takes is one OOM kill in tens of thousands of tests and the entire batch is ruined.

Please let me know if I can provide more info on this issue.

    [root@nsth-c10 nsth] # python --version
    Python 2.7.10
    [root@nsth-c10 nsth] # py.test --version
    This is pytest version 2.8.0, imported from /usr/local/lib/python2.7/site-packages/pytest-2.8.0-py2.7.egg/pytest.pyc
    setuptools registered plugins:
      pytest-xdist-1.13.1 at /usr/local/lib/python2.7/site-packages/pytest_xdist-1.13.1-py2.7.egg/xdist/boxed.pyc
      pytest-xdist-1.13.1 at /usr/local/lib/python2.7/site-packages/pytest_xdist-1.13.1-py2.7.egg/xdist/looponfail.pyc
      pytest-xdist-1.13.1 at /usr/local/lib/python2.7/site-packages/pytest_xdist-1.13.1-py2.7.egg/xdist/plugin.pyc
    [root@nsth-c10 nsth] #

P.S. Moved from https://github.com/pytest-dev/pytest/issues/1550

RonnyPfannschmidt commented 8 years ago

I think this one depends on #20. With the current codebase it's really tricky to introduce heartbeats on top of the support for node restarts.

Since we can't detect a dead SSH session due to the default behaviour, we need some kind of heartbeat mechanism so we can be aware of sessions in an unresponsive state.

I think this is an item for execnet itself.
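
To make the heartbeat idea concrete, here is a rough sketch of the bookkeeping the server side would need. It is purely illustrative: HeartbeatMonitor, beat and unresponsive are hypothetical names, and neither xdist nor execnet is assumed to expose anything like this. It only shows that "unresponsive" can be decided from the time of the last message, independent of whether the SSH channel has reported EOF.

```python
import time


class HeartbeatMonitor:
    """Tracks when each worker was last heard from (hypothetical helper)."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout   # seconds of silence tolerated per worker
        self.clock = clock       # injectable so the logic can be tested
        self.last_seen = {}      # worker id -> time of the last heartbeat

    def beat(self, worker_id):
        """Record a heartbeat (or any other message) from a worker."""
        self.last_seen[worker_id] = self.clock()

    def unresponsive(self):
        """Return the ids of workers silent for longer than the timeout."""
        now = self.clock()
        return sorted(worker for worker, seen in self.last_seen.items()
                      if now - seen > self.timeout)
```

A scheduler loop could call unresponsive() periodically and treat the returned workers the same way it already treats a worker whose channel hit EOF, i.e. mark them dead and reschedule their tests. How that interacts with node restarts is exactly the tricky part mentioned above and is not addressed by this sketch.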

limaoscarjuliet commented 8 years ago

We addressed the underlying root cause by increasing the amount of memory each container can use (the docker run --memory option). But, of course, there are other ways a worker may lock up or crash, so addressing this would still help.

Thank you for taking this into account in the future.