Timeout when tor or stem hang during unit tests

We're seeing some hangs in the stem unit tests in tor's CI: https://travis-ci.org/github/teor2345/tor/jobs/681198159#L3671

They aren't happening very often right now, so this issue is currently a low priority. (But our experience of CI is that intermittent issues can suddenly happen all the time, due to code or infrastructure changes.)

These hangs could be caused by a tor bug, a stem bug, a race condition, or a Travis CI infrastructure issue.

I can't see any obvious bugs here, but if you find a stem or tor issue, please let us know.

To help us diagnose these kinds of issues, we'd like stem's blocking and busy-waiting operations to have a timeout. Travis CI times out after 10 minutes, and we terminate stem after 9 minutes. So any timeout should be less than 8 minutes. (Or lower, if we need to allow more test setup and run time.)

If stem should block indefinitely, and only timeout during tests, we're happy to set a command-line argument or environmental variable.

We can add timeouts as needed, we don't need to fix everything in one go.

So let's focus on these two timeouts in this ticket:

queue.get(), in response to the timelimit warning USR1 signal at 9 minutes:

Running tests...
  control.base_controller...                           success (1.51s)
  control.controller...                               timelimit: sending warning signal 10
...
  File "/home/travis/build/teor2345/tor/stem/test/integ/control/controller.py", line 1468, in test_transition_to_relay
    controller.set_options({'OrPort': '9090', 'DisableNetwork': '1'})
  File "/home/travis/build/teor2345/tor/stem/stem/control.py", line 2413, in set_options
    response = self.msg(query)
  File "/home/travis/build/teor2345/tor/stem/stem/control.py", line 644, in msg
    response = self._reply_queue.get()
  File "/usr/lib/python3.6/queue.py", line 164, in get
    self.not_empty.wait()
  File "/usr/lib/python3.6/threading.py", line 295, in wait
    waiter.acquire()
...

Looping time.sleep(0.1), in response to the timelimit termination ABRT signal 30 seconds later:

  File "/home/travis/build/teor2345/tor/stem/test/integ/installation.py", line 64, in run_tests
    stem.util.test_tools.ASYNC_TESTS['test.integ.installation.test_sdist'].run(test_install.pid())
  File "/home/travis/build/teor2345/tor/stem/stem/util/test_tools.py", line 187, in run
    self._process.start()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
...
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/travis/build/teor2345/tor/stem/stem/util/test_tools.py", line 155, in _wrapper
    runner(*args) if args else runner()
  File "/home/travis/build/teor2345/tor/stem/test/integ/installation.py", line 135, in test_sdist
    time.sleep(0.1)  # we need to run these tests serially
...

I'm cc'ing @ahf on this issue, because I'm transitioning CI responsibilities to him.

torproject / stem

Timeout when tor or stem hang during unit tests #61