We're seeing some hangs in the stem unit tests in tor's CI: https://travis-ci.org/github/teor2345/tor/jobs/681198159#L3671

They aren't happening very often right now, so this issue is currently a low priority. (But our experience of CI is that intermittent issues can suddenly happen all the time, due to code or infrastructure changes.)
These hangs could be caused by a tor bug, a stem bug, a race condition, or a Travis CI infrastructure issue.
I can't see any obvious bugs here, but if you find a stem or tor issue, please let us know.
To help us diagnose these kinds of issues, we'd like stem's blocking and busy-waiting operations to have a timeout. Travis CI times out after 10 minutes, and we terminate stem after 9 minutes. So any timeout should be less than 8 minutes. (Or lower, if we need to allow more test setup and run time.)
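To get tracebacks like the ones below without waiting for the final ABRT, a test harness can dump every thread's stack when the warning signal arrives. A minimal sketch using Python's stdlib faulthandler (not stem code; the file name is made up):

```python
import faulthandler
import os
import signal

# Write the stack dumps to a file so they survive into the CI logs.
log = open('stack_dump.txt', 'w')  # hypothetical path
faulthandler.register(signal.SIGUSR1, file=log, all_threads=True)

# Simulate the warning signal that timelimit sends at 9 minutes; the
# process keeps running, so the later ABRT still terminates it.
os.kill(os.getpid(), signal.SIGUSR1)

log.flush()
dump = open('stack_dump.txt').read()
```

After the signal, `dump` holds a stack trace for every thread, which is usually enough to see which blocking call is stuck.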
If stem should block indefinitely in normal use, and only time out during tests, we're happy to set a command-line argument or environment variable.

We can add timeouts as needed; we don't need to fix everything in one go.
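As a sketch of the environment-variable override suggested above (all names here are hypothetical, not stem's actual API or variables):

```python
import os

def reply_timeout():
    # Hypothetical: block indefinitely by default, but let the test
    # harness export STEM_TEST_TIMEOUT (in seconds) to cap blocking calls.
    value = os.environ.get('STEM_TEST_TIMEOUT')
    return float(value) if value else None  # None = block forever

# Queue.get(timeout=None) blocks indefinitely, so a single call site
# could serve both normal use and tests:
#   response = self._reply_queue.get(timeout=reply_timeout())
```

That way the timeout only exists when the CI harness asks for it, and production callers see no behavior change.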
So let's focus on these two timeouts in this ticket:
queue.get(), in response to the timelimit warning USR1 signal at 9 minutes:
Running tests...
control.base_controller... success (1.51s)
control.controller... timelimit: sending warning signal 10
...
File "/home/travis/build/teor2345/tor/stem/test/integ/control/controller.py", line 1468, in test_transition_to_relay
controller.set_options({'OrPort': '9090', 'DisableNetwork': '1'})
File "/home/travis/build/teor2345/tor/stem/stem/control.py", line 2413, in set_options
response = self.msg(query)
File "/home/travis/build/teor2345/tor/stem/stem/control.py", line 644, in msg
response = self._reply_queue.get()
File "/usr/lib/python3.6/queue.py", line 164, in get
self.not_empty.wait()
File "/usr/lib/python3.6/threading.py", line 295, in wait
waiter.acquire()
...
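For this first hang, queue.Queue.get() already accepts a timeout argument and raises queue.Empty when it expires, so the change may be small. A sketch of the idea, with made-up names, using the 8-minute budget above:

```python
import queue

REPLY_TIMEOUT = 8 * 60  # seconds; stays under timelimit's 9-minute ABRT

def get_reply(reply_queue, timeout=REPLY_TIMEOUT):
    # Fail loudly instead of hanging forever on a lost reply.
    try:
        return reply_queue.get(timeout=timeout)
    except queue.Empty:
        raise TimeoutError('no reply from tor within %s seconds' % timeout)
```

The exception type and helper name are my choice here, not stem's.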
Looping time.sleep(0.1), in response to the timelimit termination ABRT signal 30 seconds later:
File "/home/travis/build/teor2345/tor/stem/test/integ/installation.py", line 64, in run_tests
stem.util.test_tools.ASYNC_TESTS['test.integ.installation.test_sdist'].run(test_install.pid())
File "/home/travis/build/teor2345/tor/stem/stem/util/test_tools.py", line 187, in run
self._process.start()
File "/usr/lib/python3.6/multiprocessing/process.py", line 105, in start
self._popen = self._Popen(self)
...
File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/home/travis/build/teor2345/tor/stem/stem/util/test_tools.py", line 155, in _wrapper
runner(*args) if args else runner()
File "/home/travis/build/teor2345/tor/stem/test/integ/installation.py", line 135, in test_sdist
time.sleep(0.1) # we need to run these tests serially
...
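For this second hang, the polling loop can carry a deadline based on time.monotonic(). A hedged sketch (wait_for, the poll interval, and the default timeout are illustrative, not stem code):

```python
import time

def wait_for(condition, timeout=8 * 60, interval=0.1):
    # Poll condition() like the loop above, but give up at a deadline
    # instead of sleeping forever.
    deadline = time.monotonic() + timeout
    while not condition():
        if time.monotonic() >= deadline:
            raise TimeoutError('gave up after %.1f seconds' % timeout)
        time.sleep(interval)
```

time.monotonic() is preferred over time.time() here because it can't jump backwards if the clock is adjusted mid-test.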
I'm cc'ing @ahf on this issue, because I'm transitioning CI responsibilities to him.