torproject / stem

Python controller library for Tor
https://stem.torproject.org/
GNU Lesser General Public License v3.0

test_take_ownership_via_controller fails in Tor's "make test-stem" #52

Closed: teor2345 closed this issue 4 years ago

teor2345 commented 4 years ago

I'm not sure if this is a stem or tor issue, or even a Travis load issue:

FAIL: test_take_ownership_via_controller

Traceback (most recent call last):
  File "/home/travis/build/torproject/tor/stem/stem/util/test_tools.py", line 143, in <lambda>
    self.method = lambda test: self.result(test)  # method that can be mixed into TestCases
  File "/home/travis/build/torproject/tor/stem/stem/util/test_tools.py", line 210, in result
    test.fail(self._result.msg)
AssertionError: tor didn't quit after the controller that owned it disconnected

See https://travis-ci.org/torproject/tor/jobs/637520130#L3650 for details

atagar commented 4 years ago

Hi teor. You previously asked this on...

https://trac.torproject.org/projects/tor/ticket/32819

Stem waits up to five seconds for the process to terminate, and tor did not do so. The ball is now in tor's court to fix this, or tell me if the SLA should be something else.

teor2345 commented 4 years ago

Tor only checks for its owning controller process every 15 seconds, so the limit should be at least 20 seconds: https://github.com/torproject/tor/blob/4f02812242d1fd90d859eb98ac3fb1ed182f18cf/src/lib/evloop/procmon.c#L168
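
For illustration, here's a minimal sketch of the kind of bounded wait this implies, assuming stem holds the subprocess.Popen handle for tor (the helper name and the exact 5-second margin are just for this example, not stem's actual code):

import subprocess

# Tor's procmon fallback only polls for its owning controller every 15
# seconds, so the wait needs to comfortably exceed one polling interval.
OWNERSHIP_POLL_INTERVAL = 15
EXIT_TIMEOUT = OWNERSHIP_POLL_INTERVAL + 5  # the 20 seconds suggested above

def wait_for_owned_tor_exit(tor_process):
  # tor_process is the subprocess.Popen for the tor instance under test
  try:
    return tor_process.wait(timeout = EXIT_TIMEOUT)
  except subprocess.TimeoutExpired:
    raise AssertionError("tor didn't quit after the controller that owned it disconnected")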

atagar commented 4 years ago

Thanks teor! Timeout changed...

https://gitweb.torproject.org/stem.git/commit/?id=e64a118

teor2345 commented 4 years ago

(The polling is a fallback on some systems, other systems get a notification about process changes immediately.)

teor2345 commented 4 years ago

Tor's CI is still failing with the same Stem test error: https://travis-ci.org/torproject/tor/jobs/639546796#L3657

Can you display the tor logs when a stem test fails? Then we'll have more information to help diagnose this issue.

atagar commented 4 years ago

Hi teor. Tor's logs are available on disk at...

Log debug file /home/travis/build/torproject/tor/stem/test/data/tor_log

Dumping the log for any and all test failures would be unhelpfully verbose. Rather, I'd suggest having the travis process that runs the tests cat that file when there is a non-zero exit status.
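A rough sketch of what I mean (in practice this would probably just be a couple of lines of shell in .travis.yml; the paths and test invocation below are only examples):

import subprocess
import sys

# Run the integration tests, and dump tor's debug log only if they fail.
# Path is relative to the stem checkout, as in the torrc line above.
TOR_LOG = 'test/data/tor_log'

result = subprocess.run(['python3', 'run_tests.py', '--tor', 'src/app/tor', '--integ'])

if result.returncode != 0:
  with open(TOR_LOG) as log_file:
    sys.stdout.write(log_file.read())

sys.exit(result.returncode)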

teor2345 commented 4 years ago

Thanks, give me a few moments, and I'll come back with the tor logs, and hopefully the next steps to fix these issues.

teor2345 commented 4 years ago

Here are the Tor logs: https://travis-ci.org/torproject/tor/jobs/639546796#L4730

And the stem trace logs: https://travis-ci.org/torproject/tor/jobs/639546796#L5731

I can't see the stem test names in either log, so I'm going to set up a CI branch that just runs the owning controller process test.

Unless you can find the output for that particular test?

teor2345 commented 4 years ago

Here's the failure in Travis:

Stem logs: https://travis-ci.org/teor2345/tor/builds/639735548#L5757

Tor logs: https://travis-ci.org/teor2345/tor/builds/639735548#L4708

I think the Tor logs are too verbose to show anything interesting. Is there a way of restricting them to info-level, or to particular log domains?

I can't grep for log domains, because they're not shown in these logs. But I can try to find the right functions.

I can also reproduce this failure locally with the latest tor and stem master.

teor2345 commented 4 years ago

This test passes on 0.4.2, but fails on master. So it's probably a Tor bug: https://travis-ci.org/torproject/tor/jobs/637709536

teor2345 commented 4 years ago

We're tracking it in https://trac.torproject.org/projects/tor/ticket/33006

teor2345 commented 4 years ago

This issue is caused by the precise timing of Tor's controller implementation.

On my machine, it passes on:

- Tor 0.4.2
- Tor master, built without --enable-fragile-hardening

And it fails on:

- Tor master, built with --enable-fragile-hardening

On Travis, we see the same results, but Travis always uses --enable-fragile-hardening for its stem tests.

It looks like this timing issue was introduced in the https://trac.torproject.org/projects/tor/ticket/30984 refactor, in commit torproject/tor@c744d23c8d. (At least on my machine.)

Tor doesn't guarantee control reply timing. And we're unlikely to be able to restore the old timing behaviour. So stem's tests should be adapted to work with the timing in both Tor 0.4.2 and Tor master.

See https://trac.torproject.org/projects/tor/ticket/33006 for details.

atagar commented 4 years ago

Tor doesn't guarantee control reply timing. And we're unlikely to be able to restore the old timing behaviour. So stem's tests should be adapted to work with the timing in both Tor 0.4.2 and Tor master.

Hi teor. Sorry, I don't follow - what is the duration our tests should wait for tor to terminate, or are you saying we should simply drop the test?

teor2345 commented 4 years ago

This test works on Tor 0.4.2, and Tor master when compiled without extra hardening checks. But it fails when master is compiled with hardening.

So either:

- there is a timing race condition in stem's test, where it depends on the exact millisecond timing of tor's replies, or
- the hardened build changes tor's behaviour in a way that breaks the test.

I can't see any obvious changes in Tor's output, and I don't know enough about the design of stem's tests to further diagnose this issue.

We could try doing a diff of stem's trace logs on master, with and without hardening?

But I won't be able to do that today; I have to finish off a proposal for sponsored work.

atagar commented 4 years ago

there is a timing race condition in stem's test, where it depends on the exact millisecond timing of tor's replies

Gotcha. The test is simple...

Nothing about this involves precise timing; the only time-bound part is a generous twenty-second window. This should be manually reproducible without running the tests.

I'd be happy to adjust the test if there is something different that it should do, but for the moment I'm unsure if this is actionable on my part.

teor2345 commented 4 years ago

The tests connect to tor 4 times (or send 4 batches of commands), and they do a lot more than you've described.

The third and fourth connections are interleaved, and that interleaving appears to trigger a race condition in stem's tests (or stem, or tor).

But it's hard to diagnose this issue, because it's unclear from the stem trace logs:

- which connection each command and reply belongs to, and
- whether all of these connections go to the same tor process as the test.

Here are the detailed logs from the failing test at: https://travis-ci.org/torproject/tor/builds/639737322#L5715

The first connection isn't part of the test, but it sets __OwningControllerProcess=20405, runs TAKEOWNERSHIP, then resets __OwningControllerProcess:

01/21/2020 01:31:27 [TRACE] Sent to tor: PROTOCOLINFO 1
01/21/2020 01:31:27 [TRACE] Received from tor:
250-PROTOCOLINFO 1
250-AUTH METHODS=NULL
250-VERSION Tor="0.4.3.0-alpha-dev"
250 OK
01/21/2020 01:31:27 [TRACE] Sent to tor: AUTHENTICATE
01/21/2020 01:31:27 [TRACE] Received from tor: 250 OK
01/21/2020 01:31:27 [TRACE] Sent to tor: SETEVENTS SIGNAL CONF_CHANGED STATUS_SERVER
01/21/2020 01:31:27 [TRACE] Received from tor: 250 OK
01/21/2020 01:31:27 [TRACE] Sent to tor: GETCONF __owningcontrollerprocess
01/21/2020 01:31:27 [TRACE] Received from tor: 250 __OwningControllerProcess=20405
01/21/2020 01:31:27 [DEBUG] GETCONF __owningcontrollerprocess (runtime: 0.0007)
01/21/2020 01:31:27 [TRACE] Sent to tor: TAKEOWNERSHIP
01/21/2020 01:31:27 [TRACE] Received from tor: 250 OK
01/21/2020 01:31:27 [TRACE] Sent to tor: RESETCONF __OwningControllerProcess
01/21/2020 01:31:27 [TRACE] Received from tor: 250 OK
01/21/2020 01:31:27 [TRACE] Received from tor:
650-CONF_CHANGED
650-__OwningControllerProcess
650 OK
01/21/2020 01:31:27 [DEBUG] RESETCONF __OwningControllerProcess (runtime: 0.0127)
...

Are these commands sent to the same tor process as the test?

The next connections are part of the test:

01/21/2020 01:31:27 [NOTICE] Beginning test test.integ.process.TestProcess.test_take_ownership_via_controller
01/21/2020 01:31:28 [TRACE] Sent to tor: PROTOCOLINFO 1
...

The second connection gets the control port:

01/21/2020 01:31:28 [TRACE] Received from tor:
250-PROTOCOLINFO 1
250-AUTH METHODS=NULL
250-VERSION Tor="0.4.3.0-alpha-dev"
250 OK
01/21/2020 01:31:28 [TRACE] Sent to tor: AUTHENTICATE
01/21/2020 01:31:28 [TRACE] Received from tor: 250 OK
01/21/2020 01:31:28 [TRACE] Sent to tor: GETCONF ControlPort
01/21/2020 01:31:28 [TRACE] Received from tor: 250 ControlPort=14241

The third connection gets __OwningControllerProcess=20486, runs TAKEOWNERSHIP, and runs RESETCONF __OwningControllerProcess, and then is interrupted by the fourth connection:

01/21/2020 01:31:28 [TRACE] Sent to tor: PROTOCOLINFO 1
01/21/2020 01:31:28 [TRACE] Received from tor:
250-PROTOCOLINFO 1
250-AUTH METHODS=NULL
250-VERSION Tor="0.4.3.0-alpha-dev"
250 OK
01/21/2020 01:31:28 [TRACE] Received from tor: 250 OK
01/21/2020 01:31:28 [TRACE] Sent to tor: AUTHENTICATE
01/21/2020 01:31:28 [TRACE] Sent to tor: SETEVENTS SIGNAL CONF_CHANGED STATUS_SERVER
01/21/2020 01:31:28 [TRACE] Received from tor: 250 OK
01/21/2020 01:31:28 [TRACE] Sent to tor: GETCONF __owningcontrollerprocess
01/21/2020 01:31:28 [TRACE] Received from tor: 250 __OwningControllerProcess=20486
01/21/2020 01:31:28 [DEBUG] GETCONF __owningcontrollerprocess (runtime: 0.0007)
01/21/2020 01:31:28 [TRACE] Sent to tor: TAKEOWNERSHIP
01/21/2020 01:31:28 [TRACE] Received from tor: 250 OK
01/21/2020 01:31:28 [TRACE] Sent to tor: RESETCONF __OwningControllerProcess

The fourth connection starts:

01/21/2020 01:31:28 [TRACE] Sent to tor: PROTOCOLINFO 1
01/21/2020 01:31:28 [TRACE] Received from tor:
250-PROTOCOLINFO 1
250-AUTH METHODS=NULL
250-VERSION Tor="0.4.3.0-alpha-dev"
250 OK
01/21/2020 01:31:28 [TRACE] Sent to tor: AUTHENTICATE
01/21/2020 01:31:28 [TRACE] Received from tor: 250 OK

It's unclear which connection this is, but it's probably the fourth connection:

01/21/2020 01:31:28 [TRACE] Sent to tor: GETCONF ControlPort
01/21/2020 01:31:28 [TRACE] Received from tor: 250 OK

It's unclear which connection this is, but it's probably the third connection:

01/21/2020 01:31:28 [TRACE] Received from tor:
650-CONF_CHANGED
650-__OwningControllerProcess
650 OK

It's unclear which connection this is, but it's probably the fourth connection:

01/21/2020 01:31:28 [TRACE] Received from tor: 250 ControlPort=50000

It's unclear which connection this is, but it's probably the end of the third connection:

01/21/2020 01:31:28 [DEBUG] RESETCONF __OwningControllerProcess (runtime: 0.0146)

It appears that stem stopped writing commands on the fourth connection and never finished it. There's no final [DEBUG] log for that connection, so I can't tell what it was trying to do.

And then stem makes a few system calls:

01/21/2020 01:31:28 [DEBUG] System call: /usr/bin/python3 setup.py install --prefix /tmp/stem_test (runtime: 1.56)
...

And 20 seconds later, the test finishes, due to a timeout failure:

01/21/2020 01:31:48 [NOTICE] Finished test test.integ.process.TestProcess.test_take_ownership_via_controller

Here's what I need to know to diagnose this issue:

Here's what stem might be doing wrong:

atagar commented 4 years ago

The tests connect to tor 4 times (or send 4 batches of commands), and they do a lot more than you've described.

I suspect you're getting confused by multiple tests running in parallel. Above I advised reproducing this outside the tests - did you try that?

If not, here's how you do so...

import logging
import os
import tempfile
import time

import stem.control
import stem.process
import stem.util.log

DATA_DIR = '/tmp/demo_test'

def test_take_ownership_via_controller():
  with tempfile.TemporaryDirectory() as data_directory:
    control_port = '1244'

    tor_process = stem.process.launch_tor_with_config(
      tor_cmd = 'tor',
      config = {
        'SocksPort': '1245',
        'ControlPort': control_port,
        'DataDirectory': DATA_DIR,
      },
      completion_percent = 0,
      take_ownership = True,
    )

    # We're the controlling process. Just need to connect then disconnect.

    controller = stem.control.Controller.from_port(port = int(control_port))
    controller.authenticate()
    controller.close()

    # give tor a few seconds to quit
    start_time = time.time()

    while time.time() - start_time < 20:
      if tor_process.poll() == 0:
        print('tor terminated after %0.2f seconds' % (time.time() - start_time))
        return  # tor exited

      time.sleep(0.01)

    raise AssertionError("tor didn't quit after the controller that owned it disconnected")

if __name__ == '__main__':
  if not os.path.exists(DATA_DIR):
    os.makedirs(DATA_DIR)

  # set up trace logging

  handler = logging.FileHandler(os.path.join(DATA_DIR, 'log'), mode = 'w')
  handler.setLevel(logging.DEBUG - 5)
  handler.setFormatter(stem.util.log.FORMATTER)
  stem.util.log.get_logger().addHandler(handler)

  test_take_ownership_via_controller()

When I run that I get...

atagar@morrigan:~/Desktop/stem$ python3.7 demo.py 
tor terminated after 0.01 seconds

atagar@morrigan:~/Desktop/stem$ cat /tmp/demo_test/log 
01/22/2020 18:21:10 [DEBUG] System call: tor --version (runtime: 0.01)
01/22/2020 18:21:10 [TRACE] Received from system (tor --version), stdout:
b'Tor version 0.4.3.0-alpha-dev (git-5e70c27e8560ac18).'
01/22/2020 18:21:10 [TRACE] Sent to tor: PROTOCOLINFO 1
01/22/2020 18:21:10 [TRACE] Received from tor:
250-PROTOCOLINFO 1
250-AUTH METHODS=NULL
250-VERSION Tor="0.4.3.0-alpha-dev"
250 OK
01/22/2020 18:21:10 [TRACE] Sent to tor: AUTHENTICATE
01/22/2020 18:21:10 [TRACE] Received from tor: 250 OK
01/22/2020 18:21:10 [TRACE] Sent to tor: SETEVENTS SIGNAL CONF_CHANGED STATUS_SERVER
01/22/2020 18:21:10 [TRACE] Received from tor: 250 OK
01/22/2020 18:21:10 [TRACE] Sent to tor: GETCONF __owningcontrollerprocess
01/22/2020 18:21:10 [TRACE] Received from tor: 250 __OwningControllerProcess=27969
01/22/2020 18:21:10 [DEBUG] GETCONF __owningcontrollerprocess (runtime: 0.0005)
01/22/2020 18:21:10 [TRACE] Sent to tor: TAKEOWNERSHIP
01/22/2020 18:21:10 [TRACE] Received from tor: 250 OK
01/22/2020 18:21:10 [TRACE] Sent to tor: RESETCONF __OwningControllerProcess
01/22/2020 18:21:10 [TRACE] Received from tor: 250 OK
01/22/2020 18:21:10 [TRACE] Received from tor:
650-CONF_CHANGED
650-__OwningControllerProcess
650 OK
01/22/2020 18:21:10 [DEBUG] RESETCONF __OwningControllerProcess (runtime: 0.0043)
01/22/2020 18:21:10 [INFO] Error while receiving a control message (SocketClosed): received exception "read of closed file"
01/22/2020 18:21:10 [INFO] Error while receiving a control message (SocketClosed): received exception "peek of closed file"
01/22/2020 18:21:10 [INFO] Error while receiving a control message (SocketClosed): received exception "peek of closed file"
01/22/2020 18:21:10 [INFO] Error while receiving a control message (SocketClosed): received exception "peek of closed file"

To prove the test worked I commented out the 'take_ownership' line and it failed after twenty seconds as expected with an assertion error.

Try that with your hardened tor instance by pointing tor_cmd at that binary, and see if it reproduces the problem. If so, this narrows the haystack. If not, then the next step is for us to puzzle out why.
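
For instance, reusing the script above, the only change needed is the tor_cmd argument (the path below is just an example of where a hardened build might live):

import stem.process

# Same launch as in the demo script, but with tor_cmd pointed at a tor binary
# that was configured with --enable-fragile-hardening (example path).
tor_process = stem.process.launch_tor_with_config(
  tor_cmd = '/home/user/tor-hardened/src/app/tor',
  config = {
    'SocksPort': '1245',
    'ControlPort': '1244',
    'DataDirectory': '/tmp/demo_test',
  },
  completion_percent = 0,
  take_ownership = True,
)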

teor2345 commented 4 years ago

Those logs come from this command-line:

python3 "$STEM_SOURCE_DIR"/run_tests.py --tor src/app/tor --integ --test process.test_take_ownership_via_controller --log TRACE --log-file stem.log https://travis-ci.org/torproject/tor/jobs/639737323/config

As far as I can see, it only runs the problematic test:

Running tests...

process.test_take_ownership_via_controller... failed (20.74s)

test_take_ownership_via_controller [FAILURE]

======================================================================

FAIL: test_take_ownership_via_controller

https://travis-ci.org/torproject/tor/builds/639737322#L3650

I'll have time to do more on this issue next week.

atagar commented 4 years ago

Yes, it's only running one test but it's still bootstrapping a copy of tor in anticipation of running all of the other tests. I can explain the log output if you'd like, but it would be simpler if we narrow the investigation to the script provided above.

teor2345 commented 4 years ago

Nick and Taylor have discovered that Tor is leaking memory due to a control refactor in master. The memory leaks result in a different exit status when tor is built with --enable-fragile-hardening: https://trac.torproject.org/projects/tor/ticket/33039

Here are two stem changes that will help us diagnose issues like this faster in future:

At the moment, stem just hangs when poll() returns an unexpected exit status. Can you please make stem raise an exception instead? If stem raises an exception, we can discover these kinds of failures quickly in the test logs, rather than having to reproduce each error in an isolated test case.

It would also be helpful if stem logged tor's stderr output to a file. Address sanitiser errors are always output on stderr. Tor crash backtraces are also usually output on stderr.
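
Something along these lines is what I have in mind (only a sketch: it assumes the test keeps the subprocess.Popen handle for tor, and that tor's stderr was redirected into a file when it was launched; the helper name and stderr_path are placeholders, not stem's API):

import time

def wait_for_clean_exit(tor_process, stderr_path, timeout = 20.0):
  # Wait for tor to exit cleanly, but raise immediately if it exits with an
  # unexpected status (for example a sanitiser abort under
  # --enable-fragile-hardening), including tor's stderr in the message.
  start = time.time()

  while time.time() - start < timeout:
    exit_status = tor_process.poll()

    if exit_status == 0:
      return  # clean shutdown, as expected

    if exit_status is not None:
      with open(stderr_path) as stderr_file:
        raise AssertionError(
          'tor exited with unexpected status %i, stderr follows:\n%s'
          % (exit_status, stderr_file.read()))

    time.sleep(0.01)

  raise AssertionError("tor didn't quit after the controller that owned it disconnected")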

nmathewson commented 4 years ago

Nick and Taylor have discovered that Tor is leaking memory due to a control refactor in master. The memory leaks result in a different exit status when tor is built with --enable-fragile-hardening: https://trac.torproject.org/projects/tor/ticket/33039

Here are two stem changes that will help us diagnose issues like this faster in future:

At the moment, stem just hangs when poll() returns an unexpected exit status. Can you please make stem raise an exception instead? If stem raises an exception, we can discover these kinds of failures quickly in the test logs, rather than having to reproduce each error in an isolated test case.

I have a PR for this part, for your consideration: See #54.

atagar commented 4 years ago

Thanks Nick! Great catch, merged your patch. As for stderr, we do emit that when tor crashes...

https://trac.torproject.org/projects/tor/ticket/27717 https://gitweb.torproject.org/stem.git/commit/?id=0b734b6

However, this only concerns the main tor process we test against. Our integ/process.py spawns separate tor instances to exercise startup/exit behavior. Their asynchronicity also complicates this (these tests each run in subprocesses). Doable if necessary, but between the significant effort and tiny scope I'd advise against it for now.