meejah / txtorcon

Twisted-based asynchronous Tor control protocol implementation. Includes unit-tests, examples, state-tracking code and configuration abstraction.
http://fjblvrw2jrxnhtg67qpbzi45r7ofojaoo3orzykesly2j3c2m3htapid.onion/
MIT License

race condition on slow disk #349

Open cariaso opened 3 years ago

cariaso commented 3 years ago

https://github.com/magic-wormhole/magic-wormhole uses txtorcon. If I run wormhole receive 3-some-code --tor --launch-tor, it calls into txtorcon.

However, in my current environment it crashes quickly, 100% of the time, with the message:

launching a new Tor process, this may take a while..
 Unhandled Error
 Traceback (most recent call last):
 Failure: twisted.internet.error.ConnectError: An error occurred while connecting: 2: No such file or directory.

However, if I add a time.sleep(0.5) after the spawnProcess call at txtorcon/controller.py line 360:

    transport = reactor.spawnProcess(
        process_protocol,
        tor_binary,
        args=args,
        env={'HOME': data_directory},
        path=data_directory if os.path.exists(data_directory) else None,  # XXX error if it doesn't exist?
    )

The problem goes away. (Smaller sleeps also seem to work, but I have not measured the exact threshold.) I expect this is somehow related to the fact that I'm running off networked storage.

Can anyone offer deeper insight into this, and perhaps a suitable solution?

meejah commented 3 years ago

Hmmm!

Very interesting ... from the error I assume this is an error while trying to connect to a unix-based control socket. By "networked storage" you mean NFS or ...? (I have no idea how unix-sockets might work on such storage ;) )
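
For reference, the "2: No such file or directory" in the traceback matches errno 2 (ENOENT), which is what a connect() to a Unix socket path returns when nothing exists at that path yet. The following is a minimal standard-library sketch of that behaviour, not txtorcon code; the socket file name control.sock and the helper try_connect are made up for illustration.

```python
import errno
import os
import socket
import tempfile


def try_connect(path):
    """Try one connection to a Unix socket.

    Returns None on success, or the errno of the failure.
    """
    client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        client.connect(path)
        return None
    except OSError as e:
        return e.errno
    finally:
        client.close()


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical socket name, standing in for Tor's control socket.
    path = os.path.join(tmp, "control.sock")

    # Nothing has bound the path yet, so the connect fails.
    before = try_connect(path)

    # Once a listener has bound the path, the same connect succeeds.
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(path)
    server.listen(1)
    after = try_connect(path)
    server.close()

print("failed with ENOENT before bind:", before == errno.ENOENT)
print("errno after bind:", after)
```

So a client that connects even slightly before the socket file is visible on disk would see exactly this error, which is consistent with the sleep making the problem disappear.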

cariaso commented 3 years ago

> I assume this is an error while trying to connect to a unix-based control socket.

yes

> By "networked storage" you mean NFS or ...?

AWS EBS gp3 is mounted as the storage for a Docker container.

It may sound complicated, but it works surprisingly well. Across many applications, this is the first issue I've encountered.

https://github.com/cariaso/txtorcon/commits/main has been sufficient for my needs.

meejah commented 3 years ago

Obviously a delay isn't ever going to be the right thing (and, for Twisted code, time.sleep(...) is definitely not the right way to delay).

So, I think what's really going on here is this: when Tor is launched, it takes some amount of time until we can connect to the control socket. Currently, that is determined by watching Tor's logs (e.g. https://github.com/meejah/txtorcon/blob/main/txtorcon/controller.py#L1280 looks for the "Opening control ..." line).

I suspect what's happening is that on your "slow" disk, Tor is writing the control socket, printing that line to stdout, but the actual file hasn't been sync'd (or whatever) yet? So then immediately after that, txtorcon tries to connect, but there's no socket.

So I can think of two "more proper" fixes:

The latter will make things more robust, but it might also fail slightly more slowly in some cases (oh well). I like that the latter doesn't need any special-case code (e.g. "is it a unix-socket?", parse file, etc, etc).
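
To illustrate the general idea of tolerating a window in which the socket has been announced but is not yet connectable, here is a rough sketch of a bounded, non-blocking retry. It is not txtorcon code and is not necessarily either of the fixes mentioned above. It uses asyncio from the standard library rather than Twisted so that it runs without extra dependencies; in Twisted the same shape would use Deferreds and a reactor-scheduled delay instead of asyncio.sleep. The names connect_with_retry, attempts, delay and control.sock are all assumptions made for the example.

```python
import asyncio
import os
import tempfile


async def connect_with_retry(path, attempts=20, delay=0.05):
    """Open a Unix-socket connection, retrying while it is not yet connectable."""
    for attempt in range(attempts):
        try:
            return await asyncio.open_unix_connection(path)
        except (FileNotFoundError, ConnectionRefusedError):
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            # Yields to the event loop, unlike a blocking time.sleep().
            await asyncio.sleep(delay)


async def demo():
    """Simulate a server whose socket appears only after a short delay."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "control.sock")  # hypothetical name

        async def handle(reader, writer):
            writer.close()

        async def late_server():
            await asyncio.sleep(0.2)  # stands in for the "slow disk"
            return await asyncio.start_unix_server(handle, path=path)

        server_task = asyncio.create_task(late_server())
        reader, writer = await connect_with_retry(path)
        writer.close()
        server = await server_task
        server.close()
        return True


connected = asyncio.run(demo())
print("connected after retrying:", connected)
```

The retry budget here (20 attempts, 50 ms apart) is arbitrary. A bounded budget means a genuinely missing socket still fails, only a little later than it would without retries.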