guardianproject / tor-android

Tor binary and library for Android
https://torproject.org
Other
214 stars 61 forks source link

stuck on "waiting for lock" #67

Closed n8fr8 closed 2 years ago

n8fr8 commented 2 years ago

tor 0.4.6.8

integrated with Orbot, on various cycles of start/stop, I can generate this error:

2021-12-21 12:48:02.856 7322-7622/org.torproject.android W/System.err: java.io.IOException: /data/user/0/org.torproject.android/app_TorService/data/ControlSocket open: Connection refused 2021-12-21 12:48:02.856 7322-7622/org.torproject.android W/System.err: at org.torproject.jni.TorService.prepareFileDescriptor(Native Method) 2021-12-21 12:48:02.856 7322-7622/org.torproject.android W/System.err: at org.torproject.jni.TorService.access$300(TorService.java:43) 2021-12-21 12:48:02.856 7322-7622/org.torproject.android W/System.err: at org.torproject.jni.TorService$2.run(TorService.java:267)

after that happens, I can no longer start tor, because it doesn't proceed past this error: 2021-12-21 12:49:09.738 7322-7718/org.torproject.android I/TorService: Waiting for lock

even if unbind, and stop the OrbotService it still happens. Only a force stop of the app resolved the issue.

eighthave commented 2 years ago

@bitmold so is this an issue with OnionService?

bitmold commented 2 years ago

I assume you mean OrbotService, and I'm not sure yet, looking into this now...

n8fr8 commented 2 years ago

This is still happening with Orbot testing. The lock is not released sometimes, after multiple stops and starts.

We are using "unbindService(torServiceConnection);" to disconnect and stop Tor.

eighthave commented 2 years ago

@n8fr8 @bitmold can you provide steps to reproduce?

@akwizgran @grote any ideas here? it seems related to the lock that was added in #59

akwizgran commented 2 years ago

Looking for places where the thread started by startTorServiceThread() could get stuck while holding the lock, one candidate is the call to controlPortThreadStarted.await(). If controlPortThreadStarted.countDown() is not called for some reason then the thread started by startTorServiceThread() will wait here indefinitely, even if the service instance has been destroyed in the meantime and replaced by a new instance that wants to acquire the lock.

However, I can't see any obvious reasons why controlPortThreadStarted.countDown() would fail to be called. All the control port thread does before counting down the latch is setting the thread priority and starting to watch a file observer. Could either of those throw a RuntimeException that would kill the control port thread before it counts down the latch?

Looking for other candidates that could cause the thread started by startTorServiceThread() to get stuck while holding the lock, the only other one I can see would be runMain() failing to return. This could happen if for some reason onDestroy() wasn't called when unbinding the service (service manually started as well as bound?), or if torControlConnection was null when onDestroy() was called, so that shutdownTor() would fail to call torControlConnection.shutdownTor().

torControlConnection is initialised by the control port thread, after it counts down controlPortThreadStarted latch and does some other potentially exception-throwing stuff. So I have two theories to suggest:

  1. The control port thread counts down the latch, then the service is destroyed before the control port thread initialises torControlConnection (race condition, maybe exacerbated by torControlConnection not being volatile)
  2. The control port thread counts down the latch, then dies for some reason before initialising torControlConnection (seems like the reason here would have to be related to opening the control socket, as that's what the control port thread does between counting down the latch and initialising torControlConnection)

In either case, shutdownTor() finds that torControlConnection is null, so it doesn't call torControlConnection.shutdownTor(). Tor isn't told to shut down, so runMain() doesn't return. The lock remains held and the new service instance waits forever to acquire it.

akwizgran commented 2 years ago

To test these theories I'd suggest the following:

If the problem still happens then my theories were wrong but we made some harmless clean-ups. :-)

bitmold commented 2 years ago

Ah ha, I had forgotten about this issue on the other thread in Orbot.

There are crude steps to reproduce this issue here: https://github.com/guardianproject/orbot/issues/621 ~ But for what it's worth this still occurs in Orbot

bitmold commented 2 years ago

Even though @n8fr8 reported it in his stack trace, it was a bit surprising to me to see that Java_org_torproject_jni_TorService_prepareFileDescriptor will, in native code throw an IOException

// ...
  jclass io_exception = (*env)->FindClass(env, "java/io/IOException");
  if (io_exception == NULL) {
    return NULL;
  }
  if (fd < 0 || connect(fd, (struct sockaddr *) &addr, sizeof(addr)) == -1) {
    char buf[1024];
    snprintf(buf, 1023, "%s open: %s", filename, strerror(errno));
    (*env)->ThrowNew(env, io_exception, buf);
    return NULL;
  }
// ...

It seems like when this is caught in java TorService stops itself with runLock locked

            } catch (IOException | ArrayIndexOutOfBoundsException | InterruptedException e) {
                e.printStackTrace();
                broadcastStatus(TorService.this, STATUS_STOPPING);
                TorService.this.stopSelf();
            }

Maybe we shouldn't call controlPortThreadStarted.countDown() until after the native call to prepareFileDescriptor. I'm going to look into this line of thinking tomorrow...

bitmold commented 2 years ago

@akwizgran sadly your insights weren't immediately helpful in fixing this bug, I appreciate them though they have definitely been helpful in figuring out this problem

eighthave commented 2 years ago

How about something like:


                } catch (IllegalStateException | IllegalArgumentException | InterruptedException e) {
                    e.printStackTrace();
                    broadcastError(context, e);
                    broadcastStatus(context, STATUS_STOPPING);
                    TorService.this.stopSelf();
                    if (runLock.isLocked()) {
                        Log.i(TAG, "Releasing lock");
                        runLock.unlock();
                    }
                }
eighthave commented 2 years ago

nevermind, I was looking at old code. Is there repro instructions somewhere? Ideally, there would be a test case that reproduces it, that should make it easy to fix.

eighthave commented 2 years ago

I agree with @akwizgran's ideas in https://github.com/guardianproject/tor-android/issues/67#issuecomment-1043258065 Here is a diff that implements 1 and 2. @bitmold did you try those? I think 3 would also be worth trying.

--- a/tor-android-binary/src/main/java/org/torproject/jni/TorService.java
+++ b/tor-android-binary/src/main/java/org/torproject/jni/TorService.java
@@ -171,7 +171,7 @@ public class TorService extends Service {
     private long torConfiguration = -1;
     private int torControlFd = -1;

-    private TorControlConnection torControlConnection;
+    private volatile TorControlConnection torControlConnection;

     /**
      * This lock must be acquired before calling createTorConfiguration() and
@@ -261,7 +261,6 @@ public class TorService extends Service {
                     }
                 };
                 controlPortFileObserver.startWatching();
-                controlPortThreadStarted.countDown();
                 countDownLatch.await(10, TimeUnit.SECONDS);
                 controlPortFileObserver.stopWatching();
                 File controlSocket = new File(observeDir, CONTROL_SOCKET_NAME);
@@ -281,6 +280,8 @@ public class TorService extends Service {
                 socksPort = getPortFromGetInfo("net/listeners/socks");
                 httpTunnelPort = getPortFromGetInfo("net/listeners/httptunnel");

+                controlPortThreadStarted.countDown();
+
             } catch (IOException | ArrayIndexOutOfBoundsException | InterruptedException e) {
                 e.printStackTrace();
                 broadcastStatus(TorService.this, STATUS_STOPPING);
@@ -289,7 +290,7 @@ public class TorService extends Service {
         }
     };

-    private CountDownLatch controlPortThreadStarted;
+    private volatile CountDownLatch controlPortThreadStarted;

     private int getPortFromGetInfo(String key) {
         final String value = getInfo(key);
bitmold commented 2 years ago

I did try those, to no avail.

It seems like you could move controlPortThreadStarted.countDonw() down to after httpTunnelPort's assignment, but that actually won't work. The reason is that in the thread named "tor" the tor binary won't actually run until controlPortThreadStarted is done waiting. Since controlPortThreadStarted in your example above only counts down after the control port is authenticated, events are set, etc and this only happens after tor is up and running.

I'm currently investigating a solution with two CountDownLatch's. One for starting the control port thread and basically waiting for the ContorlSocket file to be created (which means you can run the tor binary program), and a second one after httpTunnelPort is assigned. When both of these latches are over and done with, perhaps then we can unlock runLock...

eighthave commented 2 years ago

sounds like a good idea. If you implemented @akwizgran's #3 idea, could you share it with me? It seems that it would be good to have in TorService anyway.