relaycorp / relaynet-gateway-android

Awala Gateway for Android
https://relaynet.network/users
GNU General Public License v3.0
7 stars 0 forks source link

Incoming messages are not received until the app is killed #724

Closed gnarea closed 7 months ago

gnarea commented 9 months ago

Describe the bug

This app is constantly failing to collect parcels from its Internet gateway, until we kill it or switch networks. There are no issues on the server as far as I can see.

We can reliably reproduce this within a few minutes of using Letro. It's basically #711 and #717 with a twist: Most of the time, the Awala app reports it is connected to the Internet (this is new since those two issues).

How to reproduce

  1. Install the latest version of Letro from source on two phones.
  2. Create an account on each phone.
  3. Pair the two users.
  4. Send messages to each other.

At some point between steps 2 and 4, things will get stuck due to this issue.

Additional context

I typically test with a Pixel 2 (Android 11) and a Pixel 5 (Android 14), and this happens far more often on the Pixel 2.

sdsantos commented 9 months ago

I am able to reproduce the issue quickly. What I've seen so far is:

Possible issues:

  1. The InternetGateway is not honouring the keepAlive=true and only delivers the parcels it has once it starts
  2. The PoWebClient or the CollectParcels code for receiving parcels has some bug that prevents from getting more than 1 parcel
  3. We have a web socket connection issue that makes it fail very quickly and never recover, even with our retries

I'll investigate more of 2 next week, and if that's not clear I'll see if I can check 3 too. @gnarea if you could double check 1 it would be help too.

gnarea commented 9 months ago

Thanks for looking into this Sergio!

I changed the code to trigger collect parcels with keepAlive=false (PublicSync.syncOneOff) and we get parcels fine.

Just to clarify, do you mean that you changed that today in a local branch (not the remote master)? If so, did that appear to solve the problem consistently?

The InternetGateway is not honouring the keepAlive=true and only delivers the parcels it has once it starts

If forcing keepAlive=false causes parcels to be retrieved reliably, then it does indeed sound like a bug on the Internet gateway. The weird thing then is that it does appear to happen a lot more often with the older phones (Nexus 5X/Android 8 and Pixel 2/Android 11), than the newer ones (Pixel 5 and 8 with Android 14).

sdsantos commented 9 months ago

Just to clarify, do you mean that you changed that today in a local branch (not the remote master)? If so, did that appear to solve the problem consistently?

It's something I tried on a local branch. Just a simple global scope:

while (true) {
  PublicSync.syncOneOff()
  delay(10.seconds)
}

And that seemed to catch all parcels.

gnarea commented 9 months ago

Right, then that definitely sounds like an issue in the Internet gateway.

In that case, please leave this issue with me. I'm planning to do a major upgrade to the Frankfurt gateway in the second week of January, and it's pretty likely to fix this issue (I'm basically replacing the messaging queue server, as it's causing many issues).

I'll close this issue then if it's fixed by the infrastructure update, or ping you if it hasn't been resolved.

gnarea commented 8 months ago

I'm testing the latest major version of the Internet gateway and I'm not seeing this issue, so I think that confirms our hypothesis.

(I'm not releasing this version yet because I found a different, major issue that I have to fix first)

gnarea commented 8 months ago

@sdsantos, I'm afraid I spoke too soon. I'm still experiencing this issue frequently with the new Internet gateway (belgium.relaycorp.services), which isn't susceptible to the issues we experienced in the old environment in Frankfurt (messages would often get stuck in the old environment).

I've checked the server logs and it all looks good to me. Messages are delivered quickly internally -- The private gateway is just not collecting parcels after a while, even though I can't see anything interesting in the logs and the app says we're connected to the Internet.

Happy to jump on a call to debug this in real-time.

sdsantos commented 8 months ago

I've tested with the new Internet gateway, the Ping app and my phone, and after 15 minutes of using the app everything I'm still getting Pongs promptly. Could there be other factors at play? Do you close the Letro/Ping app? Does the phone go into sleep?

gnarea commented 8 months ago

Thanks for looking into this!

Do you close the Letro/Ping app? Does the phone go into sleep?

Yes to both questions.

BTW, happens fairly often on Android 14 (Pixel 5 and 8).

sdsantos commented 7 months ago

I now can reproduce the issue after putting the phone to sleep for a couple of minutes (turning off the screen). The gateway receives the Pong without issue and notifies the Ping app (context.sendBroadcast), but the Ping app does not receive the Pong. Will investigate further...

(Now I'm not getting Pongs back, maybe I overworked the Ping server? 😛)

gnarea commented 7 months ago

(Now I'm not getting Pongs back, maybe I overworked the Ping server? 😛)

The Internet gateway has been having intermittent issues for the past few days, which I've escalated to GCP. I don't think it's been caused by what you've been doing but it'd certainly affect you. Can you please try again?

Here's the status page if you'd like to know if an issue is due to an outage: https://stats.uptimerobot.com/6GD9MhoLpZ

sdsantos commented 7 months ago

More information:

Process: tech.relaycorp.ping, PID: 5650
tech.relaycorp.relaynet.bindings.pdc.ServerConnectionException: Server is unreachable
    at tech.relaycorp.poweb.PoWebClient.wsConnect$poweb(PoWebClient.kt:306)
    at tech.relaycorp.poweb.PoWebClient$wsConnect$1.invokeSuspend(Unknown Source:15)
    at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
    at io.ktor.util.pipeline.SuspendFunctionGun.resumeRootWith(SuspendFunctionGun.kt:194)
    at io.ktor.util.pipeline.SuspendFunctionGun.access$resumeRootWith(SuspendFunctionGun.kt:15)
    at io.ktor.util.pipeline.SuspendFunctionGun$continuation$1.resumeWith(SuspendFunctionGun.kt:89)
    at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:46)
    at io.ktor.util.pipeline.SuspendFunctionGun.resumeRootWith(SuspendFunctionGun.kt:194)
    at io.ktor.util.pipeline.SuspendFunctionGun.access$resumeRootWith(SuspendFunctionGun.kt:15)
    at io.ktor.util.pipeline.SuspendFunctionGun$continuation$1.resumeWith(SuspendFunctionGun.kt:89)
    at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:46)
    at io.ktor.util.pipeline.SuspendFunctionGun.resumeRootWith(SuspendFunctionGun.kt:194)
    at io.ktor.util.pipeline.SuspendFunctionGun.access$resumeRootWith(SuspendFunctionGun.kt:15)
    at io.ktor.util.pipeline.SuspendFunctionGun$continuation$1.resumeWith(SuspendFunctionGun.kt:89)
    at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:46)
    at io.ktor.util.pipeline.SuspendFunctionGun.resumeRootWith(SuspendFunctionGun.kt:194)
    at io.ktor.util.pipeline.SuspendFunctionGun.access$resumeRootWith(SuspendFunctionGun.kt:15)
    at io.ktor.util.pipeline.SuspendFunctionGun$continuation$1.resumeWith(SuspendFunctionGun.kt:89)
    at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:46)
    at io.ktor.util.pipeline.SuspendFunctionGun.resumeRootWith(SuspendFunctionGun.kt:194)
    at io.ktor.util.pipeline.SuspendFunctionGun.access$resumeRootWith(SuspendFunctionGun.kt:15)
    at io.ktor.util.pipeline.SuspendFunctionGun$continuation$1.resumeWith(SuspendFunctionGun.kt:89)
    at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:46)
    at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106)
    at kotlinx.coroutines.internal.LimitedDispatcher$Worker.run(LimitedDispatcher.kt:115)
    at kotlinx.coroutines.scheduling.TaskImpl.run(Tasks.kt:103)
    at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:584)
    at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:793)
    at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:697)
    at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:684)
    Suppressed: kotlinx.coroutines.internal.DiagnosticCoroutineContextException: [StandaloneCoroutine{Cancelling}@aa63dc4, Dispatchers.IO]
Caused by: java.net.ConnectException: Failed to connect to /127.0.0.1:13276
    at okhttp3.internal.connection.RealConnection.connectSocket(RealConnection.kt:285)
    at okhttp3.internal.connection.RealConnection.connect(RealConnection.kt:195)
    at okhttp3.internal.connection.ExchangeFinder.findConnection(ExchangeFinder.kt:249)
    at okhttp3.internal.connection.ExchangeFinder.findHealthyConnection(ExchangeFinder.kt:108)
    at okhttp3.internal.connection.ExchangeFinder.find(ExchangeFinder.kt:76)
    at okhttp3.internal.connection.RealCall.initExchange$okhttp(RealCall.kt:245)
    at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.kt:32)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:100)
    at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.kt:82)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:100)
    at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.kt:83)
2024-02-12 17:25:16.282  5650-6101  AndroidRuntime          tech.relaycorp.ping                  E      at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:100)
    at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.kt:76)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:100)
    at okhttp3.internal.connection.RealCall.getResponseWithInterceptorChain$okhttp(RealCall.kt:197)
    at okhttp3.internal.connection.RealCall$AsyncCall.run(RealCall.kt:502)
    at java.util.concurrent.ThreadPoolExecutor.processTask(ThreadPoolExecutor.java:1187)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1152)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641)
    at java.lang.Thread.run(Thread.java:784)
Caused by: java.net.ConnectException: failed to connect to /127.0.0.1 (port 13276) from /127.0.0.1 (port 38428) after 10000ms: isConnected failed: ECONNREFUSED (Connection refused)
    at libcore.io.IoBridge.isConnected(IoBridge.java:278)
    at libcore.io.IoBridge.connectErrno(IoBridge.java:187)
    at libcore.io.IoBridge.connect(IoBridge.java:129)
    at java.net.PlainSocketImpl.socketConnect(PlainSocketImpl.java:137)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:391)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:231)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:213)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:436)
    at java.net.Socket.connect(Socket.java:621)
    at okhttp3.internal.platform.AndroidPlatform.connectSocket(AndroidPlatform.kt:62)
    at okhttp3.internal.connection.RealConnection.connectSocket(RealConnection.kt:283)
    ... 19 more
Caused by: android.system.ErrnoException: isConnected failed: ECONNREFUSED (Connection refused)
    at libcore.io.IoBridge.isConnected(IoBridge.java:267)
    ... 29 more

I feel we should be stopping the Public Sync as the Ping app goes into the background. I don't think Android likes apps in the background with internet connections and that's probably why we can no longer connect to the Public Gateway. We should releasing the GatewaySyncService, and re-binding on resume. That crash also needs getting looked into.

gnarea commented 7 months ago

That behaviour sounds familiar from my experience using Letro.

I feel we should be stopping the Public Sync as the Ping app goes into the background. I don't think Android likes apps in the background with internet connections and that's probably why we can no longer connect to the Public Gateway. We should releasing the GatewaySyncService, and re-binding on resume

That makes sense. Any undesirable side effects of doing that?

That crash also needs getting looked into

Agreed.

sdsantos commented 7 months ago

I feel we should be stopping the Public Sync as the Ping app goes into the background. I don't think Android likes apps in the background with internet connections and that's probably why we can no longer connect to the Public Gateway. We should releasing the GatewaySyncService, and re-binding on resume

That makes sense. Any undesirable side effects of doing that?

In theory, it's nice to keep the sync going a bit longer, to catch some parcels that might come in. But we shouldn't be doing it.

gnarea commented 7 months ago

In theory, it's nice to keep the sync going a bit longer, to catch some parcels that might come in. But we shouldn't be doing it.

Gotcha. Yeah, that sounds like an acceptable compromise.

What about the user switching between two Awala-apps in quick succession? For example, by using the "task bar" that allows you to switch to switch to one of the most recently-used apps (don't know the proper name for it), or by manually closing an app and starting the other.

Could this lead to concurrency issues? As in, the service is shutting down (and the HTTP server terminating) whilst the second app is opened and requests the service to be started.

sdsantos commented 7 months ago

In theory, it's nice to keep the sync going a bit longer, to catch some parcels that might come in. But we shouldn't be doing it.

Gotcha. Yeah, that sounds like an acceptable compromise.

What about the user switching between two Awala-apps in quick succession? For example, by using the "task bar" that allows you to switch to switch to one of the most recently-used apps (don't know the proper name for it), or by manually closing an app and starting the other.

Could this lead to concurrency issues? As in, the service is shutting down (and the HTTP server terminating) whilst the second app is opened and requests the service to be started.

It's the Gateway that manages when to start and stop the Public Sync. I guess if we want to minimize server restarts we can have a grace period of 30 seconds before shutting down the Public Sync in the Gateway app.

gnarea commented 7 months ago

I think that part is OK to shutdown immediately or shortly after the Awala-compatible app is in the background.

I was mainly thinking of the PoWeb server we run on 127.0.0.1. Presumably we'd have to shut that one down too when the app is sent to the background, so we'd need to ensure that any restart is done sequentially.

sdsantos commented 7 months ago

I think that part is OK to shutdown immediately or shortly after the Awala-compatible app is in the background.

I was mainly thinking of the PoWeb server we run on 127.0.0.1. Presumably we'd have to shut that one down too when the app is sent to the background, so we'd need to ensure that any restart is done sequentially.

I do know that if multiple apps bind to the GatewaySyncService, it stays running until all apps unbind. But now that we have 2 apps I'll test afterwards if there's any unwanted side-effect from switching between apps.

gnarea commented 7 months ago

Cool, thanks!

sdsantos commented 7 months ago

So, the Ping app needs to be changed to something like this:

    protected open fun bindToGateway() {
        ProcessLifecycleOwner.get().lifecycle.addObserver(object : DefaultLifecycleObserver {
            override fun onStart(owner: LifecycleOwner) {
                lifecycleScope.launch {
                    try {
                        GatewayClient.bind()
                        bootstrapData.bootstrapIfNeeded()
                    } catch (exp: GatewayBindingException) {
                        logger.log(Level.WARNING, "Gateway binding exception", exp)
                        openNoGateway()
                    }
                }
            }

            override fun onStop(owner: LifecycleOwner) {
                GatewayClient.unbind()
            }
        })
    }

I'm not sure this is something we want to incorporate in the SDK, since we have been leaving this up to developers. It would be something like GatewayClient.bindAutomaticallyOnAppStart(), but maybe it's not a priority.


The Gateway app also needs changes. We are always checking for connection state, even if the public sync is not running and the app is in the background. So I changed it to only check for Internet and Awala access if those conditions apply.


Both those changes fix the underlying issue, messages are received correctly. But now I'm facing the Netty crash occasionally, the one I had seen on https://github.com/relaycorp/relaynet-gateway-android/pull/727. I'm investigating how to work around that issue.

gnarea commented 7 months ago

Thanks for the update.

I'm not sure this is something we want to incorporate in the SDK, since we have been leaving this up to developers. It would be something like GatewayClient.bindAutomaticallyOnAppStart(), but maybe it's not a priority.

What's the downside to implementing it in the SDK? Would it prevent developers from doing something more advanced that they may want to do?

In principle, although it's not too much code, it isn't trivial and every app would need to do that anyway, so I'd rather implement it in the SDK unless that'd cause other issues.

Having said this, [ProcessLifecycleOwner](https://developer.android.com/reference/android/arch/lifecycle/ProcessLifecycleOwner is marked as deprecated), so maybe if the new approach is trivial we could keep it outside the SDK.

The Gateway app also needs changes. We are always checking for connection state, even if the public sync is not running and the app is in the background. So I changed it to only check for Internet and Awala access if those conditions apply.

I'm not 100% sure I understood what was changed, so just to double check: are you talking about the bit that makes a HEAD request to the PoWeb endpoint of Internet gateway and, failing that, requests HEAD https://google.com? If so, what exactly was changed?

sdsantos commented 7 months ago

Thanks for the update.

I'm not sure this is something we want to incorporate in the SDK, since we have been leaving this up to developers. It would be something like GatewayClient.bindAutomaticallyOnAppStart(), but maybe it's not a priority.

What's the downside to implementing it in the SDK? Would it prevent developers from doing something more advanced that they may want to do?

In principle, although it's not too much code, it isn't trivial and every app would need to do that anyway, so I'd rather implement it in the SDK unless that'd cause other issues.

The only downside I can think of is having 2 ways to bind to the client. Either manually with bind/unbind, or have it happen automatically with that autoBind.

Having said this, [ProcessLifecycleOwner](https://developer.android.com/reference/android/arch/lifecycle/ProcessLifecycleOwner is marked as deprecated), so maybe if the new approach is trivial we could keep it outside the SDK.

Here's the new version I'm using: androidx.lifecycle.ProcessLifecycleOwner

The Gateway app also needs changes. We are always checking for connection state, even if the public sync is not running and the app is in the background. So I changed it to only check for Internet and Awala access if those conditions apply.

I'm not 100% sure I understood what was changed, so just to double check: are you talking about the bit that makes a HEAD request to the PoWeb endpoint of Internet gateway and, failing that, requests HEAD https://google.com? If so, what exactly was changed?

Yup, that's the bit. The rule we had is: if the Gateway app is on the foreground, or the Ping/Letro app is bound to the Gateway app, and we can connect to the Public Gateway, then we start Public Sync. But we were checking those 3 rules always, meaning we were doing Pings even if no Awala app was in the foreground and we would never start the Public Sync.

What I want to change is to just start Pinging if the Gateway app is on the foreground, or the Ping/Letro app is bound to the Gateway app. Because why are we pinging then? And that's what was leading Android to shutdown our network connections, us doing background network calls with no app being in use (screen turned off).

gnarea commented 7 months ago

The only downside I can think of is having 2 ways to bind to the client. Either manually with bind/unbind, or have it happen automatically with that autoBind.

Gotcha 👌🏾

Since both Awala Ping and Letro would need this, we could only offer the autoBind option, and consider adding a different option if folks ask for it in the future.

Here's the new version I'm using: androidx.lifecycle.ProcessLifecycleOwner

Oh, I see. Same name, different class/package. 👍🏾

I'm not 100% sure I understood what was changed, so just to double check: are you talking about the bit that makes a HEAD request to the PoWeb endpoint of Internet gateway and, failing that, requests HEAD https://google.com? If so, what exactly was changed?

Yup, that's the bit. The rule we had is: if the Gateway app is on the foreground, or the Ping/Letro app is bound to the Gateway app, and we can connect to the Public Gateway, then we start Public Sync. But we were checking those 3 rules always, meaning we were doing Pings even if no Awala app was in the foreground and we would never start the Public Sync.

What I want to change is to just start Pinging if the Gateway app is on the foreground, or the Ping/Letro app is bound to the Gateway app. Because why are we pinging then? And that's what was leading Android to shutdown our network connections, us doing background network calls with no app being in use (screen turned off).

Ah, makes sense! Yeah, let's proceed with that change, please. 👍🏾

sdsantos commented 7 months ago

Finally I was able to fix the first part of the issue. Now to fix the endpoint side.

I'm thinking on how to implement the bindAutomatically() method. Before, we detected the gateway only when binding, throwing an exception if it's missing. But if we use bindAutomatically(), we would only figure out if we're missing the gateway when the app goes into the foreground. a) Should bindAutomatically return a stream of possible errors? b) Should we have a separate stream for client binding errors? c) Should I figure out a way to "query" for the service without binding, so we can return an error straight away? (Doesn't save us from the fact the Gateway app could be uninstalled in the meantime, which the endpoint app was running)

sdsantos commented 7 months ago

I went with the approach of having lambdas on the bindAutomatically call, that let the app developers react to each event: https://github.com/relaycorp/awala-endpoint-android/pull/372

gnarea commented 7 months ago

Thanks Sergio! That makes sense. I've just approved both PRs!

gnarea commented 7 months ago

@sdsantos, closing this now that we've merged the PRs for the gateway and Ping, but feel free to reopen if there's something left to do.

sdsantos commented 7 months ago

@gnarea nothing more. Let's just keep an eye on things from now on.