relaycorp / relaynet-gateway-android

Awala Gateway for Android
https://relaynet.network/users
GNU General Public License v3.0
9 stars 0 forks source link

App frequently disconnects from the Internet gateway #711

Closed gnarea closed 11 months ago

gnarea commented 12 months ago

We experience this too often whilst testing Letro: Outgoing messages are always sent, but incoming messages are not received.

When this happens and we check the Awala app, it normally says that we're disconnected from the Internet, but it's not uncommon to see that we're connected to Awala via the Internet. Either way, the logs don't show anything relevant.

I suspect the PoWeb library may be the culprit, but we're not seeing its logs due to #709.

sdsantos commented 11 months ago

Before I start my investigation, are there any clues regarding where the chain is broken?

Regarding the connection state when you open the Awala app, it may be a separate issue or not, but I'll investigate too.

gnarea commented 11 months ago

Thanks for looking into this, @sdsantos!

When this has happened to me, I can see that the Internet gateway (frankfurt.relaycorp.cloud) receives the parcels that should be collected by the Android gateway. However, I don't see any logs corresponding to the Android gateway attempting to collect parcels... Until I kill Awala and Letro, and open Letro again -- then it works as expected.

BTW, I forgot to mention this before, but at least when it happens to me, it tends to be more frequent when my connectivity is patchy.

Also, parcel delivery doesn't seem to be problematic in any way, as far as I remember when I've checked the logs. It's only parcel collection (which is why I wanted to get the PoWeb logs working).

sdsantos commented 11 months ago

I'll review the code for the Public Sync. It seems that the Gateway app is giving up on collecting public parcels, due to some unknown failure, and doesn't recover by itself.

sdsantos commented 11 months ago

Regarding opening the Gateway app and seeing disconnected, on my physical device it can take up around 3 seconds to assess the connection with the public gateway. Half the time is for resolving the PoWeb address and half for the HEAD request. Subsequent checks average 100ms.

(We could have a "Checking" state so this wouldn't scare users opening the app)

But if the Letro app is already running, the Awala app should already be doing this check in the background by itself. So, opening the Awala app when Letro is running should not be showing the Disconnect screen.

sdsantos commented 11 months ago

Public Sync Details

Public Sync Resilience

The PoWebClient restarts the web-socket connection with the public gateway if it failed due to "Internal Error (1011)". It restarts using method recurrence, which is probably not ideal, because it will fail eventually, in case of too many Internal Errors. But that would probably be a noticeable crash.

If the web-socket connection fails due to another reason, then we don't restart the Public Sync, unless there's a change in the state of the internet/public-gateway connection, change in the Gateway app being foreground or background, or an app unbinds and rebinds to the Gateway app. If this is the issue, then instead of killing Letro + Awala apps, turning off and on the internet connection may be enough to restart the Public Sync. Nevertheless, we may want to add another layer of resilience here, or make the restart work for other errors besides Internal Error.

gnarea commented 11 months ago

it can take up around 3 seconds to assess the connection with the public gateway

I haven't experienced that (it's usually a fraction of a second for me), but I can definitely see that happening, with the DNS query over HTTPS and subsequent HTTP request.

(We could have a "Checking" state so this wouldn't scare users opening the app)

Agreed. Created #712.

It restarts using method recurrence, which is probably not ideal, because it will fail eventually, in case of too many Internal Errors

Are you talking about the repeat(Int.MAX_VALUE) in wsConnect()? Yeah, I don't like it either, but I wasn't sure how to implement it in a more reliable and idiomatic manner (suggestions welcomed).

If this is the issue, then instead of killing Letro + Awala apps, turning off and on the internet connection may be enough to restart the Public Sync

Interesting. Will try that next time it happens.

Nevertheless, we may want to add another layer of resilience here, or make the restart work for other errors besides Internal Error

Yeah, I suspect that's the culprit. But we already handle EOFExceptions, IOExceptions and their subclasses, and because the app isn't crashing, I guess all exceptions are handled -- the question is whether they're handled correctly.

sdsantos commented 11 months ago

Are you talking about the repeat(Int.MAX_VALUE) in wsConnect()? Yeah, I don't like it either, but I wasn't sure how to implement it in a more reliable and idiomatic manner (suggestions welcomed).

My mistake, I was looking at the old version of the PoWebClient that the Gateway app is still using in the letro-debugging branch, and it had method recurrence there. Now it's that infinite loop, which is a better option at least. I'm going to double check the latest Gateway app is using the right version of the PoWebClient and continue investigating.

sdsantos commented 11 months ago

For the version of the PoWebClient that Gateway app is currently using, the scenarios that can stop a sync in a way that only a network change will resume are:

I feel we should still try to restart the connection for those scenarios, even if using a longer delay like 10 seconds.


I'm going to try the Letro + Gateway apps using the new PoWebClient logs, to see if I spot one of the scenarios above.

sdsantos commented 11 months ago

I was able to reproduce a disconnection that wasn't recovered. The Gateway log:

Could not collect parcels due to server error
tech.relaycorp.relaynet.bindings.pdc.ServerConnectionException: Server is unreachable
    at tech.relaycorp.poweb.PoWebClient.wsConnect$poweb(PoWebClient.kt:323)
    at tech.relaycorp.poweb.PoWebClient$wsConnect$1.invokeSuspend(Unknown Source:19)
    at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
    at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106)
    at kotlinx.coroutines.internal.LimitedDispatcher$Worker.run(LimitedDispatcher.kt:115)
    at kotlinx.coroutines.scheduling.TaskImpl.run(Tasks.kt:103)
    at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:584)
    at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:793)
    at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:697)
    at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:684)
Caused by: java.net.SocketTimeoutException: sent ping but didn't receive pong within 5000ms (after 3 successful ping/pongs)
    at okhttp3.internal.ws.RealWebSocket.writePingFrame$okhttp(RealWebSocket.kt:563)
    at okhttp3.internal.ws.RealWebSocket$initReaderAndWriter$$inlined$synchronized$lambda$1.runOnce(TaskQueue.kt:219)
    at okhttp3.internal.concurrent.TaskRunner.runTask(TaskRunner.kt:116)
    at okhttp3.internal.concurrent.TaskRunner.access$runTask(TaskRunner.kt:42)
    at okhttp3.internal.concurrent.TaskRunner$runnable$1.run(TaskRunner.kt:65)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641)
    at java.lang.Thread.run(Thread.java:919)

Turning off and turning back on the Wifi solved the issue, and the message was received.

gnarea commented 11 months ago

Nice! 👏🏾 So it looks like our suspicion was well-founded. 😄

sdsantos commented 11 months ago

Possible solutions I can think of: a) Handle SocketTimeoutException as a EOFException exception (delay and retry, instead of throwing) b) Handle all IOExceptions as a EOFException exception (delay and retry, instead of throwing) c) Add another restart logic one layer above, in the CollectParcelsFromGateway or PublicSync, if the issue we get is a ServerException

gnarea commented 11 months ago

How about we handle SocketTimeoutException similarly to EOFException (delay and retry), but with a shorter delay? Like 500ms?

I think we're only getting the SocketTimeoutException when we suddenly stop getting pings from the WebSockets server, so it means we were connected to it <=10 seconds ago.

gnarea commented 11 months ago

To elaborate on the alternatives:

I think it'd still be good to have a higher delay with EOFExceptions because those seem only be caused by server errors, so it'd be good to give the server a bit more time to recover.

IIRC from previous testing, IOException is only thrown when we're disconnected from the Internet or the host is down/misconfigured, so I think in that case it'd be good to keep the current functionality so the user can continue to get the screen that says they're disconnected.

sdsantos commented 11 months ago

Here's the PR: https://github.com/relaycorp/awala-poweb-jvm/pull/306

github-actions[bot] commented 11 months ago

:tada: This issue has been resolved in version 1.7.5 :tada:

The release is available on:

Your semantic-release bot :package::rocket: