Closed gnarea closed 11 months ago
Before I start my investigation, are there any clues regarding where the chain is broken?
Regarding the connection state when you open the Awala app, it may be a separate issue or not, but I'll investigate too.
Thanks for looking into this, @sdsantos!
When this has happened to me, I can see that the Internet gateway (frankfurt.relaycorp.cloud
) receives the parcels that should be collected by the Android gateway. However, I don't see any logs corresponding to the Android gateway attempting to collect parcels... Until I kill Awala and Letro, and open Letro again -- then it works as expected.
BTW, I forgot to mention this before, but at least when it happens to me, it tends to be more frequent when my connectivity is patchy.
Also, parcel delivery doesn't seem to be problematic in any way, as far as I remember when I've checked the logs. It's only parcel collection (which is why I wanted to get the PoWeb logs working).
I'll review the code for the Public Sync. It seems that the Gateway app is giving up on collecting public parcels, due to some unknown failure, and doesn't recover by itself.
Regarding opening the Gateway app and seeing disconnected, on my physical device it can take up around 3 seconds to assess the connection with the public gateway. Half the time is for resolving the PoWeb address and half for the HEAD request. Subsequent checks average 100ms.
(We could have a "Checking" state so this wouldn't scare users opening the app)
But if the Letro app is already running, the Awala app should already be doing this check in the background by itself. So, opening the Awala app when Letro is running should not be showing the Disconnect screen.
Public Sync Details
Public Sync Resilience
The PoWebClient restarts the web-socket connection with the public gateway if it failed due to "Internal Error (1011)". It restarts using method recurrence, which is probably not ideal, because it will fail eventually, in case of too many Internal Errors. But that would probably be a noticeable crash.
If the web-socket connection fails due to another reason, then we don't restart the Public Sync, unless there's a change in the state of the internet/public-gateway connection, change in the Gateway app being foreground or background, or an app unbinds and rebinds to the Gateway app. If this is the issue, then instead of killing Letro + Awala apps, turning off and on the internet connection may be enough to restart the Public Sync. Nevertheless, we may want to add another layer of resilience here, or make the restart work for other errors besides Internal Error
.
it can take up around 3 seconds to assess the connection with the public gateway
I haven't experienced that (it's usually a fraction of a second for me), but I can definitely see that happening, with the DNS query over HTTPS and subsequent HTTP request.
(We could have a "Checking" state so this wouldn't scare users opening the app)
Agreed. Created #712.
It restarts using method recurrence, which is probably not ideal, because it will fail eventually, in case of too many Internal Errors
Are you talking about the repeat(Int.MAX_VALUE)
in wsConnect()
? Yeah, I don't like it either, but I wasn't sure how to implement it in a more reliable and idiomatic manner (suggestions welcomed).
If this is the issue, then instead of killing Letro + Awala apps, turning off and on the internet connection may be enough to restart the Public Sync
Interesting. Will try that next time it happens.
Nevertheless, we may want to add another layer of resilience here, or make the restart work for other errors besides
Internal Error
Yeah, I suspect that's the culprit. But we already handle EOFException
s, IOException
s and their subclasses, and because the app isn't crashing, I guess all exceptions are handled -- the question is whether they're handled correctly.
Are you talking about the
repeat(Int.MAX_VALUE)
inwsConnect()
? Yeah, I don't like it either, but I wasn't sure how to implement it in a more reliable and idiomatic manner (suggestions welcomed).
My mistake, I was looking at the old version of the PoWebClient that the Gateway app is still using in the letro-debugging
branch, and it had method recurrence there. Now it's that infinite loop, which is a better option at least. I'm going to double check the latest Gateway app is using the right version of the PoWebClient and continue investigating.
For the version of the PoWebClient that Gateway app is currently using, the scenarios that can stop a sync in a way that only a network change will resume are:
InvalidMessageException
wrapped into a ServerBindingException
;EOFException
;NORMAL (1000)
reason code.I feel we should still try to restart the connection for those scenarios, even if using a longer delay like 10 seconds.
I'm going to try the Letro + Gateway apps using the new PoWebClient logs, to see if I spot one of the scenarios above.
I was able to reproduce a disconnection that wasn't recovered. The Gateway log:
Could not collect parcels due to server error
tech.relaycorp.relaynet.bindings.pdc.ServerConnectionException: Server is unreachable
at tech.relaycorp.poweb.PoWebClient.wsConnect$poweb(PoWebClient.kt:323)
at tech.relaycorp.poweb.PoWebClient$wsConnect$1.invokeSuspend(Unknown Source:19)
at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106)
at kotlinx.coroutines.internal.LimitedDispatcher$Worker.run(LimitedDispatcher.kt:115)
at kotlinx.coroutines.scheduling.TaskImpl.run(Tasks.kt:103)
at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:584)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:793)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:697)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:684)
Caused by: java.net.SocketTimeoutException: sent ping but didn't receive pong within 5000ms (after 3 successful ping/pongs)
at okhttp3.internal.ws.RealWebSocket.writePingFrame$okhttp(RealWebSocket.kt:563)
at okhttp3.internal.ws.RealWebSocket$initReaderAndWriter$$inlined$synchronized$lambda$1.runOnce(TaskQueue.kt:219)
at okhttp3.internal.concurrent.TaskRunner.runTask(TaskRunner.kt:116)
at okhttp3.internal.concurrent.TaskRunner.access$runTask(TaskRunner.kt:42)
at okhttp3.internal.concurrent.TaskRunner$runnable$1.run(TaskRunner.kt:65)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641)
at java.lang.Thread.run(Thread.java:919)
Turning off and turning back on the Wifi solved the issue, and the message was received.
Nice! 👏🏾 So it looks like our suspicion was well-founded. 😄
Possible solutions I can think of:
a) Handle SocketTimeoutException
as a EOFException
exception (delay and retry, instead of throwing)
b) Handle all IOException
s as a EOFException
exception (delay and retry, instead of throwing)
c) Add another restart logic one layer above, in the CollectParcelsFromGateway
or PublicSync
, if the issue we get is a ServerException
How about we handle SocketTimeoutException
similarly to EOFException
(delay and retry), but with a shorter delay? Like 500ms?
I think we're only getting the SocketTimeoutException
when we suddenly stop getting pings from the WebSockets server, so it means we were connected to it <=10 seconds ago.
To elaborate on the alternatives:
I think it'd still be good to have a higher delay with EOFException
s because those seem only be caused by server errors, so it'd be good to give the server a bit more time to recover.
IIRC from previous testing, IOException
is only thrown when we're disconnected from the Internet or the host is down/misconfigured, so I think in that case it'd be good to keep the current functionality so the user can continue to get the screen that says they're disconnected.
Here's the PR: https://github.com/relaycorp/awala-poweb-jvm/pull/306
:tada: This issue has been resolved in version 1.7.5 :tada:
The release is available on:
v1.7.5
Your semantic-release bot :package::rocket:
We experience this too often whilst testing Letro: Outgoing messages are always sent, but incoming messages are not received.
When this happens and we check the Awala app, it normally says that we're disconnected from the Internet, but it's not uncommon to see that we're connected to Awala via the Internet. Either way, the logs don't show anything relevant.
I suspect the PoWeb library may be the culprit, but we're not seeing its logs due to #709.