relaycorp / awala-poweb-jvm

JVM implementation of the Awala PoWeb binding
MIT License
1 stars 1 forks source link

`SocketTimeoutException`s from WebSockets connections isn't handled #302

Open gnarea opened 1 year ago

gnarea commented 1 year ago

The following doesn't appear to have any effect (SocketTimeoutException is a subclass of IOException):

https://github.com/relaycorp/awala-poweb-jvm/blob/380529fb1028ff22dc8da843223d66a9e90a26f7/src/main/kotlin/tech/relaycorp/poweb/PoWebClient.kt#L325-L327

Because Letro is crashing when the connection to the Awala server stops responding to pings:

E  FATAL EXCEPTION: AwalaManagerMessageReceiverThread
Process: tech.relaycorp.letro, PID: 14200
java.net.SocketTimeoutException: sent ping but didn't receive pong within 5000ms (after 1 successful ping/pongs)
    at okhttp3.internal.ws.RealWebSocket.writePingFrame$okhttp(RealWebSocket.kt:563)
    at okhttp3.internal.ws.RealWebSocket$initReaderAndWriter$lambda-3$$inlined$schedule$1.runOnce(TaskQueue.kt:219)
    at okhttp3.internal.concurrent.TaskRunner.runTask(TaskRunner.kt:116)
    at okhttp3.internal.concurrent.TaskRunner.access$runTask(TaskRunner.kt:42)
    at okhttp3.internal.concurrent.TaskRunner$runnable$1.run(TaskRunner.kt:65)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641)
    at java.lang.Thread.run(Thread.java:923)
    Suppressed: kotlinx.coroutines.internal.DiagnosticCoroutineContextException: [StandaloneCoroutine{Cancelling}@a67a425, java.util.concurrent.ScheduledThreadPoolExecutor@bc259fa[Running, pool size = 1, active threads = 1, queued tasks = 0, completed tasks = 4]]
sdsantos commented 1 year ago

From what I reproduced, it was actually working and wrapping in a ServerConnectionException. But now this requirements changed, since we added separate handling for SocketTimeoutException.

gnarea commented 1 year ago

Indeed. Though it's interesting that wsConnect() isn't in the stack trace, so I guess that's why we got the uncaught exception. Maybe it's something we need to handle at the flow level?

sdsantos commented 1 year ago

Yup, it should be something like that. I'll investigate tomorrow.

sdsantos commented 1 year ago

From what I can tell it's being handled at the Flow level. Any steps to reproduce this issue? I haven't been able to.

I also noticed that the PDC interface is outdated. These throws come from inside the Flow, not from the method call: https://github.com/relaycorp/awala-jvm/blob/bb426493379c605eec0a0c628f8c1dab82f5f6ad/src/main/kotlin/tech/relaycorp/relaynet/bindings/pdc/PDCClient.kt#L54

gnarea commented 1 year ago

From what I can tell it's being handled at the Flow level. Any steps to reproduce this issue? I haven't been able to.

Handled at the flow level by whom?

Not even disconnecting the router from the Internet did the trick?

I haven't experienced this myself. @mgulyaev10, any idea on how to reproduce this? I suspect it may be intermittent and happen rarely.

I also noticed that the PDC interface is outdated. These throws come from inside the Flow, not from the method call: https://github.com/relaycorp/awala-jvm/blob/bb426493379c605eec0a0c628f8c1dab82f5f6ad/src/main/kotlin/tech/relaycorp/relaynet/bindings/pdc/PDCClient.kt#L54

Thanks! Fixed in https://github.com/relaycorp/awala-jvm/pull/300

sdsantos commented 1 year ago

From what I can tell it's being handled at the Flow level. Any steps to reproduce this issue? I haven't been able to.

Handled at the flow level by whom?

Not even disconnecting the router from the Internet did the trick?

That only triggered a wrapped ServerConnectionException that is being handled.

sdsantos commented 11 months ago

Looks like KTOR as an history of uncatchable exceptions that don't reference the app source code. Here's at least once that mentions SocketTimeoutException, although with a different stacktrace: https://youtrack.jetbrains.com/issue/KTOR-577/java.net.SocketTimeoutException-with-no-lines-referencing-my-code

Since our stacktraces also don't mention our own source code, we might not be able to handle the exception regularly. Maybe we can fix the underlying issue causing the timeout, which might not be possible. Other solutions look to be pretty last resort like CoroutineExceptionHandler or Thread.setDefaultUncaughtExceptionHandler, which don't sound great.

gnarea commented 11 months ago

Thanks for looking into this @sdsantos!

That doesn't look good :(

I guess even if we wanted to use CoroutineExceptionHandler or Thread.setDefaultUncaughtExceptionHandler, we still wouldn't know for sure if that SocketTimeoutException came from our Ktor client anyway, so it wouldn't really work.

sdsantos commented 11 months ago

I guess even if we wanted to use CoroutineExceptionHandler or Thread.setDefaultUncaughtExceptionHandler, we still wouldn't know for sure if that SocketTimeoutException came from our Ktor client anyway, so it wouldn't really work.

At least with CoroutineExceptionHandler you can wrap it around a specific part of the code. But it could be catching more than we want yes.

gnarea commented 11 months ago

That's a bit better then, but I'd be concerned with the code complexity to restart/end the WebSocket connection when that happens.

Since this has only happened once in Letro in the past 90 days, let's put a pin on this for now and revisit if it happens more often. Hopefully that'll buy enough time until Jet Brains fix this.