joffrey-bion / krossbow

A Kotlin multiplatform coroutine-based STOMP client over websockets, with built-in conversions.
MIT License
207 stars 16 forks source link

Software caused connection abort on network disconnection #173

Closed Shusek closed 5 days ago

Shusek commented 2 years ago

Describe the bug

During a websocket connection when user turn off wifi connection he got "Software caused connection abort" exception. This exception is uncatchable so it cause application crash. I'm debugging which component is causing this error but it's pretty hard if anyone could confirm its occurrence it was great.

Reproduction and additional details 1) Connect to stomp by secure protocol 2) Turn off wifi 3) Crash occured

Context

Artifacts used:

Shusek commented 2 years ago

After debugging, I discovered the culprit. Inside KtorWebSocketConnectionAdapter when anyone tries to use outgoing the channel (eg. sendText method) when network connection is closing this will raise an exception.

Overall, this is consistent with the documentation:

Enqueue frame, may suspend if outgoing queue is full. May throw an exception if outgoing channel is already closed so it is impossible to transfer any message. Frames that were sent after close frame could be silently ignored. Please note that close frame could be sent automatically in reply to a peer close frame unless it is raw websocket session.

Summarizing the Krossbow should catch such errors or not allow their occurrence.

joffrey-bion commented 2 years ago

Thanks a lot for the report, and for the investigation.

Inside KtorWebSocketConnectionAdapter when anyone tries to use outgoing the channel (eg. sendText method) when network connection is closing this will raise an exception.

This seems appropriate.

Summarizing the Krossbow should catch such errors or not allow their occurrence.

I'm not sure I'm following. It seems legitimate to me that trying to send frames on a closed session throws an exception. Do you mean that an exception can occur due to automatically sent frames without programmer error?

Shusek commented 2 years ago

I include sample code:


    override fun onViewCreated(view: View, savedInstanceState: Bundle?) {
        viewLifecycleOwner.lifecycleScope.launch {
            try {
                provideStop().connect(
                    BuildConfig.SOCKET_URL,
                    host = null
                ).withJsonConversions(json = JsonParser)

                // disconnect wifi connection

                delay(30000)
            } catch (throwable: Throwable) {
                // the error is not caught 
            }
        }
    }

The reason is sending hearthbeat after socket is aborted and it causes the error:

java.net.SocketException: Software caused connection abort
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.socketRead(SocketInputStream.java:119)
    at java.net.SocketInputStream.read(SocketInputStream.java:176)
    at java.net.SocketInputStream.read(SocketInputStream.java:144)
    at okio.InputStreamSource.read(JvmOkio.kt:93)
    at okio.AsyncTimeout$source$1.read(AsyncTimeout.kt:125)
    at okio.RealBufferedSource.request(RealBufferedSource.kt:206)
    at okio.RealBufferedSource.require(RealBufferedSource.kt:199)
    at okio.RealBufferedSource.readByte(RealBufferedSource.kt:209)
    at okhttp3.internal.ws.WebSocketReader.readHeader(WebSocketReader.kt:119)
    at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.kt:102)
    at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.kt:293)
    at okhttp3.internal.ws.RealWebSocket$connect$1.onResponse(RealWebSocket.kt:195)
    at okhttp3.internal.connection.RealCall$AsyncCall.run(RealCall.kt:519)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641)
    at java.lang.Thread.run(Thread.java:923)

I/Process: Sending signal. PID: 23274 SIG: 9

In my opinion, heartbeat should not be sent when socket is corrupted. To catch such exceptions for this moment i can use CoroutineExceptionHandler inside launch method.

joffrey-bion commented 2 years ago

Got your point, thanks a lot for the sample code. It is indeed incorrect for Krossbow to try to send heart beats when the session is already closed. I will look into that.

The try/catch around the connect call only works for catching exceptions during the connection. Once the session is connected, other exceptions will be thrown through subscription flows, or when calling session methods like send. They can also be accessed via an instrumentation if one is defined in the Krossbow config, if you need a global way of getting "informed". So this part is kind of expected.

Shusek commented 2 years ago

The worst of this bug is it that reconnect mechanism does not recognize such a case. Even when you call withAutoReconnect and something will cut your wifi connection AutoReconnect it will not work.

joffrey-bion commented 2 years ago

I'm confused, does this happen with Ktor or OkHttp? The stack trace points to OkHttp (without any Krossbow methods), but you mention Ktor's outgoing channel.

Also, it'd be great to be able to reproduce this outside of Android's environment, so we could write a test case for it.

Last but not least, do you reproduce the issue in Krossbow 3.1.0?

Shusek commented 2 years ago

Yes, it is reproducible on 3.1.0. KTOR library works like a proxy and has a lot net engine inside (OkHttp, Jetty etc.). In my case i use okHttp like that

val ktorClient = HttpClient(OkHttp) {
            install(WebSockets)
}

But for me it shouldn't matter, probably it can be reproduced on the plain OkHTTP without ktor. sendText is called by HearthBeat action sendHeartBeat and exception invokes in this method:

    override suspend fun sendText(frameText: String) {
        wsSession.outgoing.send(Frame.Text(frameText))
    }

If I disable HearthBeat action SocketException in this case it does not appear anymore.

Simplest hack-workaround for that is global handler can also be added and ignore this exception like that:

         val webSocketClient =
            KtorWebSocketClient(httpClient).withAutoReconnect(
                ReconnectConfig(
                    maxAttempts = Int.MAX_VALUE
)
            )
        val client = StompClient(webSocketClient, StompConfig().apply {
            heartBeat = HeartBeat(5.seconds, 13.seconds)
        })
        val errorHandler = CoroutineExceptionHandler { _, exception ->
            ErrorLogger.logError(exception)
        }
        return client.connect(
            url,
            host = null,
            sessionCoroutineContext = EmptyCoroutineContext + errorHandler
        ).withJsonConversions(json = JsonParser)
joffrey-bion commented 2 years ago

Thanks a lot for the details. Ok so you're using krossbow-werbsocket-ktor with Ktor client using OkHttp engine, thanks.

sendText is called by HearthBeat action sendHeartBeat and exception invokes in this method

If you confirmed this by debugging then that's ok, thanks. I just didn't get why the stacktrace you provided wouldn't include that. Indeed if an error on the socket occurs, the heart beat shouldn't be sent. I can think of 2 possibilities where this could happen in 3.1.0 (maybe there are more):

  1. a heart beat tick happens between the actual websocket error/closure and the cancellation of the heartbeat job
  2. the websocket implementation doesn't bubble up the error at all (Ktor's incoming frames channel doesn't throw) - IMO this would be a bug in the websocket implementation

To prevent the first one, we could cancel the heart beat job early, but that would still be racing with the error bubbling. So it would still be possible that the heart beat kicks in while the error callbacks haven't reached the point where the session is aborted and all internal coroutines cancelled.

In order to prevent heart beat failures, we could check if the websocket is still open before attempting to send a heart beat (canSend). In that case we probably also want to shutdown the session if it can't accept heart beats, in order to fix the problem 2 as well.

As a side note, both StompClient's constructor and withReconnectConfig accept lambdas for the configuration, so you don't have to write Config().apply { ... } yourself:

val webSocketClient = KtorWebSocketClient(httpClient).withAutoReconnect {
    maxAttempts = Int.MAX_VALUE
}

val client = StompClient(webSocketClient) {
    heartBeat = HeartBeat(5.seconds, 13.seconds)
}
mahmed1987 commented 5 days ago

hello @joffrey-bion .

I am facing this exact same issue in version 7.0.0

This was resolved ?

I have my heartbeats set and I was testing a few things in my android application. My backend is using spring boot I took my application to the background for a while and put my cellphone to sleep mode.

I rightfully show on the logs of my spring boot backend that the heartbeats stopped appearing and the server destroyed my session.

When I bought my application back to live , the application crashed with the "Software caused connection abort".

I was able to hit this breakpoint before the app crashed /Users/ahmed/.gradle/caches/modules-2/files-2.1/org.hildan.krossbow/krossbow-stomp-core-iosarm64/7.0.0/aa21a035596090a27ec803e96eb75a6a72dd104a/krossbow-stomp-core-iosarm64-7.0.0-sources.jar!/commonMain/org/hildan/krossbow/stomp/BaseStompSession.kt:83

private suspend fun shutdown(message: String, cause: Throwable? = null) {
        // cancel heartbeats immediately to limit the chances of sending a heartbeat to a closed socket
        heartBeaterJob?.cancel()
        // let other subscribers handle the error/closure before cancelling the scope
        awaitSubscriptionsCompletion()
        scope.cancel(message, cause = cause)
    }

The app control was at " heartBeaterJob?.cancel()" after which it crashed

image image
joffrey-bion commented 5 days ago

Yes this should have been fixed a long time ago. Thanks for reporting that it happens again, and for the details to help reproduce it. I reopened this issue to have another look.

Could you please share a little more details about which artifacts you were using? Are you using Ktor with the OkHttp engine? Or OkHttp directly?

mahmed1987 commented 5 days ago

Oh yes , I am actually using Ktor engine .

joffrey-bion commented 5 days ago

Thanks! And did you have a chance to try this experiment with the latest version of Krossbow?

mahmed1987 commented 5 days ago

yes indeed, my krossbow version is krossbow-version = "7.0.0"

joffrey-bion commented 5 days ago

If it's not too much hassle, could you please confirm that you can still reproduce with Krossbow 8.0.0? I don't think the changes in Krossbow 8 should change this specific behaviour, but I'd just like to be sure to avoid investigating a ghost :)

mahmed1987 commented 5 days ago

@joffrey-bion that is correct. Let me test this out with version 8 and revert back to you

mahmed1987 commented 5 days ago

@joffrey-bion surprisingly , that issue is not appearing in 8.0.0 despite your comment that you didn't change anything that was relevant to this issue .

Lets chalk this one up to some random act of God :D . You can close the issue while I monitor more closely and would revert back to you if i see some unexpected behavior .

Thanks again

joffrey-bion commented 5 days ago

Thank you so much for testing this! Ok I'll close for now, and we'll reopen if it pops up again.