smithy-lang / smithy-kotlin

Smithy code generator for Kotlin (in development)

OkHttp: Retry on SocketException: Broken pipe #962

Open berlix opened 9 months ago

berlix commented 9 months ago

Describe the bug

I am aware of the fix in #896. However, even using SDK versions that include the fix (AWS Kotlin SDK 0.32.0-beta), we run into frequent, unhandled networking exceptions when talking to just about any AWS service, and especially S3.

Expected Behavior

The SDK should retry the request so that application code doesn't have to.

Current Behavior

The request is not retried, and the exception is propagated up to the application:

java.net.SocketException: Broken pipe (Write failed)
    at java.net.SocketOutputStream.socketWrite0(SocketOutputStream.java)
    at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:110)
    at java.net.SocketOutputStream.write(SocketOutputStream.java:150)
    at sun.security.ssl.SSLSocketOutputRecord.deliver(SSLSocketOutputRecord.java:345)
    at sun.security.ssl.SSLSocketImpl$AppOutputStream.write(SSLSocketImpl.java:1309)
    at okio.OutputStreamSink.write(JvmOkio.kt:57)
    at okio.AsyncTimeout$sink$1.write(AsyncTimeout.kt:102)
    at okio.RealBufferedSink.emitCompleteSegments(RealBufferedSink.kt:255)
    at okio.RealBufferedSink.write(RealBufferedSink.kt:146)
    at okhttp3.internal.http1.Http1ExchangeCodec$KnownLengthSink.write(Http1ExchangeCodec.kt:278)
    at okio.ForwardingSink.write(ForwardingSink.kt:29)
    at okhttp3.internal.connection.Exchange$RequestBodySink.write(Exchange.kt:223)
    at okio.RealBufferedSink.emitCompleteSegments(RealBufferedSink.kt:255)
    at okio.RealBufferedSink.write(RealBufferedSink.kt:146)
    at aws.smithy.kotlin.runtime.http.engine.okhttp.InstrumentedSink.write(MetricsInterceptor.kt:49)
    at okio.RealBufferedSink.emitCompleteSegments(RealBufferedSink.kt:255)
    at okio.RealBufferedSink.write(RealBufferedSink.kt:185)
    at okhttp3.internal._RequestBodyCommonKt$commonToRequestBody$1.writeTo(-RequestBodyCommon.kt:36)
    at aws.smithy.kotlin.runtime.http.engine.okhttp.InstrumentedRequestBody.writeTo(MetricsInterceptor.kt:69)
    at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.kt:64)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
    at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.kt:34)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
    at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.kt:95)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
    at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.kt:84)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
    at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.kt:65)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
    at aws.smithy.kotlin.runtime.http.engine.okhttp.MetricsInterceptor.intercept(MetricsInterceptor.kt:30)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
    at okhttp3.internal.connection.RealCall.getResponseWithInterceptorChain$okhttp(RealCall.kt:205)
    at okhttp3.internal.connection.RealCall$AsyncCall.run(RealCall.kt:537)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.lang.Thread.run(Thread.java:829)

aws.smithy.kotlin.runtime.http.HttpException: java.net.SocketException: Broken pipe (Write failed)
    at aws.smithy.kotlin.runtime.http.engine.okhttp.OkHttpEngine.roundTrip(OkHttpEngine.kt:158)
    at aws.smithy.kotlin.runtime.http.engine.okhttp.OkHttpEngine$roundTrip$1.invokeSuspend(OkHttpEngine.kt)
    at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
    at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106)
    at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:584)
    at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:793)
    at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:697)
    at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:684)

As a side note: we tried switching to the CRT client, but there, too, we observed at least one instance of an exception that looks retryable:

aws.smithy.kotlin.runtime.http.HttpException: AWS_ERROR_HTTP_CONNECTION_CLOSED: The connection has closed or is closing.; crtErrorCode=2058 
    at aws.smithy.kotlin.runtime.http.engine.crt.SdkStreamResponseHandler.onResponseComplete(SdkStreamResponseHandler.kt:176)
    at aws.sdk.kotlin.crt.http.HttpRequestUtilKt$asJniStreamResponseHandler$1.onResponseComplete(HttpRequestUtil.kt:59)
    at software.amazon.awssdk.crt.http.HttpStreamResponseHandlerNativeAdapter.onResponseComplete(HttpStreamResponseHandlerNativeAdapter.java:58)

Steps to Reproduce

We observe this mostly (but not only) in long-running tasks that make many concurrent requests to AWS services.

Possible Solution

The SDK could consider java.net.SocketException retryable.

Your Environment

aajtodd commented 9 months ago

Thanks for the issue.

we run into frequent, unhandled networking exceptions when talking to just about any AWS service, and especially S3.

I'd be interested in understanding what types of exceptions you are seeing. Are they all the same as the one you included in this issue (e.g. java.net.SocketException: Broken pipe (Write failed)) or are there others? Do they differ by service?


To consider java.net.SocketException retryable.

The issue is that knowing whether a network exception is retryable, generically for any operation, depends on whether the operation is idempotent. Take the example given: you hit something like a socket-closed error (perhaps the server closed it, or there was some other network issue). What if this failure happened after the server had already modified state? How do we know it's safe to retry at that point?

The answer is we don't (unless the operation is marked as @idempotent). We don't special-case idempotent operations yet; that would be a good addition/feature request.

It's possible we may be able to special-case S3's retry policy to be more relaxed, but again we'd need a good way to classify which networking errors are retryable.


In the meantime, you can always customize the SDK's retry policy to retry whichever kinds of errors you want. This doesn't require changing the RetryStrategy, just the policy that determines which errors are considered retryable.

e.g.

// Imports shown for recent SDK versions; exact packages may vary slightly.
import aws.sdk.kotlin.runtime.http.retries.AwsDefaultRetryPolicy
import aws.sdk.kotlin.services.s3.S3Client
import aws.smithy.kotlin.runtime.retries.policy.RetryDirective
import aws.smithy.kotlin.runtime.retries.policy.RetryErrorType
import aws.smithy.kotlin.runtime.retries.policy.RetryPolicy
import kotlinx.coroutines.runBlocking
import java.io.IOException

// Treat any IOException as transient (and therefore retryable); defer to the default policy for everything else.
class CustomRetryPolicy : RetryPolicy<Any?> {
    override fun evaluate(result: Result<Any?>): RetryDirective =
        when (result.exceptionOrNull()) {
            is IOException -> RetryDirective.RetryError(RetryErrorType.Transient)
            else -> AwsDefaultRetryPolicy.evaluate(result)
        }
}

fun main(): Unit = runBlocking {
    val s3 = S3Client.fromEnvironment {
        retryPolicy = CustomRetryPolicy()
    }

    ...
}

Note: you'll need a dependency on aws.sdk.kotlin:aws-http:<sdk-version> to access AwsDefaultRetryPolicy.

berlix commented 9 months ago

Thank you for your response. I see the problem with non-idempotent requests, and that not all errors (or all kinds of SocketException) should be considered retryable.

And thanks for the hint about the RetryPolicy, we'll try that.

I'd be interested in understanding what types of exceptions you are seeing

I don't have good stats right now because we also use some older SDK versions in some services, but I believe that with the recent versions, the SocketException with Broken pipe (Write failed) is by far the one we see the most, mostly from S3 and sometimes from CloudWatch (PutMetrics). I see you already use some hacks where you check the exception's message to determine retryability; it seems to me that "Write failed" does qualify for retryability, so one idea could be to add that as a discriminator.
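
For illustration, a rough sketch of what such a discriminator could look like in a custom retry policy. WriteFailedRetryPolicy is a hypothetical name, treating "Write failed" as a discriminator is only the idea floated here (not an existing SDK heuristic), and the retry types and AwsDefaultRetryPolicy fallback are reused from the snippet above:

import aws.sdk.kotlin.runtime.http.retries.AwsDefaultRetryPolicy
import aws.smithy.kotlin.runtime.retries.policy.RetryDirective
import aws.smithy.kotlin.runtime.retries.policy.RetryErrorType
import aws.smithy.kotlin.runtime.retries.policy.RetryPolicy
import java.net.SocketException

// Hypothetical policy: retry SocketExceptions whose message indicates a failed write,
// and fall back to the default policy for everything else.
class WriteFailedRetryPolicy : RetryPolicy<Any?> {
    override fun evaluate(result: Result<Any?>): RetryDirective {
        val ex = result.exceptionOrNull()
        return if (ex is SocketException && ex.message?.contains("Write failed") == true) {
            RetryDirective.RetryError(RetryErrorType.Transient)
        } else {
            AwsDefaultRetryPolicy.evaluate(result)
        }
    }
}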

One thing we also see regularly, but I suppose it's unrelated and just a bug in OkHttp, is IllegalStateException: Unbalanced enter/exit from inside OkHttp. That was why we tried switching to the CRT client. The connection-related exception from the CRT client (as seen in my initial comment) doesn't seem to be very specific, unfortunately, and doesn't say whether it was a write or a read that failed.

ianbotsf commented 3 months ago

Hi @berlix, are you still encountering socket exceptions when using OkHttp with more recent versions of the SDK?

madisp commented 3 months ago

Googling aws-kotlin-sdk "Unbalanced enter/exit" brought me here :)

@ianbotsf I'm getting Caused by: java.lang.IllegalStateException: Unbalanced enter/exit when spawning multiple SQS consumers; test: https://gist.github.com/madisp/c3c1e04ad0ef7c34ed66b5545606c846. It seems to happen reliably on my local M1 MacBook when there are 10 or more parallel consumers.

(this is on aws.sdk.kotlin:sqs-jvm:1.1.10)

This might be an issue with the test's use of coroutines, though?

ianbotsf commented 3 months ago

Thanks for the test code, @madisp, but I cannot reproduce the failure on my local Windows machine or my cloud EC2 Linux machine. It's likely that parallelism and timing make this difficult to reproduce across different execution environments.

I see that OkHttp recently fixed an unbalanced enter/exit exception caused by rapid release/re-acquisition of connections (https://github.com/square/okhttp/issues/7381). If you have stable repro code, it would be interesting to see whether it still fails on OkHttp 5.0.0-alpha.12 (smithy-kotlin currently uses 5.0.0-alpha.11). If not, we can prioritize upgrading the OkHttp dependency to the latest alpha version.
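
For anyone wanting to try that, a minimal Gradle (Kotlin DSL) sketch that pins the newer OkHttp alpha onto the classpath, assuming your build already depends on the SDK (the sqs-jvm coordinates and version are taken from the earlier comment in this thread):

dependencies {
    // SDK client from this thread's repro (version as reported above).
    implementation("aws.sdk.kotlin:sqs-jvm:1.1.10")

    // Strictly pin the OkHttp version that the SDK's OkHttp engine pulls in transitively.
    implementation("com.squareup.okhttp3:okhttp") {
        version {
            strictly("5.0.0-alpha.12")
        }
    }
}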

madisp commented 3 months ago

Yup, still getting it with 5.0.0-alpha.12. One thing that makes it easier to reproduce is the number of MESSAGES and CONSUMERS; can you try increasing these to 1000 and 100, respectively?

berlix commented 3 months ago

Apologies for the late response. We haven't observed that specific exception ("Broken pipe") with SDK versions 1.0.13 or 1.1.13. We do still regularly get the following exception, at least when downloading objects from S3 (not sure whether it's related):

aws.smithy.kotlin.runtime.http.HttpException: java.io.IOException: unexpected end of stream on https://s3.eu-west-1.amazonaws.com/...
    at aws.smithy.kotlin.runtime.http.engine.okhttp.OkHttpEngine.roundTrip(OkHttpEngine.kt:158)
    at aws.smithy.kotlin.runtime.http.engine.okhttp.OkHttpEngine$roundTrip$1.invokeSuspend(OkHttpEngine.kt)
    at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
    at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106)
    at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:584)
    at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:793)
    at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:697)
    at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:684)
Caused by: java.io.IOException: unexpected end of stream on https://s3.eu-west-1.amazonaws.com/...
    at okhttp3.internal.http1.Http1ExchangeCodec.readResponseHeaders(Http1ExchangeCodec.kt:209)
    at okhttp3.internal.connection.Exchange.readResponseHeaders(Exchange.kt:111)
    at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.kt:95)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
    at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.kt:34)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
    at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.kt:95)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
    at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.kt:84)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
    at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.kt:65)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
    at aws.smithy.kotlin.runtime.http.engine.okhttp.MetricsInterceptor.intercept(MetricsInterceptor.kt:30)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
    at okhttp3.internal.connection.RealCall.getResponseWithInterceptorChain$okhttp(RealCall.kt:205)
    at okhttp3.internal.connection.RealCall$AsyncCall.run(RealCall.kt:537)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.io.EOFException: \n not found: limit=0 content=…
    at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.kt:335)
    at okhttp3.internal.http1.HeadersReader.readLine(HeadersReader.kt:29)
    at okhttp3.internal.http1.Http1ExchangeCodec.readResponseHeaders(Http1ExchangeCodec.kt:179)
    ... 18 common frames omitted