berlix opened 9 months ago
Thanks for the issue.
> we run into frequent, unhandled networking exceptions when talking to just about any AWS service, and especially S3.

I'd be interested in understanding what types of exceptions you are seeing. Are they all the same as the one you included in this issue (e.g. `java.net.SocketException: Broken pipe (Write failed)`), or are there others? Do they differ by service?
> To consider `java.net.SocketException` retryable.
The issue is that knowing if a network exception is retryable generically for any operation depends on whether the operation is idempotent. Take the example given, you hit something like a socket closed issue (perhaps the server closed it or there was some other network issue). What if this failure happened after the server modified state? How do we know it's safe to retry at that point?
The answer is: we don't (unless the operation is marked as `@idempotent`). We don't special-case idempotent operations yet; that would be a good addition/feature request.
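To illustrate the constraint, here is a minimal sketch in plain Kotlin (the function name and signature are hypothetical, not SDK API):

```kotlin
import java.io.IOException

// Hypothetical sketch (not SDK API): a generic client can only safely retry
// a network failure when the operation is known to be idempotent, because
// the request may have reached the server and mutated state before the
// connection died.
fun shouldRetry(operationIsIdempotent: Boolean, failure: Throwable): Boolean {
    // A connection-level failure tells us nothing about whether the server
    // already applied the request, so retryability hinges on idempotency.
    val isNetworkFailure = failure is IOException
    return isNetworkFailure && operationIsIdempotent
}
```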
It's possible we may be able to special case S3's retry policy to be more relaxed but again we'd need a good way to classify which networking errors are retryable.
In the meantime you can always customize the retry policy for the SDK to retry whatever kind of error you want. This doesn't require changing the `RetryStrategy`, just the policy for what errors are considered retryable.
e.g.
```kotlin
class CustomRetryPolicy : RetryPolicy<Any?> {
    override fun evaluate(result: Result<Any?>): RetryDirective {
        return when (result.exceptionOrNull()) {
            is IOException -> RetryDirective.RetryError(RetryErrorType.Transient)
            else -> AwsDefaultRetryPolicy.evaluate(result)
        }
    }
}

fun main(): Unit = runBlocking {
    val s3 = S3Client.fromEnvironment {
        retryPolicy = CustomRetryPolicy()
    }
    // ...
}
```
Note: You'll need a dependency on `aws.sdk.kotlin:aws-http:<sdk-version>` to access `AwsDefaultRetryPolicy`.
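For reference, that dependency can be declared in Gradle (Kotlin DSL) roughly as follows; keep the `<sdk-version>` placeholder in sync with your other SDK artifacts:

```kotlin
dependencies {
    // provides AwsDefaultRetryPolicy
    implementation("aws.sdk.kotlin:aws-http:<sdk-version>")
}
```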
Thank you for your response. I see the problem with non-idempotent requests and that not all errors (or all kinds of `SocketException`) should be considered retryable. And thanks for the hint about the `RetryPolicy`, we'll try that.
> I'd be interested in understanding what types of exceptions you are seeing
I don't have good stats right now, because we also use some older SDK versions in some services, but I believe that with recent versions, the `SocketException` with `Broken pipe (Write failed)` is the one we see the most by far, mostly from S3 and sometimes from CloudWatch (`PutMetrics`). I see you already use some hacks where you check the exception's message to determine retryability - it seems to me that "Write failed" does qualify for retryability, so one idea could be to add that as a discriminator.
One thing we also see regularly, but I suppose it's unrelated and just a bug in OkHttp, is `IllegalStateException: Unbalanced enter/exit` from inside OkHttp. That was why we tried switching to the CRT client. Now the connection-related exception from the CRT client (as seen in my initial comment) doesn't seem to be very specific, unfortunately, and doesn't say whether it was a write or a read that failed.
Hi @berlix, are you still encountering socket exceptions when using OkHttp with more recent versions of the SDK?
Googling aws-kotlin-sdk "Unbalanced enter/exit" brought me here :)
@ianbotsf I'm getting `Caused by: java.lang.IllegalStateException: Unbalanced enter/exit` when spawning multiple SQS consumers; test: https://gist.github.com/madisp/c3c1e04ad0ef7c34ed66b5545606c846. It seems to reliably happen on my local M1 MacBook when there are 10 or more parallel consumers. (This is on `aws.sdk.kotlin:sqs-jvm:1.1.10`.)
This might be a test coroutines issue though?
Thanks for the test code @madisp but I cannot reproduce the failure on my local Windows machine or my cloud EC2 Linux machine. It's likely that parallelism and timings are making this difficult to reproduce in different executing environments.
I see that OkHttp recently fixed an unbalanced enter/exit exception caused by rapid release/re-acquisition of connections (https://github.com/square/okhttp/issues/7381). If you have stable repro code, it would be interesting to see if it still fails on OkHttp 5.0.0-alpha.12 (smithy-kotlin currently uses 5.0.0-alpha.11). If not then we can prioritize upgrading the OkHttp dependency to the latest alpha version.
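If it helps to test before an SDK release ships, the newer OkHttp alpha can likely be forced via a direct Gradle (Kotlin DSL) dependency; the coordinates below are the standard OkHttp ones, but compatibility with smithy-kotlin is an assumption:

```kotlin
dependencies {
    // override smithy-kotlin's transitive OkHttp 5.0.0-alpha.11
    implementation("com.squareup.okhttp3:okhttp:5.0.0-alpha.12")
}
```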
Yup, still getting it with `5.0.0-alpha.12`. One thing that makes it easier to reproduce is the number of `MESSAGES` and `CONSUMERS` - can you try increasing these to `1000` and `100` respectively?
Apologies for the late response. We haven't observed that specific exception ("Broken pipe") with SDK versions 1.0.13 or 1.1.13. We do still regularly get the following exception, at least when downloading objects from S3 - no clue if it's related:
```
aws.smithy.kotlin.runtime.http.HttpException: java.io.IOException: unexpected end of stream on https://s3.eu-west-1.amazonaws.com/...
	at aws.smithy.kotlin.runtime.http.engine.okhttp.OkHttpEngine.roundTrip(OkHttpEngine.kt:158)
	at aws.smithy.kotlin.runtime.http.engine.okhttp.OkHttpEngine$roundTrip$1.invokeSuspend(OkHttpEngine.kt)
	at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
	at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106)
	at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:584)
	at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:793)
	at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:697)
	at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:684)
Caused by: java.io.IOException: unexpected end of stream on https://s3.eu-west-1.amazonaws.com/...
	at okhttp3.internal.http1.Http1ExchangeCodec.readResponseHeaders(Http1ExchangeCodec.kt:209)
	at okhttp3.internal.connection.Exchange.readResponseHeaders(Exchange.kt:111)
	at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.kt:95)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
	at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.kt:34)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
	at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.kt:95)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
	at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.kt:84)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
	at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.kt:65)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
	at aws.smithy.kotlin.runtime.http.engine.okhttp.MetricsInterceptor.intercept(MetricsInterceptor.kt:30)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
	at okhttp3.internal.connection.RealCall.getResponseWithInterceptorChain$okhttp(RealCall.kt:205)
	at okhttp3.internal.connection.RealCall$AsyncCall.run(RealCall.kt:537)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.io.EOFException: \n not found: limit=0 content=…
	at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.kt:335)
	at okhttp3.internal.http1.HeadersReader.readLine(HeadersReader.kt:29)
	at okhttp3.internal.http1.Http1ExchangeCodec.readResponseHeaders(Http1ExchangeCodec.kt:179)
	... 18 common frames omitted
```
**Describe the bug**

I am aware of the fix in #896. However, even using SDK versions that include the fix (AWS Kotlin SDK 0.32.0-beta), we run into frequent, unhandled networking exceptions when talking to just about any AWS service, and especially S3.
**Expected Behavior**

To retry the request, so that application code doesn't have to.
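Until the SDK classifies these errors as retryable, application code can wrap calls in a small retry helper; a minimal sketch in plain Kotlin (the name, arity, and defaults are hypothetical, for illustration only):

```kotlin
import java.io.IOException

// Hypothetical application-side workaround: retry a block of code a fixed
// number of times when it throws an IOException, rethrowing the last
// failure once the attempts are exhausted.
fun <T> retryOnIOException(attempts: Int = 3, block: () -> T): T {
    var last: IOException? = null
    repeat(attempts) {
        try {
            return block()
        } catch (e: IOException) {
            last = e // transient network failure: try the call again
        }
    }
    throw last ?: IllegalArgumentException("attempts must be positive")
}
```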
**Current Behavior**

The request is not retried, and the exception is propagated up to the application:

As a side note: we tried switching to the CRT client, but there we've also observed at least one instance of an exception that looks retryable:
**Steps to Reproduce**

We observe this mostly (but not only) in long-running tasks that make many concurrent requests to AWS services.
**Possible Solution**

To consider `java.net.SocketException` retryable.

**Your Environment**