reactiverse / aws-sdk

Using vertx-client for AWS SDK v2

Problem with "Connection reset" for DynamoDB client #40

Open draxly opened 4 years ago

draxly commented 4 years ago

I'm using VertxSdkClient.withVertx to create my non-blocking DynamoDB client and it works. However, I see some occurrences of "Connection reset java.net.SocketException: Connection reset" (the full stack trace is given below) while running my application.
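For reference, the client is created roughly like this (a minimal sketch; the region is just a placeholder and credentials come from the default provider chain):

```java
import io.reactiverse.awssdk.VertxSdkClient;
import io.vertx.core.Context;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.dynamodb.DynamoDbAsyncClient;

// Wrap the standard AWS SDK v2 builder so requests run on the Vert.x event loop.
DynamoDbAsyncClient createDynamoClient(Context context) {
    return VertxSdkClient.withVertx(
            DynamoDbAsyncClient.builder()
                    .region(Region.EU_WEST_1),   // placeholder region
            context)
            .build();
}
```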

To try and get some more feedback, I added an exceptionHandler to VertxNioAsyncHttpClient like this:

```java
private HttpClient createVertxHttpClient(Vertx vertx) {
    HttpClientOptions options = new HttpClientOptions()
            .setSsl(true)
            .setKeepAlive(true);

    return vertx.createHttpClient(options).connectionHandler(con -> {
        con.exceptionHandler(err -> {
            logger.error("VertxNioAsyncHttpClient connectionHandler.exceptionHandler: " + err.getMessage(), err);
        });
    });
}
```

This exceptionHandler is getting called. Any ideas what causes the "Connection reset" or what I can do to avoid them?

The full stack trace as logged:

```
VertxNioAsyncHttpClient connectionHandler.exceptionHandler: Connection reset
java.net.SocketException: Connection reset
    at java.base/sun.nio.ch.SocketChannelImpl.throwConnectionReset(SocketChannelImpl.java:345)
    at java.base/sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:376)
    at io.netty.buffer.PooledByteBuf.setBytes(PooledByteBuf.java:247)
    at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1147)
    at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:347)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:148)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:700)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:635)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:552)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:514)
    at io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1044)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Thread.java:830)
```

aesteve commented 4 years ago

Hello, unfortunately no, I have no idea what is going on, and the stack trace doesn't help me that much.

I'd guess some "proxy" or network hardware may be shutting down the connection, but that's a hard one to debug without involving tcpdump or something like that.

I'll leave the issue open, in case someone else has an idea or wants to report the same bug with more details or a more in-depth analysis.

Sorry I couldn't help more on this :\

draxly commented 4 years ago

Thanks aesteve, I appreciate the answer! I will continue digging into it on my end.

Alars-ALIT commented 4 years ago

Reducing the keep alive timeout of the httpClient in VertxNioAsyncHttpClient seems to solve this issue.

```java
HttpClientOptions options = new HttpClientOptions()
        .setSsl(true)
        .setKeepAlive(true)
        .setKeepAliveTimeout(30);
```

The timeout defaults to 60s.

- When setting the timeout to 70s, I get more 'Connection reset's.
- When setting the timeout to 50s, I get a few resets.
- When setting the timeout to 30s, I get no resets.

So I assume AWS may close connections that have been idle for somewhere between 30s and 50s. I can't find any documentation about this, though.

aesteve commented 4 years ago

Are you sure it's AWS though?

Couldn't it be intermediate networking elements (like load balancers, or stuff like that)?

draxly commented 4 years ago

If the DynamoDB service is set up using a "VPC Endpoint for DynamoDB" and the DynamoDB client is configured without any special settings, can there still be some intermediate networking element in between?

aesteve commented 4 years ago

Not sure, really, just trying to investigate out loud here :\

I have had some keep-alive problems with Elastic Load Balancers in the past, where they shut down the connection after 60s (when using Server-Sent Events, for instance).

If just setting a keep-alive timeout on the HTTP client fixes it, that's already a good point.

Just trying to figure out whether this should be documented or whether it only happens in some specific use cases.

wem commented 3 years ago

We can confirm @Alars-ALIT's observation: with a keep-alive timeout of 30s the problem was solved.

Another problem is the retry policy. We get some 'connection closed' exceptions, which look like an AWS server-side issue. The AWS SDK holds a list of exception types it will retry on. Unfortunately, for io.vertx.core.VertxException the SDK will not do any retry, as that condition is missing. I will create an issue and a PR next week, so users have a proper retry policy available.
Another problem is the retry policy. We get some connection closed exceptions, what's looks like an AWS server issue. The AWS SDK holds a list of exception types it will retry for. Unfortunatelly, for io.vertx.core.VertxException the SDK will not do any retry as the condition is missing. I will create an issue and PR next week, so the users would have a proper retry policy available.