mp911de / spinach

Scalable Java Disque client
Apache License 2.0
34 stars · 4 forks

Connections timing out triggering a reconnect #12

Closed · mzapletal closed this issue 8 years ago

mzapletal commented 9 years ago

I know this is hard to track down and may just be caused by small network outages. Nevertheless, I would like to ask whether you are aware of this issue and might have a solution for it: we are experiencing connection timeouts to our disque nodes quite frequently (about 3-4 times a day). Each connection timeout then triggers a reconnect to the other node (we are running a cluster of 2 nodes). I see that no spinach code (only netty code) is involved, but is there any advice for configuring spinach (or netty under the hood) to avoid these reconnects?


An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception. [i.n.c.DefaultChannelPipeline] [nioEventLoopGroup-2-4] 
java.io.IOException: Connection timed out
at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[na:1.7.0_79]
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[na:1.7.0_79]
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[na:1.7.0_79]
    at sun.nio.ch.IOUtil.read(IOUtil.java:192) ~[na:1.7.0_79]
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380) ~[na:1.7.0_79]
    at io.netty.buffer.UnpooledUnsafeDirectByteBuf.setBytes(UnpooledUnsafeDirectByteBuf.java:447) ~[netty-buffer-4.0.28.Final.jar:4.0.28.Final]
    at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881) ~[netty-buffer-4.0.28.Final.jar:4.0.28.Final]
    at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242) ~[netty-transport-4.0.28.Final.jar:4.0.28.Final]
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119) ~[netty-transport-4.0.28.Final.jar:4.0.28.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) [netty-transport-4.0.28.Final.jar:4.0.28.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) [netty-transport-4.0.28.Final.jar:4.0.28.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) [netty-transport-4.0.28.Final.jar:4.0.28.Final]
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) [netty-transport-4.0.28.Final.jar:4.0.28.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) [netty-common-4.0.28.Final.jar:4.0.28.Final]
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137) [netty-common-4.0.28.Final.jar:4.0.28.Final]
    at java.lang.Thread.run(Thread.java:745) [na:1.7.0_79]
mp911de commented 9 years ago

@mzapletal That's normal behavior, except that the message is not nice to look at. Something at the TCP/IP layer (or below) gets mixed up or broken, so the connection is no longer functional.

Do you also run into disconnects with other services, or is it just disque/spinach?

mzapletal commented 9 years ago

Thanks for the quick response. I would say we see a few more reconnects for disque/spinach than for other services such as ActiveMQ. Database connections are very stable - or, put the other way round, they are refreshed frequently.
What made me wonder is that the disque clients fail over to a different node, although that should not be a problem given disque's multi-master model.

What about disposing/refreshing spinach/lettuce connections after some period, as done for example by database connection pooling frameworks? Or do you recommend using ClientOptions.pingBeforeActivateConnection?
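
For reference, a minimal sketch of how that option could be enabled. This is an assumption-laden example: it presumes DisqueClient exposes a public setOptions(ClientOptions), as lettuce's RedisClient does, and that DisqueURI.create accepts a disque:// URI.

    import biz.paluch.spinach.DisqueClient;
    import biz.paluch.spinach.DisqueURI;
    import com.lambdaworks.redis.ClientOptions;

    public class PingBeforeActivateExample {
        public static void main(String[] args) {
            // Hypothetical endpoint; adjust to your cluster.
            DisqueClient client = new DisqueClient(DisqueURI.create("disque://localhost:7711"));

            // Assumption: setOptions(ClientOptions) is publicly available on DisqueClient.
            // pingBeforeActivateConnection only validates the connection during the
            // initial connect/reconnect handshake; it does not keep an idle connection alive.
            client.setOptions(new ClientOptions.Builder()
                    .pingBeforeActivateConnection(true)
                    .build());

            client.shutdown();
        }
    }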

mp911de commented 9 years ago

Hm. ClientOptions.pingBeforeActivateConnection only applies to the initial connect/reconnect; it does not keep the connection alive. Do you know on which commands this error happens, and whether it is related to there being no activity on the TCP connection?

I'm not sure whether this could help, but you could try to enable SO_KEEPALIVE (on disque and on spinach). Since spinach has no dedicated option for that, you would need to create a subclass of DisqueClient and override the protected void connectionBuilder(CommandHandler<?, ?> handler, RedisChannelHandler<?, ?> connection, Supplier<SocketAddress> socketAddressSupplier, ConnectionBuilder connectionBuilder, RedisURI redisURI) method with:

    @Override
    protected void connectionBuilder(CommandHandler<?, ?> handler, RedisChannelHandler<?, ?> connection,
            Supplier<SocketAddress> socketAddressSupplier, ConnectionBuilder connectionBuilder, RedisURI redisURI) {
        super.connectionBuilder(handler, connection, socketAddressSupplier, connectionBuilder, redisURI);
        // Enable TCP keep-alive on the underlying netty bootstrap.
        connectionBuilder.bootstrap().option(ChannelOption.SO_KEEPALIVE, true);
    }
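
For completeness, a fuller sketch of such a subclass might look like the following. The package names, the Supplier import, and the constructor are assumptions based on spinach/lettuce 3.x and may need adjusting; the override from above is repeated so the class compiles on its own.

    import java.net.SocketAddress;

    import biz.paluch.spinach.DisqueClient;
    import biz.paluch.spinach.DisqueURI;

    import com.google.common.base.Supplier; // assumption: lettuce 3.x uses Guava's Supplier here
    import com.lambdaworks.redis.ConnectionBuilder;
    import com.lambdaworks.redis.RedisChannelHandler;
    import com.lambdaworks.redis.RedisURI;
    import com.lambdaworks.redis.protocol.CommandHandler;

    import io.netty.channel.ChannelOption;

    public class KeepAliveDisqueClient extends DisqueClient {

        // Constructor assumed; pick whichever DisqueClient constructor your setup uses.
        public KeepAliveDisqueClient(DisqueURI disqueURI) {
            super(disqueURI);
        }

        @Override
        protected void connectionBuilder(CommandHandler<?, ?> handler, RedisChannelHandler<?, ?> connection,
                Supplier<SocketAddress> socketAddressSupplier, ConnectionBuilder connectionBuilder, RedisURI redisURI) {
            super.connectionBuilder(handler, connection, socketAddressSupplier, connectionBuilder, redisURI);
            // Ask the OS to probe idle connections so dead peers are detected earlier.
            connectionBuilder.bootstrap().option(ChannelOption.SO_KEEPALIVE, true);
        }
    }

Note that SO_KEEPALIVE only enables the OS-level probes; how quickly a dead connection is detected depends on the kernel's keep-alive settings.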
mp911de commented 9 years ago

I checked what others say about that topic. Connection timed out is not directly related to Read timeout. A Connection timed out happens when the TCP retransmission timeout is reached. This may happen due to packet loss etc. (see https://community.oracle.com/thread/1148354)

mzapletal commented 9 years ago

Thanks for the thorough investigation. To avoid polluting the logs (assuming this is easy to resolve by reconnecting), what do you think about logging the exception at DEBUG? As far as I remember, the reconnect is logged at INFO anyway.
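
Until such a logging change lands, the noise could also be filtered on the application side. A minimal sketch, assuming logback-classic is the SLF4J backend (the logger name is taken from the [i.n.c.DefaultChannelPipeline] tag in the log output above):

    import ch.qos.logback.classic.Level;
    import ch.qos.logback.classic.Logger;
    import ch.qos.logback.classic.LoggerContext;
    import org.slf4j.LoggerFactory;

    public class SilenceNettyPipelineWarnings {
        public static void main(String[] args) {
            // Assumption: logback-classic backs SLF4J, so the factory can be cast.
            LoggerContext context = (LoggerContext) LoggerFactory.getILoggerFactory();
            // Raise the threshold for the pipeline tail that logs unhandled exceptionCaught() events.
            Logger pipelineLogger = context.getLogger("io.netty.channel.DefaultChannelPipeline");
            pipelineLogger.setLevel(Level.ERROR);
        }
    }

The same effect is usually achieved declaratively in the logging configuration; this is only meant to illustrate which logger emits the message.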

mp911de commented 9 years ago

That's the only thing that's left. I have to adjust the error handling anyway, because:

  1. If a command has been sent and is awaiting an answer, the command is canceled with the exception (that's how it works right now) -> the execution state of the command is unknown, so it is better to let the user know about it instead of pretending everything is fine.
  2. Even though the exception is consumed, it is still logged and pollutes the logs.

I would change two things:

  1. Do not log the exception at all if it was consumed/attached to a command. The user can do that if he's interested (see the sketch after this comment).
  2. Connection reset by peer and Connection timed out IOExceptions will be logged at DEBUG level.

Does this make sense?
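
For point 1, a minimal sketch of how a caller could surface the failure itself. The Future-style handle stands in for whatever the async API returns for a command, and whether the failure surfaces as an ExecutionException or a CancellationException depends on how the client cancels the command:

    import java.util.concurrent.CancellationException;
    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    public final class CommandResultHandling {

        // Waits for a command result and reports failures to the caller instead of
        // relying on the client library to log them.
        public static <T> T awaitOrReport(Future<T> commandResult) throws InterruptedException {
            try {
                return commandResult.get(5, TimeUnit.SECONDS);
            } catch (ExecutionException e) {
                // The command failed, e.g. with "java.io.IOException: Connection timed out".
                // Its execution state on the server is unknown.
                System.err.println("Command failed, state unknown: " + e.getCause());
            } catch (CancellationException e) {
                // The client canceled the command, e.g. during a reconnect.
                System.err.println("Command was canceled before completion");
            } catch (TimeoutException e) {
                System.err.println("No answer within 5 seconds");
            }
            return null;
        }
    }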

mzapletal commented 9 years ago

:+1: makes perfect sense to me

mp911de commented 9 years ago

Will be implemented in https://github.com/mp911de/lettuce/issues/140

mp911de commented 9 years ago

Implemented in lettuce 3.4 (snapshot build). This ticket requires a 3.4 final release to be closed.

mp911de commented 8 years ago

Closing this ticket as I released 0.3 to Maven Central.