Closed mzapletal closed 8 years ago
@mzapletal That's normal behavior, except that the message is not nice to look at. Something at the TCP/IP layer (or below) gets mixed up or broken, so the connection is no longer functional.
Do you also run into disconnects of other services, or is it just disque/spinach?
Thanks for the quick response. I would say a few more reconnects happen for disque/spinach than we experience for other services such as ActiveMQ. Database connections are very stable, or, put the other way round, are refreshed frequently.
What made me wonder is that the disque clients fail over to a different node, which should not be a problem thanks to the multi-master model of disque.
What about disposing/refreshing spinach/lettuce connections after some period, as is done for example by database connection pooling frameworks? Or do you recommend using `ClientOptions.pingBeforeActivateConnection`?
Hm. `ClientOptions.pingBeforeActivateConnection` is only relevant for the initial connect/reconnect but does not keep the connection alive. Do you know on which commands this error happens, and whether it is related to no activity on the TCP connection?
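For reference, the idea behind ping-before-activate can be sketched with plain `java.net` sockets (a hypothetical stand-in, not the actual spinach/lettuce implementation): the client sends a PING on the freshly opened connection and only treats the connection as active once a PONG arrives.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

public class PingBeforeActivate {
    public static void main(String[] args) throws Exception {
        // Tiny stand-in server that answers PING with +PONG
        // (a sketch of a Redis-protocol-style handshake, not real disque).
        ServerSocket server = new ServerSocket(0);
        Thread serverThread = new Thread(() -> {
            try (Socket s = server.accept();
                 BufferedReader in = new BufferedReader(new InputStreamReader(s.getInputStream()));
                 PrintWriter out = new PrintWriter(s.getOutputStream(), true)) {
                if ("PING".equals(in.readLine())) {
                    out.println("+PONG");
                }
            } catch (IOException ignored) {
            }
        });
        serverThread.start();

        // Client side: validate the connection with PING before handing it out.
        try (Socket socket = new Socket("localhost", server.getLocalPort())) {
            socket.setSoTimeout(2000); // don't hang forever on a dead peer
            PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
            BufferedReader in = new BufferedReader(new InputStreamReader(socket.getInputStream()));
            out.println("PING");
            String reply = in.readLine();
            boolean active = "+PONG".equals(reply);
            System.out.println("connection active: " + active);
        }
        serverThread.join();
        server.close();
    }
}
```

This only verifies the connection once at setup time, which is why it does not help against timeouts that occur later on an idle connection.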
I'm not sure whether this could help, but you could try to enable `SO_KEEPALIVE` (on disque and on spinach). Since spinach has no dedicated option for that, you would be required to create a subclass of `DisqueClient` and override the protected `connectionBuilder(CommandHandler<?, ?> handler, RedisChannelHandler<?, ?> connection, Supplier<SocketAddress> socketAddressSupplier, ConnectionBuilder connectionBuilder, RedisURI redisURI)` method with:
```java
@Override
protected void connectionBuilder(CommandHandler<?, ?> handler, RedisChannelHandler<?, ?> connection,
        Supplier<SocketAddress> socketAddressSupplier, ConnectionBuilder connectionBuilder, RedisURI redisURI) {
    super.connectionBuilder(handler, connection, socketAddressSupplier, connectionBuilder, redisURI);
    connectionBuilder.bootstrap().option(ChannelOption.SO_KEEPALIVE, true);
}
```
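The same option can be seen at the plain `java.net` level; a minimal sketch (independent of netty/spinach) just to show that keep-alive is an ordinary per-socket TCP option:

```java
import java.net.ServerSocket;
import java.net.Socket;

public class KeepAliveDemo {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(0);
             Socket socket = new Socket("localhost", server.getLocalPort())) {
            System.out.println("keep-alive default: " + socket.getKeepAlive());
            // Enable TCP keep-alive probes so the OS detects dead peers
            // even when the application sends no traffic.
            socket.setKeepAlive(true);
            System.out.println("keep-alive enabled: " + socket.getKeepAlive());
        }
    }
}
```

Note that the probe interval and count are OS-level settings (e.g. `net.ipv4.tcp_keepalive_time` on Linux), so enabling the option alone may still mean hours before a dead connection is detected.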
I checked what others say about that topic. Connection timed out is not directly related to Read timeout. A Connection timed out happens when the TCP retransmission timeout is reached. This may happen due to packet loss etc. (see https://community.oracle.com/thread/1148354)
Thanks for the thorough investigation. To avoid polluting logs (assuming that this is easy to resolve by reconnecting), what do you think about logging the exception on debug level? As far as I remember, the reconnect is logged on info anyway.
That's the only thing that's left. I have to adjust the error handling anyway. I would change two things: IOExceptions will be logged on the DEBUG level. Does this make sense?
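With `java.util.logging` as a stand-in (lettuce actually uses a different logging facade), demoting IOExceptions to debug level while keeping other errors visible could look roughly like this:

```java
import java.io.IOException;
import java.util.logging.ConsoleHandler;
import java.util.logging.Level;
import java.util.logging.Logger;

public class DebugLogging {
    private static final Logger log = Logger.getLogger(DebugLogging.class.getName());

    // Hypothetical handler, not lettuce's real method name.
    static void handleChannelException(Throwable cause) {
        if (cause instanceof IOException) {
            // Expected transient network failure: the reconnect handles it,
            // so keep the noise out of INFO-level logs.
            log.log(Level.FINE, "I/O error, reconnect will recover", cause);
        } else {
            log.log(Level.WARNING, "unexpected error", cause);
        }
    }

    public static void main(String[] args) {
        log.setUseParentHandlers(false);
        ConsoleHandler handler = new ConsoleHandler();
        handler.setLevel(Level.ALL);
        log.addHandler(handler);

        log.setLevel(Level.INFO); // FINE (debug) messages are suppressed here
        handleChannelException(new IOException("Connection timed out"));
        System.out.println("IOException logged at FINE, hidden at INFO level");
    }
}
```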
:+1: makes perfect sense to me
Will be implemented in https://github.com/mp911de/lettuce/issues/140
Implemented in lettuce 3.4 (snapshot build). This ticket requires a 3.4 final release to be closed.
Closing this ticket as I released 0.3 to Maven Central.
I know this is pretty hard to track down and it may just happen due to simple/small network outages. Nevertheless, I would like to ask whether you are aware of this issue and may have a solution for it: we are experiencing connection timeouts to our disque nodes quite frequently (about 3-4 times a day). Such a connection timeout then involves a reconnect to the other node (we are running a cluster of 2 nodes). I see that there is no spinach code (only netty code) involved, but is there any advice for configuring spinach (or netty under the hood) to avoid these reconnects?