Topology refresh on consistent timeout

GilboaAWS commented 2 years ago

Bug Report

Current Behavior

While working with Lettuce against a Redis cluster, when one of the nodes gets stuck but doesn't crash (e.g. the process is paused by attaching gdb), the node stops replying, which leads to operation timeouts. In this case, the other nodes mark the node as FAIL/PFAIL, but Lettuce has no idea about it. None of the topology refresh options, periodic or adaptive, take timeouts into account. The closest adaptive trigger is PERSISTENT_RECONNECTS, but in this case the connection watchdog sees everything as OK, because the TCP connection lives in the kernel, which keeps buffering the data sent to the stuck Redis node.

I know timeouts can occur for many reasons, e.g. a low command timeout combined with a huge key-value, or just an unreasonable command timeout, but I think this is something that should be configurable.

Expected behavior/code

A topology refresh is triggered when commands time out.

Environment

Lettuce version(s): 6.0.5.RELEASE
Redis version: 6.2.5

Possible Solution

An option to trigger a topology refresh upon timeouts: add a mechanism that counts the number of timeouts in a configurable period of time and triggers an adaptive topology refresh if the count exceeds a configurable threshold.
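The counting mechanism could look roughly like the sketch below. This is only an illustration of the idea, not existing Lettuce code: `TimeoutRefreshTrigger`, its constructor parameters, and `recordTimeout()` are hypothetical names, and the refresh itself is passed in as a plain `Runnable` so the sketch uses only the JDK.

```java
import java.time.Duration;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.LongSupplier;

/** Hypothetical helper (not part of Lettuce): fires an action when too many timeouts occur within a window. */
final class TimeoutRefreshTrigger {

    private final int threshold;          // fire once the count exceeds this value
    private final long windowNanos;       // length of the sliding window
    private final Runnable refreshAction; // assumed to start a topology refresh
    private final LongSupplier nanoClock; // injectable clock, mainly for testing
    private final Deque<Long> timeouts = new ArrayDeque<>();

    TimeoutRefreshTrigger(int threshold, Duration window, Runnable refreshAction, LongSupplier nanoClock) {
        this.threshold = threshold;
        this.windowNanos = window.toNanos();
        this.refreshAction = refreshAction;
        this.nanoClock = nanoClock;
    }

    TimeoutRefreshTrigger(int threshold, Duration window, Runnable refreshAction) {
        this(threshold, window, refreshAction, System::nanoTime);
    }

    /** Call whenever a command times out. Returns true if the refresh action was fired. */
    boolean recordTimeout() {
        boolean fire;
        synchronized (timeouts) {
            long now = nanoClock.getAsLong();
            timeouts.addLast(now);
            // Drop timeouts that fell out of the sliding window.
            while (now - timeouts.peekFirst() > windowNanos) {
                timeouts.removeFirst();
            }
            fire = timeouts.size() > threshold;
            if (fire) {
                timeouts.clear(); // start counting from scratch after firing
            }
        }
        if (fire) {
            refreshAction.run(); // run outside the lock so a slow refresh does not block callers
        }
        return fire;
    }
}
```

In an application, `recordTimeout()` would presumably be called wherever a command fails with `RedisCommandTimeoutException`, and the `Runnable` could request a refresh on the cluster client, for example via `RedisClusterClient.refreshPartitions()`. That wiring is an assumption about how the pieces would fit together, not something Lettuce does today.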

mp911de commented 1 year ago

Redis doesn't have the friendliest protocol as each thing we send to the server is considered a command. There's no ping frame or similar that would yield a response even when the server runs a blocking command. As long as the remote accepts/holds a TCP connection, we have to consider the connection healthy.

tishun commented 2 months ago

Unfortunately, this seems like a very specific issue and we might not get to it. Any contributions are welcome, but unless there is traction from the community we will leave it on ice for now.
