redis / rueidis

A fast Golang Redis client that supports Client Side Caching, Auto Pipelining, Generics OM, RedisJSON, RedisBloom, RediSearch, etc.

Client degradation during node failure #658


justinreddit commented 6 days ago

Hello

We've been using this client with some Redis clusters and have noticed that it doesn't handle node failure gracefully. There are two setups we'd like to support:

  1. Clusters with replicas - we've observed that if the master fails, the client is unable to serve any requests.
  2. Clusters without replicas - we've observed that if a node fails, the client will continue sending requests to unresponsive nodes and timing out.

Is it possible to have the client partition off unresponsive nodes so that it can continue operating?

rueian commented 6 days ago

Hi @justinreddit,

I think the current implementation can discover topology changes and recover from them. Could you elaborate on your setup and which client version you are using?

Clusters with replicas - we've observed that if the master fails, the client is unable to serve any requests.

Just to confirm, your cluster setup is this, instead of this, right?

In the current implementation, the client being unable to serve any requests should only happen if all of your masters fail and no replicas are promoted.

Clusters without replicas - we've observed that if a node fails, the client will continue sending requests to unresponsive nodes and timing out.

I think this is the correct behavior: the client should simply follow the topology it discovered from the CLUSTER SHARDS response to send commands. You can use a shorter context deadline to avoid a long timeout.
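
For example, here is a minimal sketch of bounding a single request with a short per-request deadline (the getWithDeadline helper name and the 300ms value are just placeholders, not recommendations):

```go
package example

import (
	"context"
	"time"

	"github.com/redis/rueidis"
)

// getWithDeadline bounds a single GET with a short deadline so a command
// routed to an unresponsive node fails fast instead of waiting out the
// connection timeouts.
func getWithDeadline(client rueidis.Client, key string) (string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 300*time.Millisecond)
	defer cancel()
	return client.Do(ctx, client.B().Get().Key(key).Build()).ToString()
}
```

The deadline applies per call, so a hung node costs each request at most the deadline rather than the full dial and write timeouts.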

justinreddit commented 5 days ago

I think this is the correct behavior: the client should simply follow the topology it discovered from the CLUSTER SHARDS response to send commands. You can use a shorter context deadline to avoid a long timeout.

Is there a way to provide a node level circuit breaker or manage the node topology directly? We've had a few incidents where a node becomes unresponsive and we'd want to ignore those slots without affecting client latency.

Edit: By setting the cluster to not fail over during a node failure, things work as expected. One thing I did find confusing is that when nodes are unresponsive, even though we had ConnWriteTimeout set to 200ms, latencies spike beyond that. I would expect tail latencies to cap out at 200ms; could you clarify the expected behavior of configuring that value?

rueian commented 5 days ago

Is there a way to provide a node level circuit breaker or manage the node topology directly? We've had a few incidents where a node becomes unresponsive and we'd want to ignore those slots without affecting client latency.

I understand that cutting off unresponsive nodes like this is sometimes desirable, but I currently don't have a good idea of an interface that would allow it. The only option we currently have is wrapping connections by customizing the DialFn.
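
For illustration, a rough sketch of what that wrapping could look like, assuming the DialFn hook receives the destination address, a *net.Dialer, and a *tls.Config; the quarantined map, its policy, and the newClientWithBreaker helper are hypothetical:

```go
package example

import (
	"crypto/tls"
	"errors"
	"net"

	"github.com/redis/rueidis"
)

// quarantined is a hypothetical application-side set of node addresses that
// have been marked unresponsive (for example after repeated timeouts).
var quarantined = map[string]bool{}

func newClientWithBreaker(addrs []string) (rueidis.Client, error) {
	return rueidis.NewClient(rueidis.ClientOption{
		InitAddress: addrs,
		// Wrap the dial step: refuse to connect to quarantined nodes so
		// requests routed to them fail immediately instead of timing out.
		DialFn: func(dst string, dialer *net.Dialer, cfg *tls.Config) (net.Conn, error) {
			if quarantined[dst] {
				return nil, errors.New("node quarantined by application-side breaker")
			}
			if cfg != nil {
				return tls.DialWithDialer(dialer, "tcp", dst, cfg)
			}
			return dialer.Dial("tcp", dst)
		},
	})
}
```

How addresses get into and out of the quarantine set is the part the client does not provide; that policy would live entirely in the application.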

Edit: By setting the cluster to not fail over during a node failure, things work as expected.

That's good news, but also a bit odd. How can disabling failover make things work as expected?

One thing I did find confusing is that when nodes are unresponsive, even though we had ConnWriteTimeout set to 200ms ... could you clarify the expected behavior of configuring that value?

ConnWriteTimeout is used to detect nonresponsiveness from connected TCP connections when you send requests to them. For the timeout of making a new connection, the Dialer.Timeout is applied.

even though we had ConnWriteTimeout set to 200ms, latencies spike beyond that. I would expect tail latencies to cap out at 200ms

I think it is the effect of our auto-retry on read-only commands. All read-only commands will be retried until the context deadline is exceeded by default. You can disable this by setting DisableRetry: true.
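
Putting these options together, here is a sketch of a client configured this way (the newClient helper and the values are illustrative, assuming a client version that exposes these ClientOption fields):

```go
package example

import (
	"net"
	"time"

	"github.com/redis/rueidis"
)

func newClient(addrs []string) (rueidis.Client, error) {
	return rueidis.NewClient(rueidis.ClientOption{
		InitAddress: addrs,
		// Detect unresponsive, already-established connections.
		ConnWriteTimeout: 200 * time.Millisecond,
		// Bound the time spent establishing new TCP connections.
		Dialer: net.Dialer{Timeout: 200 * time.Millisecond},
		// Stop read-only commands from being retried until the context
		// deadline, which is what pushes tail latency past ConnWriteTimeout.
		DisableRetry: true,
	})
}
```

Disabling retries trades availability for latency: failed read-only commands surface errors to the caller instead of being transparently retried until the context deadline.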

justinmir commented 5 days ago

ConnWriteTimeout is used to detect nonresponsiveness from connected TCP connections when you send requests to them. For the timeout of making a new connection, the Dialer.Timeout is applied. One interesting behavior we are seeing is that even with a dialer timeout, client requests to a shard that is down can still take up to 1 second.

This behavior can be narrowed down to the implementation of the connection close hook. See call path: _pipe -> close hook -> lazy refresh -> lazyDo with a 1 second threshold.

Based on my understanding, LazyDo allows an action to be performed lazily, at most once every threshold seconds. It does this by allowing only a single caller to perform the action. If the action has already been performed within the last threshold, that caller sleeps until the threshold has elapsed since the last action. This guarantees that if a user has called LazyDo, the action will happen within the next threshold.

However, this can cause the caller of LazyDo to wait for up to the threshold if the action was only just performed. Notably, a request hitting a failed server can end up waiting for up to the 1-second threshold while handling connection-closed errors. One approach around this is to remove the sleep and only perform the refresh if more than a threshold has passed since the last one; see https://github.com/justinmir/rueidis/pull/1 for more details. A sketch of the pattern as I understand it follows.
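
To make the behavior concrete, here is a hypothetical illustration of that pattern, not the actual rueidis code: a single elected caller sleeps out the remainder of the threshold before running the action.

```go
package example

import (
	"sync/atomic"
	"time"
)

// lazyRunner illustrates the pattern described above: at most one caller
// performs fn per threshold window, and that caller sleeps until the
// threshold has elapsed since the previous run.
type lazyRunner struct {
	last      atomic.Int64 // unix nanos of the last completed run
	running   atomic.Bool  // whether a caller is currently elected
	threshold time.Duration
}

func (l *lazyRunner) LazyDo(fn func()) {
	if !l.running.CompareAndSwap(false, true) {
		return // another caller is already responsible for running fn
	}
	defer l.running.Store(false)
	// The elected caller blocks here; this synchronous sleep is what can add
	// up to a full threshold (1s) of latency to the request that triggered it.
	if wait := l.threshold - time.Since(time.Unix(0, l.last.Load())); wait > 0 {
		time.Sleep(wait)
	}
	l.last.Store(time.Now().UnixNano())
	fn()
}
```

The PR linked above removes the sleep entirely; the suggestion below keeps the delayed refresh but moves the sleep off the request path.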

rueian commented 5 days ago

Hi @justinmir, thank you for catching this. However, only performing a refresh once a threshold has passed is not ideal, because we might miss the chance to refresh early.

Say we do a refresh at t=0; then the next refresh can only happen after t=1. If the topology changes again at t=0.5, that change goes unnoticed, and a following request at t=2 can fail and trigger another refresh.

So a better way of interpreting lazyDo is "skip the Do function only when we know that someone else will already do it later".

That is, we schedule a refresh at t=0, and the refresh happens at t=1. If the topology changes again at t=0.5, we will notice it at t=1, so a following request at t=2 will succeed and not trigger another refresh.

I think we can do the sleep in the same goroutine as the do function.

```go
// Sketch: sleep inside the same goroutine that runs the do (types illustrative).
go func(ts time.Time, ch chan struct{}, fn func() error) {
	time.Sleep(time.Until(ts)) // wait until the scheduled refresh time
	c.do(ch, fn)
}(ts, ch, fn)
```

rueian commented 4 days ago

Hi @justinmir, @justinreddit can this issue be closed?

rueian commented 3 days ago

We have released v1.0.49-alpha.2, which contains a fix for the failover case: https://github.com/redis/rueidis/pull/660. Please give it a try.

nikhillimaje commented 3 days ago

https://github.com/redis/rueidis/issues/384#issuecomment-2453217724