redis-rb / redis-cluster-client

Redis cluster-aware client for Ruby
https://rubygems.org/gems/redis-cluster-client
MIT License

Client is raising `Redis::Cluster::NodeMightBeDown` when Redis Cluster is observed to be healthy #368

Closed by slai11 2 months ago

slai11 commented 3 months ago

Issue

A small but non-trivial percentage of my Sidekiq jobs are raising Redis::Cluster::NodeMightBeDown. I understand that this error is raised in the find_node method after 3 retries, where @node.reload! is called on each retry in an attempt to fix the @topology.clients and @slots hash.
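For illustration, here is a simplified paraphrase of that retry pattern (a sketch from my reading of router.rb, not the gem's exact source; the rescued exception name is paraphrased):

```ruby
# Sketch of the retry pattern described above, not the gem's exact code:
# each failed lookup reloads the cluster topology, and once the retries
# are exhausted NodeMightBeDown is raised to the caller.
def find_node(node_key, retry_count: 3)
  @node.find_by(node_key)
rescue ::RedisClient::Cluster::Node::ReloadNeeded # exception name paraphrased
  raise ::RedisClient::Cluster::NodeMightBeDown if retry_count <= 0

  retry_count -= 1
  @node.reload! # opens a fresh connection, calls CLUSTER NODES, rebuilds @topology.clients and @slots
  retry
end
```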

Setup

For context, the affected Redis server is a 3-shard Redis Cluster. Looking at the observability metrics, we are fairly confident that the cluster state was healthy during the incident window (~2 hours). If the cluster state had been unhealthy, the impact would have been much more severe.

We also configure REDIS_CLIENT_MAX_STARTUP_SAMPLE=1.

Part of the stack trace:

redis-cluster-client (0.8.2) lib/redis_client/cluster/router.rb:201:in `rescue in find_node'
 redis-cluster-client (0.8.2) lib/redis_client/cluster/router.rb:198:in `find_node'
 redis-cluster-client (0.8.2) lib/redis_client/cluster/router.rb:159:in `assign_node'
 redis-cluster-client (0.8.2) lib/redis_client/cluster/router.rb:60:in `send_command'
 redis-cluster-client (0.8.2) lib/redis_client/cluster.rb:35:in `call_v'
 redis-clustering (5.2.0) lib/redis/cluster/client.rb:85:in `block in call_v'
 redis-clustering (5.2.0) lib/redis/cluster/client.rb:104:in `handle_errors'
 redis-clustering (5.2.0) lib/redis/cluster/client.rb:85:in `call_v'
 redis (5.2.0) lib/redis.rb:152:in `block in send_command'
 redis (5.2.0) lib/redis.rb:151:in `synchronize'
 redis (5.2.0) lib/redis.rb:151:in `send_command'
 redis (5.2.0) lib/redis/commands/strings.rb:95:in `set'

I'm running the following gem versions (visible in the stack trace above): redis-cluster-client 0.8.2, redis-clustering 5.2.0, redis 5.2.0.

Investigation details

We observed an increase in incoming new TCP connections on 1 of the 3 VMs running a master redis-server process. This matches the 3 @node.reload! retries, each of which opens a new connection to call CLUSTER NODES and closes it afterwards.

I've ruled out server-side network issues, since redis-client would raise a ConnectionError in that case. I verified this while attempting to reproduce the issue locally with a Redis Cluster configured with very low maxclients and tcp-backlog values; I ended up with RedisClient::Cluster::InitialSetupError when trying to reload the nodes.
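For reference, the local repro used configuration along these lines (the exact values below are illustrative placeholders, not the ones from my test; maxclients and tcp-backlog are standard redis.conf directives):

```
# Illustrative redis.conf excerpt for the local repro: starve the server
# of connection capacity so new TCP connections are refused or dropped.
maxclients 8
tcp-backlog 1
```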

I've been unable to reproduce this behaviour locally (will update when I do). The client suggests that the server is down, but the server seems fine. Could there be a form of server response that could lead to an incomplete client/slot map?

To the maintainers: in your experience developing this client library, is this (a small percentage of NodeMightBeDown with a seemingly healthy cluster) behaviour something that could have happened?

Linking issue for reference: https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/3715

supercaracal commented 3 months ago

Could there be a form of server response that could lead to an incomplete client/slot map?

I think the setting REDIS_CLIENT_MAX_STARTUP_SAMPLE=1 is a trade-off between server load and the reliability of the cluster state information. When a node temporarily has a stale state, our client may end up in a mess because of that node. So I set the environment variable to a large value in our test cases that involve unhealthy states.

https://github.com/redis-rb/redis-cluster-client/blob/628276b0266aaa66c49c8d19467e4135a5070567/test/test_against_cluster_broken.rb#L84-L93

To the maintainers: in your experience developing this client library, is this (a small percentage of NodeMightBeDown with a seemingly healthy cluster) behaviour something that could have happened?

I experienced that our CI was flaky when nodes replied with inconsistent responses to the CLUSTER NODES command. But it occurred almost always during start-up phases.

Is there a possibility that something was happening on your cluster bus even though the state was healthy? Or there may be some bugs in our client; I'll look into it later. Unfortunately, since I use redis gem v4 at work, I have no experience using the redis-cluster-client gem with a long-running cluster in a production environment.

supercaracal commented 3 months ago

slai11 commented 3 months ago

We found the Redis Cluster service IP is the same as one of the pod IPs (the lead node). When a request was sent to this pod, it was actually sent to the service, which might redirect it to any of the pods.

Thanks for sharing the links. I found the first one during the initial stages of my investigation too, but it is not entirely relevant, as we deploy the Redis cluster on VMs with fixed DNS names (no chance of the service/pod IP confusion).

Is there a possibility that something was happening on your cluster bus even though the state was healthy?

It could be possible. For context, the cluster's cluster-node-timeout is set to 5000ms. The actual production Redis Cluster has 3 shards, each containing 1 primary and 2 replicas, totalling 9 VMs and redis-server processes (apologies for the discrepancy above; I wanted to keep the issue description simpler). We observed that for 8 of the 9 redis-server processes, the rate of pings received and pongs sent dropped by ~12%, i.e. 1/9. More details in https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/3714#note_2040420223 if you are curious.
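For reference, the relevant redis.conf directive (the 5000ms value is the one quoted above):

```
# redis.conf excerpt matching the value above. A node that does not answer
# cluster-bus pings within this window is suspected (PFAIL) by its peers,
# which can surface transiently inconsistent CLUSTER NODES replies.
cluster-enabled yes
cluster-node-timeout 5000
```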

Here is what I suspect could have happened for us:

Note: we unfortunately lost the metrics and logs for the affected VM, so there is some inference involved here.


In any case, I think the lesson here is to sample more master nodes, while being mindful of the trade-off, since users could see a spike in Redis command traffic during deployments.
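To illustrate the trade-off (a sketch only; the node URLs are hypothetical placeholders, and the gem is assumed to be loaded already, e.g. via Bundler):

```ruby
# Sample all 3 primaries at startup instead of 1, trading extra
# CLUSTER NODES / CLUSTER SLOTS queries at boot for a more reliable
# initial view of the topology. In practice this is usually exported
# in the process environment, since the gem may read it at load time.
ENV['REDIS_CLIENT_MAX_STARTUP_SAMPLE'] = '3'

cli = RedisClient.cluster(
  nodes: %w[redis://node1:6379 redis://node2:6379 redis://node3:6379] # hypothetical hosts
).new_client
cli.call('PING')
```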

KJTsanaktsidis commented 2 months ago

https://github.com/redis-rb/redis-cluster-client/blob/1967399b53bfb0c5647c2102bd5cc103a618ad14/lib/redis_client/cluster/router.rb#L200-L208

I think this is the wrong place for this retry to happen; the retry actually needs to happen one level up, in assign_node:

https://github.com/redis-rb/redis-cluster-client/blob/1967399b53bfb0c5647c2102bd5cc103a618ad14/lib/redis_client/cluster/router.rb#L159-L162

That way, after refreshing the cluster topology with @node.reload! (which actually asks the randomly-selected primaries for CLUSTER NODES and CLUSTER SLOTS), we re-run the slot -> node assignment so that the client doesn't keep trying to send commands to the dead node.
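A hedged sketch of that suggestion (my paraphrase, not the patch that was eventually merged):

```ruby
# Move the retry one level up: after @node.reload! refreshes the topology,
# recompute the slot -> node assignment instead of reusing a stale mapping.
def assign_node(command, retry_count: 3)
  node_key = find_node_key(command) # slot -> node lookup from @slots
  find_node(node_key, retry_count: 0) # fail fast; the retry lives here now
rescue ::RedisClient::Cluster::NodeMightBeDown
  raise if retry_count <= 0

  retry_count -= 1
  @node.reload! # asks randomly-selected primaries for CLUSTER NODES / CLUSTER SLOTS
  retry
end
```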

supercaracal commented 2 months ago

Thank you for your report. That's definitely right. I think it's a bug. I'll fix it later.

supercaracal commented 2 months ago

@slai11 I've fixed the behavior related to this issue. One fix is a mitigation for the frequency of queries with the CLUSTER NODES command, and the other is an enhancement to the recoverability from a cluster-down state.

slai11 commented 2 months ago

@supercaracal thank you for the improvements! I think this issue can be closed for now (I'm not sure what your workflow for that is).

It is hard to reproduce the events of the incident in isolation to validate the fixes, but I'll update here if we encounter it again 👍

supercaracal commented 2 months ago

Feel free to reopen this issue if it happens again.