Could there be a form of server response that could lead to an incomplete client/slot map?
I think the setting of REDIS_CLIENT_MAX_STARTUP_SAMPLE=1 has a trade-off between the load on the servers and the reliability of the cluster state information. When a node temporarily has a stale state, our client may end up in a mess because of that node. So I set the environment variable to a larger value in our test cases that involve unhealthy states.
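For example (a minimal sketch, assuming the variable is read before the cluster client is built; the node URLs are placeholders and the require path is assumed from the gem name):

```ruby
# Sketch only: sampling more nodes at startup trades extra CLUSTER NODES
# traffic for a lower chance of trusting a single node with a stale view.
ENV['REDIS_CLIENT_MAX_STARTUP_SAMPLE'] = '3' # e.g. sample 3 nodes at startup instead of 1

require 'redis_cluster_client' # require path assumed; adjust to how your app loads the gem

# Assumption: the client is constructed after the variable is set.
client = RedisClient.cluster(
  nodes: %w[redis://node1:6379 redis://node2:6379 redis://node3:6379]
).new_client
client.call('PING')
```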
To the maintainers, in your experience developing this client library, was this behaviour (a small percentage of NodeMightBeDown with a seemingly healthy cluster) something that could have happened?
I have experienced our CI being flaky when nodes returned inconsistent responses to the CLUSTER NODES command, but it almost always occurred during the start-up phase.
Is there a possibility that something was happening on your cluster bus even though the state was healthy? Or there may be a bug in our client; I'll look into it later. Unfortunately, since I use the redis gem v4 at work, I have no experience using the redis-cluster-client gem with a long-running cluster in a production environment.
We found that the Redis Cluster service IP is the same as one of the pod IPs (the lead node). When the request was sent to this pod, it was actually sent to the service, which might redirect it to any of the pods.
Thanks for sharing the links. I found the first one during the initial stages of the investigation too, but it is not entirely relevant as we deploy the Redis cluster on VMs with fixed DNS names (no chance of the service/pod IP confusion).
Is there a possibility that something was happening on your cluster bus even though the state was healthy?
It could be possible. For context, the cluster's cluster-node-timeout is set to 5000ms. The actual production Redis Cluster has 3 shards, each containing 1 primary and 2 secondaries, totalling 9 VMs and redis-server processes (apologies for the discrepancy above, I wanted to keep the issue description simpler). We observed that for 8 of the 9 redis-server processes, the rate of pings received and pongs sent dropped by ~12%, i.e. 1/9. More details in https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/3714#note_2040420223 if you are curious.
Here is what I suspect could have happened for us:

1. During RedisClient::Cluster initialisation, there was a 1/3 chance of sending CLUSTER NODES to the affected master redis-server process, which returned a complete, non-error response but with an unknown number of fail? flags.
2. Those rows get ignored when parsing the CLUSTER NODES response (see the sketch after this list).
3. @slots would be incomplete.
4. The @node.reload! retries could then keep sampling the same node and the same stale CLUSTER NODES state.

Note: we lost the metrics and logs for the affected VM unfortunately, so there is some inference to be done here.
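To illustrate steps 1-3, here is a rough, self-contained sketch (not the gem's actual parser; the reply text and addresses are made up) of how dropping fail? rows while building a slot map leaves part of the keyspace unmapped even though the reply itself is complete and non-error:

```ruby
# Illustrative only, not redis-cluster-client's real implementation.
# A complete, non-error CLUSTER NODES reply in which one primary is flagged fail?.
reply = <<~NODES
  aaa 10.0.0.1:6379@16379 master - 0 0 1 connected 0-5460
  bbb 10.0.0.2:6379@16379 master,fail? - 0 0 2 connected 5461-10922
  ccc 10.0.0.3:6379@16379 master - 0 0 3 connected 10923-16383
NODES

slot_map = {}
reply.each_line do |line|
  _id, addr, flags, _master, _ping, _pong, _epoch, _state, *slot_ranges = line.split
  next if flags.split(',').include?('fail?') # the suspect row gets skipped

  slot_ranges.each do |range|
    first, last = range.split('-').map(&:to_i)
    (first..(last || first)).each { |slot| slot_map[slot] = addr }
  end
end

puts slot_map.size # => 10922 of 16384 slots; 5461-10922 were never mapped
```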
In any case, I think the lesson here is to sample more master nodes and be mindful of the trade-off since users could see a spike in Redis command traffic during deployments.
I think this is the wrong place for this retry to happen. The retry actually needs to happen one level up, in assign_node, so that after refreshing the cluster topology with @node.reload! (which asks the randomly selected primaries for CLUSTER NODES and CLUSTER SLOTS), we re-run the slot -> node assignment and stop trying to send commands to the dead node.
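Roughly this control flow (a hypothetical sketch, not the gem's actual code: find_node, assign_node and @node.reload! are the names used above, rebuild_slot_assignment! is a made-up placeholder, and the error class is assumed to be the RedisClient::Cluster counterpart of the Redis::Cluster::NodeMightBeDown named in the issue):

```ruby
# Hypothetical sketch: retry at the assign_node level so each retry both
# reloads the topology and rebuilds the slot -> node assignment before
# looking the node up again.
MAX_ATTEMPTS = 3

def assign_node(command)
  attempts = 0
  begin
    find_node(command) # assumed to raise NodeMightBeDown if the slot has no known client
  rescue ::RedisClient::Cluster::NodeMightBeDown
    attempts += 1
    raise if attempts >= MAX_ATTEMPTS

    @node.reload!            # refresh CLUSTER NODES / CLUSTER SLOTS from sampled primaries
    rebuild_slot_assignment! # placeholder for re-running the slot -> node assignment
    retry
  end
end
```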
Thank you for your report. That's definitely right. I think it's a bug. I'll fix it later.
@slai11 I've fixed the behavior related to this issue. One fix mitigates how frequently queries with the CLUSTER NODES command are issued, and the other improves recoverability from a cluster-down state.
@supercaracal thank you for the improvements! I think this issue can be closed for now (I'm not sure what your workflow is for that).
It is hard to reproduce the events of the incident separately to validate the fixes. But I'll update if we do encounter it again 👍
Feel free to reopen this issue if it happens again.
Issue
A small but non-trivial percentage of Redis::Cluster::NodeMightBeDown is seen on some of my Sidekiq jobs. I understand that this error is raised in the find_node method after 3 retries, where @node.reload! is called on each retry in an attempt to fix the @topology.clients and @slots hashes.

Setup
For context, the affected Redis server is a 3-shard Redis Cluster. But looking at the observability metrics, we are fairly confident that the cluster state was healthy during the incident window (~2 hours). If the cluster state were unhealthy, the impact would have been much more severe.
We also configure REDIS_CLIENT_MAX_STARTUP_SAMPLE=1.

Part of the stack trace:
I'm running on the following gem versions:
Investigation details
We observed an increase in incoming new TCP connections on 1 of the 3 VMs containing a master redis-server process. This would match the 3 @node.reload! retries, each of which would open a new connection to call CLUSTER NODES and close it thereafter.

I've ruled out server-side network issues since redis-client would raise a ConnectionError when that happens. I verified this while attempting to reproduce the problem locally with a Redis Cluster setup configured with very low maxclients and tcp-backlog values; I ended up with RedisClient::Cluster::InitialSetupError when trying to reload the nodes.
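That attempt boiled down to roughly the following (a sketch only; the node URLs are placeholders, the require path is assumed from the gem name, and the error classes are the ones named above):

```ruby
# Sketch of the local reproduction attempt. Assumes a local Redis Cluster on
# ports 7000-7005 whose redis.conf sets very low maxclients and tcp-backlog.
require 'redis_cluster_client' # require path assumed; adjust to how your app loads the gem

nodes = (7000..7005).map { |port| "redis://127.0.0.1:#{port}" }

begin
  client = RedisClient.cluster(nodes: nodes).new_client
  client.call('PING')
rescue RedisClient::ConnectionError, RedisClient::Cluster::InitialSetupError => e
  # Exhausting maxclients / tcp-backlog surfaced these errors,
  # never NodeMightBeDown with an incomplete slot map.
  puts e.class
end
```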
I've been unable to reproduce this behaviour locally (will update when I do). The client suggests that the server is down, but the server seems fine. Could there be a form of server response that could lead to an incomplete client/slot map?
To the maintainers, in your experience developing this client library, was this behaviour (a small percentage of NodeMightBeDown with a seemingly healthy cluster) something that could have happened?

Linking issue for reference: https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/3715