redis / redis-py

Redis Python client

RedisCluster client failing to reconnect to AWS Elasticache cluster after node failover #3284

Open avnandu opened 2 weeks ago

avnandu commented 2 weeks ago

Version: redis-py version 5.0.4

Platform: Python 3.9

Description: After node failovers, our RedisCluster clients sometimes fail to reconnect to the cluster. When we run a test failover (failing over a single primary node once), clients usually reconnect after a short period of ConnectionError/TimeoutError exceptions. However, when we perform a node type upgrade, or any other action that causes every node in the cluster to fail over, the clients persistently fail to reconnect and raise RedisClusterException: Redis Cluster cannot be connected. Please provide at least one reachable node: <None, or some IP, Timeout connecting to server>. We also see persistent TimeoutError exceptions when this happens.

Our ElastiCache Redis cluster instances run Redis engine 6.x. The client is initialized only once, so we are not creating new clients or connections for each Redis command. We are wondering whether we have misconfigured the client in some way. Here is an example of how we initialize it:

        args.update(
            host=self.host,  # ElastiCache endpoint
            port=self.port,  # default Redis port
            read_from_replicas=True,
            retry=self.retry,  # Retry(backoff=FullJitterBackoff(), retries=7)
            ssl=self.ssl,
            ssl_cert_reqs=None,
        )
        return RedisCluster(**args)

Are there any other params to the client that we need to include? I was wondering if dynamic_startup_nodes could have something to do with the issue.
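
For reference, here is a minimal, self-contained sketch of the variant in question: the same settings as above with dynamic_startup_nodes=False added. As far as the redis-py documentation describes it, dynamic_startup_nodes defaults to True, in which case the startup nodes are replaced by the nodes discovered from the cluster; setting it to False keeps the originally supplied endpoint as a startup node. Whether this actually resolves the reconnect failure is untested here. The helper names cluster_kwargs and make_client and the hostname in the usage note are hypothetical, used only for illustration.

```python
from typing import Any, Dict


def cluster_kwargs(host: str, port: int = 6379, use_ssl: bool = True) -> Dict[str, Any]:
    """Build the connection settings only; no network access happens here."""
    return {
        "host": host,                    # ElastiCache endpoint
        "port": port,                    # default Redis port
        "read_from_replicas": True,
        "ssl": use_ssl,
        "ssl_cert_reqs": None,
        # Assumption under test: keep the configured endpoint as a startup
        # node instead of replacing it with the discovered node addresses.
        "dynamic_startup_nodes": False,
    }


def make_client(host: str, port: int = 6379, use_ssl: bool = True):
    """Create the RedisCluster client (connects to the cluster on construction)."""
    # Imported here so cluster_kwargs can be inspected without redis-py installed.
    from redis.backoff import FullJitterBackoff
    from redis.cluster import RedisCluster
    from redis.retry import Retry

    retry = Retry(backoff=FullJitterBackoff(), retries=7)
    return RedisCluster(retry=retry, **cluster_kwargs(host, port, use_ssl))
```

Usage would be a single call at startup, for example make_client("my-cluster.example.cache.amazonaws.com") with a placeholder hostname, and the returned client reused for all commands.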