redis / redis-py

RedisCluster becomes unrecoverable if all nodes timeout #3221

Open kuza55 opened 2 months ago

kuza55 commented 2 months ago

Version: 5.1.2

Platform: Ubuntu 22.04

Description:

RedisCluster becomes unrecoverable and crashes if all of its nodes time out at the same time. This is particularly likely if your RedisCluster has only one node.

The crash that happens is:

Traceback (most recent call last):
  File "/app/lib/python3.11/site-packages/opentelemetry/trace/__init__.py", line 573, in use_span
    yield span
  File "/app/lib/python3.11/site-packages/opentelemetry/sdk/trace/__init__.py", line 1046, in start_as_current_span
    yield span
  File "/app/lib/python3.11/site-packages/opentelemetry/instrumentation/redis/__init__.py", line 263, in _async_traced_execute_command
    response = await func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/lib/python3.11/site-packages/redis/asyncio/cluster.py", line 721, in execute_command
    await self.initialize()
  File "/app/lib/python3.11/site-packages/redis/asyncio/cluster.py", line 419, in initialize
    await self.nodes_manager.initialize()
  File "/app/lib/python3.11/site-packages/redis/asyncio/cluster.py", line 1347, in initialize
    raise RedisClusterException(
redis.exceptions.RedisClusterException: Redis Cluster cannot be connected. Please provide at least one reachable node: None

I think this is because of the line linked here, where the node is removed on the expectation that we will connect to another node and recover the cluster instances from there. If every node has been removed, there is nothing left to connect to: https://github.com/redis/redis-py/blob/07fc339b4a4088c1ff052527685ebdde43dfc4be/redis/asyncio/cluster.py#L806
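
To make the suspected mechanism concrete, here is a toy model of it. This is not redis-py's code: `ToyNodesManager`, `ToyClusterError`, and the `reachable` set are made up for illustration. The model only assumes what is described above, namely that a node is forgotten after it times out and that initialization needs at least one remembered node.

```python
class ToyClusterError(Exception):
    """Stand-in for RedisClusterException in this toy model."""


class ToyNodesManager:
    """Hypothetical, simplified model of the client's node bookkeeping."""

    def __init__(self, startup_nodes):
        # Nodes the client will try when (re)discovering the cluster.
        self.startup_nodes = set(startup_nodes)

    def on_timeout(self, node):
        # Assumed behaviour: forget the node that timed out and rely on
        # some other remembered node to rediscover the cluster later.
        self.startup_nodes.discard(node)

    def initialize(self, reachable):
        # Try every remembered node; succeed on the first reachable one.
        for node in self.startup_nodes:
            if node in reachable:
                return node
        raise ToyClusterError("Please provide at least one reachable node")


manager = ToyNodesManager(["10.0.0.1:6379"])  # single-node cluster
manager.on_timeout("10.0.0.1:6379")           # the only node times out

# The node is healthy again, but the manager no longer remembers it,
# so every later initialize() call fails the same way.
try:
    manager.initialize(reachable={"10.0.0.1:6379"})
    recovered = True
except ToyClusterError:
    recovered = False

print(recovered)  # False
```

In this model the failure is permanent because nothing ever adds the original address back to `startup_nodes`.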

This bug seems similar to, but distinct from, https://github.com/redis/redis-py/issues/3130

Also seems related to https://github.com/redis/redis-py/issues/2472

julianogv commented 2 months ago

I'm having the same issue here.

A simple way to reproduce it is to connect to a Redis cluster over the internet (AWS ElastiCache, for example), turn your Wi-Fi/Ethernet off, and then turn it back on. The error never goes away: RedisClusterException keeps being raised in an infinite loop.
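
Until this is fixed, one possible stopgap is to throw away the stuck client and build a new one from the original connection settings. The sketch below is only an illustration of that idea, not something from redis-py: `RebuildingClient`, `make_client`, and `FakeClient` are hypothetical names, and the fake client stands in for a real cluster so the example runs without a network.

```python
import asyncio


class RebuildingClient:
    """Hypothetical wrapper: recreate the client when it fails with rebuild_on."""

    def __init__(self, make_client, rebuild_on):
        self._make_client = make_client  # zero-argument factory for a new client
        self._rebuild_on = rebuild_on    # exception type(s) that trigger a rebuild
        self._client = make_client()

    async def call(self, command, attempts=2):
        for attempt in range(attempts):
            try:
                return await command(self._client)
            except self._rebuild_on:
                if attempt == attempts - 1:
                    raise
                # Discard the broken client and start over from the
                # original seed configuration.
                self._client = self._make_client()


class FakeClient:
    """Stand-in for a cluster client: the first instance is permanently broken."""

    created = 0

    def __init__(self):
        FakeClient.created += 1
        self.broken = FakeClient.created == 1

    async def get(self, key):
        if self.broken:
            raise ConnectionError("cannot be connected")
        return f"value-of-{key}"


wrapper = RebuildingClient(FakeClient, rebuild_on=ConnectionError)
result = asyncio.run(wrapper.call(lambda client: client.get("k")))
print(result)  # value-of-k
```

With redis-py, `make_client` would presumably construct a new `RedisCluster` with the same arguments as the original one, and `rebuild_on` would be `RedisClusterException`. That combination is an assumption and has not been verified against a real cluster; a real version would likely also need to close the discarded client.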