redis / redis-py

Redis Python client
MIT License
12.56k stars 2.51k forks source link

Single node cluster: Failed node is not retried again and the process is stuck unable to connect #2555

Open noorul opened 1 year ago

noorul commented 1 year ago

Version: What redis-py and what redis version is the issue happening on?

redis-py version is 4.4.0

Platform: What platform / version? (For example Python 3.5.1 on Windows 7 / Ubuntu 15.10 / Azure)

$ python --version
Python 3.8.13
$ cat /etc/lsb-release 
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04.6 LTS"

Description: Description of your issue, stack traces from errors and code that reproduces the issue

I have the following code in the health check API for the application

@app.get("/healthz")
async def healthz():
    try:
        redis = await Redis().client
        future = await redis.ping()
        assert future is True
    except Exception:
        logger.exception("Failed to perform redis health check")
        raise

The following class is the Singleton wrapper for RedisCluster

class Redis(metaclass=Singleton):
    def __init__(self):
        self._session = None

    @async_property
    async def client(self) -> Union[Awaitable[_Redis], Awaitable[RedisCluster]]:
        if self._session:
            logger.debug("Returning cached redis client")
            return self._session

        assert len(settings.redis_host) > 0 or len(settings.redis_cluster) > 0

        if settings.redis_host:  # pragma: no cover
            self._session = await _Redis.from_url(
                f"redis://{settings.redis_host}",
                password=settings.redis_password or None,
                minsize=2,
                maxsize=5,
                encoding="utf-8",
                decode_responses=True,
            )
            logger.debug("Returning new redis client")
        elif settings.redis_cluster:
            self._session = await RedisCluster(
                startup_nodes=[ClusterNode(*node) for node in settings.redis_cluster],
                password=settings.redis_password or None,
                encoding="utf-8",
                decode_responses=True,
            )
            logger.debug("Returning new redis cluster client")
        else:  # pragma: no cover
            raise ValueError

        return self._session

Kubernetes liveness probe calls /healthz API and fails after some time and logs have the following traceback.

    future = await redis.ping()
  File \"/.venv/lib/python3.8/site-packages/redis/asyncio/cluster.py\", line 662, in execute_command
    await self.initialize()
  File \"/.venv/lib/python3.8/site-packages/redis/asyncio/cluster.py\", line 369, in initialize
    await self.nodes_manager.initialize()
  File \"/.venv/lib/python3.8/site-packages/redis/asyncio/cluster.py\", line 1279, in initialize
    raise RedisClusterException(
redis.exceptions.RedisClusterException: Redis Cluster cannot be connected. Please provide at least one reachable node: None", "@timestamp": "2023-01-18T13:25:01.047106Z"}

After some debugging, I found that the code https://github.com/redis/redis-py/blob/v4.4.0/redis/asyncio/cluster.py#L744 is getting executed. Once the failed node is removed then that thread never recovers.

noorul commented 1 year ago

Here is the actual exception which is causing the node to be removed

{"event": "2023-01-19 15:31:10 [error    ] Failed to perform redis health check request_id=dcca0f39-d43c-4b76-9177-4befa4af2ea0\nTraceback (most recent call last):\n  File \"/app/main.py\", line 90, in healthz\n    future = await redis.ping()\n  File \"/.venv/lib/python3.8/site-packages/redis/asyncio/cluster.py\", line 663, in execute_command\n    await self.initialize()\n  File \"/.venv/lib/python3.8/site-packages/redis/asyncio/cluster.py\", line 370, in initialize\n    await self.nodes_manager.initialize()\n  File \"/.venv/lib/python3.8/site-packages/redis/asyncio/cluster.py\", line 1281, in initialize\n    raise RedisClusterException(\nredis.exceptions.RedisClusterException: Redis Cluster cannot be connected. Please provide at least one reachable node: None", "request_id": "dcca0f39-d43c-4b76-9177-4befa4af2ea0", "logger": "app.main", "level": "error", "process": 43, "@timestamp": "2023-01-19T15:31:10.209442Z"}
{"event": "10.2.91.1:48188 - \"GET /healthz HTTP/1.1\" 500", "request_id": "dcca0f39-d43c-4b76-9177-4befa4af2ea0", "logger": "uvicorn.access", "level": "info", "process": 43, "@timestamp": "2023-01-19T15:31:10.210536Z"}
Traceback (most recent call last):
  File "/.venv/lib/python3.8/site-packages/redis/asyncio/cluster.py", line 737, in _execute_command
    return await target_node.execute_command(*args, **kwargs)
  File "/.venv/lib/python3.8/site-packages/redis/asyncio/cluster.py", line 1009, in execute_command
    return await self.parse_response(connection, args[0], **kwargs)
  File "/.venv/lib/python3.8/site-packages/redis/asyncio/cluster.py", line 985, in parse_response
    response = await connection.read_response()
  File "/.venv/lib/python3.8/site-packages/redis/asyncio/connection.py", line 800, in read_response
    response = await self._parser.read_response(
  File "/.venv/lib/python3.8/site-packages/redis/asyncio/connection.py", line 373, in read_response
    await self.read_from_socket()
  File "/.venv/lib/python3.8/site-packages/redis/asyncio/connection.py", line 358, in read_from_socket
    raise ConnectionError(SERVER_CLOSED_CONNECTION_ERROR) from None
redis.exceptions.ConnectionError: Connection closed by server.
noorul commented 1 year ago

This is actually causing problems in production. Can someone please confirm this issue?

Artimi commented 1 year ago

Hi, I'm having the same issue

$ pip freeze | grep redis
redis==4.4.2
$ python --version
Python 3.9.15
$ cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
...
Traceback (most recent call last):
  File "/opt/venv/lib/python3.9/site-packages/redis/asyncio/cluster.py", line 373, in initialize
    await self.nodes_manager.initialize()
  File "/opt/venv/lib/python3.9/site-packages/redis/asyncio/cluster.py", line 1295, in initialize
    raise RedisClusterException(
redis.exceptions.RedisClusterException: Redis Cluster cannot be connected. Please provide at least one reachable node: None

This happens after I get

Traceback (most recent call last):
  File "/opt/venv/lib/python3.9/site-packages/redis/asyncio/cluster.py", line 744, in _execute_command
    return await target_node.execute_command(*args, **kwargs)
  File "/opt/venv/lib/python3.9/site-packages/redis/asyncio/cluster.py", line 1018, in execute_command
    return await self.parse_response(connection, args[0], **kwargs)
  File "/opt/venv/lib/python3.9/site-packages/redis/asyncio/cluster.py", line 994, in parse_response
    response = await connection.read_response()
  File "/opt/venv/lib/python3.9/site-packages/redis/asyncio/connection.py", line 829, in read_response
    response = await self._parser.read_response(
  File "/opt/venv/lib/python3.9/site-packages/redis/asyncio/connection.py", line 256, in read_response
    response = await self._read_response(disable_decoding=disable_decoding)
  File "/opt/venv/lib/python3.9/site-packages/redis/asyncio/connection.py", line 266, in _read_response
    raw = await self._readline()
  File "/opt/venv/lib/python3.9/site-packages/redis/asyncio/connection.py", line 343, in _readline
    raise ConnectionError(SERVER_CLOSED_CONNECTION_ERROR)
redis.exceptions.ConnectionError: Connection closed by server.

several times. Unfortunately I cannot provide minimal example as this is part of our test suit. I don't get why I get the ConnectionError in the first place, because there should be no problems connecting to the server. When I try it in Python REPL or use redis-cli I can connect without problems. I tried to use add condition to https://github.com/redis/redis-py/blob/v4.4.0/redis/asyncio/cluster.py#L744:

                if len(self.nodes_manager.startup_nodes) > 1:
                    self.nodes_manager.startup_nodes.pop(target_node.name, None)

but now it fails also on

Traceback (most recent call last):
  File "/opt/venv/lib/python3.9/site-packages/redis/asyncio/cluster.py", line 373, in initialize
    await self.nodes_manager.initialize()
GeneratorExit

EDIT: I checked the redis debug logs and it seems that client creates 5 connections and then closes them. No log about server closing the connection as would the ConnectionError suggests.

8:M 24 Jan 2023 07:29:40.461 - Accepted 172.23.0.3:50644
8:M 24 Jan 2023 07:29:40.473 - Accepted 172.23.0.3:50646
8:M 24 Jan 2023 07:29:40.473 - Accepted 172.23.0.3:50658
8:M 24 Jan 2023 07:29:40.575 - Accepted 172.23.0.3:50668
8:M 24 Jan 2023 07:29:40.677 - Accepted 172.23.0.3:50678
8:M 24 Jan 2023 07:29:41.635 - DB 0: 3 keys (0 volatile) in 4 slots HT.
8:M 24 Jan 2023 07:29:41.635 . 5 clients connected (0 replicas), 1561560 bytes in use
8:M 24 Jan 2023 07:29:41.719 - Client closed connection
8:M 24 Jan 2023 07:29:41.719 - Client closed connection
8:M 24 Jan 2023 07:29:41.719 - Client closed connection
8:M 24 Jan 2023 07:29:41.719 - Client closed connection
8:M 24 Jan 2023 07:29:41.719 - Client closed connection
noorul commented 1 year ago

In my case also, if I manually access Redis, it works.

Artimi commented 1 year ago

FYI I was trying this with the newest redis version 4.5.1 and the issue persists.

github-actions[bot] commented 7 months ago

This issue is marked stale. It will be closed in 30 days if it is not updated.

noorul commented 7 months ago

This issue is still relevant

harshit98 commented 2 months ago

This is still happening in the production. Is there any update on it ? I'm also facing similar issues - @noorul

noorul commented 2 months ago

I am yet to migrate :(. I am not sure how to get attention from maintainers.