jazzband / django-redis

Full featured redis cache backend for Django.

Does not work with Redis Cluster. #606

Open helgi-reon opened 2 years ago

helgi-reon commented 2 years ago

Problem Statement

Use django-redis with redis clustering.

An error occurs when interacting with the cache. The error points to a pickle operation on an instance of the ConnectionPool class, where one of its attributes, a thread lock, cannot be serialized, resulting in the following error:

   TypeError: cannot pickle '_thread.lock' object
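The root cause can be reproduced with the standard library alone: redis-py's cluster client deep-copies its kwargs during initialization, and a deepcopy of anything holding a thread lock (as a ConnectionPool does) fails in exactly this way. A minimal sketch:

```python
import copy
import threading

# A dict standing in for the kwargs that redis-py's cluster client
# deep-copies; the lock plays the role of the ConnectionPool's
# internal _thread.lock attribute.
kwargs = {"connection_pool_stand_in": threading.Lock()}

try:
    copy.deepcopy(kwargs)
except TypeError as exc:
    print(exc)  # → cannot pickle '_thread.lock' object
```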

To Reproduce

Steps to reproduce the behavior:

  1. Run a redis cluster that REDIS_URL points to.
  2. Setup CACHES in Django settings file.
    CACHES = {
        "default": {
            "BACKEND": "django_redis.cache.RedisCache",
            "LOCATION": REDIS_URL,
            "OPTIONS": {
                "REDIS_CLIENT_CLASS": "redis.cluster.RedisCluster",
                "REDIS_CLIENT_KWARGS": {
                    "url": REDIS_URL,
                },
            }
        }
    }
  3. Run the Django console and try to interact with the cache e.g. cache.get("somekey")

Expected behavior

The value of the key from the Redis cluster.

Stack trace

Traceback (most recent call last):
  File "python3.9/site-packages/redis/cluster.py", line 1454, in initialize
    copy_kwargs = copy.deepcopy(kwargs)
  File "python3.9/copy.py", line 146, in deepcopy
    y = copier(x, memo)
  File "python3.9/copy.py", line 230, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "python3.9/copy.py", line 172, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "python3.9/copy.py", line 270, in _reconstruct
    state = deepcopy(state, memo)
  File "python3.9/copy.py", line 146, in deepcopy
    y = copier(x, memo)
  File "python3.9/copy.py", line 230, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "python3.9/copy.py", line 161, in deepcopy
    rv = reductor(4)
TypeError: cannot pickle '_thread.lock' object
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "python3.9/site-packages/IPython/core/interactiveshell.py", line 3552, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-6-48456cec0da8>", line 1, in <cell line: 1>
    from django.core.cache import cache; cache.get("bingo")
  File "python3.9/site-packages/django_redis/cache.py", line 91, in get
    value = self._get(key, default, version, client)
  File "python3.9/site-packages/django_redis/cache.py", line 31, in _decorator
    return method(self, *args, **kwargs)
  File "python3.9/site-packages/django_redis/cache.py", line 98, in _get
    return self.client.get(key, default=default, version=version, client=client)
  File "python3.9/site-packages/django_redis/client/default.py", line 253, in get
    client = self.get_client(write=False)
  File "python3.9/site-packages/django_redis/client/default.py", line 105, in get_client
    self._clients[index] = self.connect(index)
  File "python3.9/site-packages/django_redis/client/default.py", line 118, in connect
    return self.connection_factory.connect(self._server[index])
  File "python3.9/site-packages/django_redis/pool.py", line 72, in connect
    connection = self.get_connection(params)
  File "python3.9/site-packages/django_redis/pool.py", line 92, in get_connection
    return self.redis_client_cls(
  File "python3.9/site-packages/redis/cluster.py", line 592, in __init__
    self.nodes_manager = NodesManager(
  File "python3.9/site-packages/redis/cluster.py", line 1286, in __init__
    self.initialize()
  File "python3.9/site-packages/redis/cluster.py", line 1490, in initialize
    raise RedisClusterException(
redis.exceptions.RedisClusterException: ERROR sending "cluster slots" command to redis server 127.0.0.1:7000. error: cannot pickle '_thread.lock' object

Environment:

Additional Context

🚨 This is my first time using this library, so it is not unlikely that my configuration is wrong.

WisdomPill commented 2 years ago

hello @helgi-reon,

we currently do not support Redis Cluster; if you want to contribute and create a PR to support it, I would be more than happy to review it.

Otherwise I am afraid this is going into the queue of new features.

helgi-reon commented 2 years ago

Ok, good to know. I'll definitely contribute when and if I find a solution 😀

WisdomPill commented 2 years ago

You will probably need to create a different client, and figure out what the difference is when using the RedisCluster client. I would have a look at the tests in redis-py.

WisdomPill commented 2 years ago

maybe I am onto something... but could you try again? maybe this helps? redis/redis-py#2189

helgi-reon commented 2 years ago

I managed to get caching working in Django while using a Redis Cluster. It was achieved through a trial-and-error approach, so it is most likely not an optimal solution.

from redis import RedisCluster

class CustomRedisClient(RedisCluster):

    def __init__(self, url, **kwargs):
        # Bootstrap a throwaway client just to discover the cluster's nodes.
        client = RedisCluster.from_url(url)
        kwargs["startup_nodes"] = client.get_nodes()
        # django-redis passes in a connection pool, but RedisCluster manages
        # its own pools per node, so drop it (see explanation below).
        del kwargs["connection_pool"]
        super().__init__(**kwargs)
# settings.py

# REDIS_URL just needs to point to one of the nodes in the cluster e.g. redis://localhost:6379/0

CACHES = {
    "default": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": REDIS_URL,
        "OPTIONS": {
            "REDIS_CLIENT_CLASS": "common.redis_client.CustomRedisClient",
            "REDIS_CLIENT_KWARGS": {
                "url": REDIS_URL,
            },
            "PARSER_CLASS": "redis.cluster.ClusterParser",
        }
    }
}

The most interesting part is the deletion of the connection pool attribute. For some reason the connection pool created the correct number of node connections, but all with the same port. That meant that most of the time it couldn't find a cached value, as it was stored on a different node. By removing the connection pool argument, the connection pool attribute is populated correctly, i.e. with connections for each of the available ports in the cluster.

WisdomPill commented 2 years ago

nice! would you like to open a PR with cluster support?

nicootto commented 1 year ago

Hi there! I've been working on adding cluster support to our project based on @helgi-reon's approach. I figured RedisCluster should be initialized once per process to be able to have connection pooling, given that in redis-py NodesManager will initialize one connection pool per node: https://github.com/redis/redis-py/blob/6c708c2e0511364c2c3f21fa1259de05e590632d/redis/cluster.py#L1426.

This is the code I'm using for now (with settings pointing REDIS_CLIENT_CLASS at redis_cluster_factory):

import threading

from redis import RedisCluster

_lock = threading.Lock()

# _cluster is process-global, because otherwise it would be re-created every
# time redis_cluster_factory is called; Django creates a new cache client
# instance for every request.
_cluster = None

def redis_cluster_factory(url, **kwargs):
    # django-redis will pass us a connection pool in kwargs, which we don't
    # use, given that connection pools are handled per node by RedisCluster.
    global _cluster
    if _cluster is None:
        with _lock:
            if _cluster is None:
                _cluster = RedisCluster(url=url)
    return _cluster
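For context, the settings this factory pairs with would presumably look something like the sketch below; the dotted path `myapp.redis_cluster.redis_cluster_factory` is my assumption, not from the comment:

```python
# Hypothetical settings.py fragment: REDIS_CLIENT_CLASS accepts a dotted
# path, so the factory function above slots in where a client class would
# normally go. The module path "myapp.redis_cluster" is assumed.
REDIS_URL = "redis://localhost:7000/0"  # any one node in the cluster

CACHES = {
    "default": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": REDIS_URL,
        "OPTIONS": {
            "REDIS_CLIENT_CLASS": "myapp.redis_cluster.redis_cluster_factory",
            "REDIS_CLIENT_KWARGS": {"url": REDIS_URL},
        },
    }
}
```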

@WisdomPill what do you think would be the best route to contribute Redis Cluster support to the project? As far as I understood, I should:

  1. replace the redis_cluster_factory function with something like a ClusterConnectionFactory class
  2. when initializing RedisCluster, pass CONNECTION_POOL_CLASS as the connection_pool_class kwarg and spread CONNECTION_POOL_KWARGS as well.

I don't think we can support a different REDIS_CLIENT_CLASS per node, as the Redis class is hard-coded at initialization here: https://github.com/redis/redis-py/blob/6c708c2e0511364c2c3f21fa1259de05e590632d/redis/cluster.py#L1426.

Another thing is that I would like to allow users to have cluster and non-cluster caches at the same time. For this, I should move DJANGO_REDIS_CONNECTION_FACTORY into "OPTIONS" instead of taking it from global settings (giving OPTIONS.DJANGO_REDIS_CONNECTION_FACTORY precedence over the settings one would be enough).
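A per-cache override might look like the sketch below; the OPTIONS key name "CONNECTION_FACTORY" is an assumption about the proposed API, not existing django-redis behavior:

```python
# Hypothetical: each cache picks its own connection factory via OPTIONS,
# falling back to the global DJANGO_REDIS_CONNECTION_FACTORY setting when
# the key is absent. "myapp.pool.ClusterConnectionFactory" is a made-up path.
CACHES = {
    "default": {  # plain (non-cluster) Redis, stock factory
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": "redis://localhost:6379/0",
    },
    "cluster": {  # cluster cache opting into a cluster-aware factory
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": "redis://localhost:7000/0",
        "OPTIONS": {
            "CONNECTION_FACTORY": "myapp.pool.ClusterConnectionFactory",
        },
    },
}
```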

sondrelg commented 1 year ago

We tried out the implementations above and ran into issues when one of our Redis nodes was replaced. We're operating a Redis cluster in Kubernetes, so we expect this to happen regularly as pods are moved between nodes.

Have you run into this, and do you have any thoughts on how to add failover support to the backend?

WisdomPill commented 1 year ago

@nicootto I would add a subclass of DefaultClient and maybe create a decorator or flag to query the primary, replicas, or a specific node. ClusterClient would use a different connection factory.

jonprindiville commented 1 year ago

Hey friends, I've been picking around the corners of this as well while trying to migrate off of redis-py-cluster to instead use the redis.cluster.RedisCluster stuff that landed in redis-py...

I've been stealing/consolidating/learning from some of the ideas here, my current thoughts are over in this gist: https://gist.github.com/jonprindiville/f97084ca8f91501c17175a7b7a9578af

There's a connection factory there, because the way the base django_redis.pool.ConnectionFactory behaves doesn't mesh with redis.cluster.RedisCluster -- they both want to manage their own connection pools. Since we're doing this specifically for RedisCluster, we can set that as the default redis client class and save some config.

The base django_redis.client.default.DefaultClient mostly behaves fine for this application, but I have a client that overrides its choice of connection factory with our own so we don't have to fiddle with the global DJANGO_REDIS_CONNECTION_FACTORY setting. I think this is what @WisdomPill was getting at in that comment?

(Aside: I do think that pluggable connection factory would be neat, as suggested by @nicootto, but maybe that's a separate feature, idk)

Given this kind of a setup, my Django settings would be like the following:

CACHES = {
    'default': {
        'BACKEND': 'django_redis.cache.RedisCache',
        'LOCATION': '...',
        'OPTIONS': {
            'CLIENT_CLASS': 'cluster_client.DjangoRedisClusterClient',
        },
    },
}

If it helps, as a starting point, I could get a PR going for that stuff. There's probably gonna be large holes that need filling, though, before that amounts to good cluster support.

Like, I'm not really sure how you'd want to go about adding cluster stuff to your CI processes or what kind of coverage to add; I'm not sure about the target_nodes stuff that I think @WisdomPill was indicating; and I feel like I'm pretty minimally exposed to advanced cluster use-cases, etc.


ran into issues when one of our Redis nodes were replaced. We're operating a redis cluster in kubernetes, so we're expecting this to happen regularly as pods are moved between nodes. [...] do you have any thoughts on how to add failover support to the backend?

@sondrelg Hm, I would wonder if that might be more of a redis-py concern.

I know that the clients there seem to have some amount of error recovery, e.g. retrying based on MOVED responses, rediscovering nodes, etc.

Perhaps tweaking some of the parameters at that level helps you recover from some types of situations? Like cluster_error_retry_attempts, retry, retry_on_timeout, retry_on_error, reinitialize_steps perhaps? I'm a bit unsure.
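As a non-authoritative sketch, those knobs could be collected into the kwargs handed to the cluster client; the values below are illustrative guesses to show where the knobs live, not recommendations:

```python
# Illustrative kwargs for redis.cluster.RedisCluster, leaning on redis-py's
# built-in recovery rather than adding failover logic at the django-redis
# layer. All values here are guesses for demonstration purposes.
CLUSTER_CLIENT_KWARGS = {
    "cluster_error_retry_attempts": 3,  # retries for cluster-level errors
    "retry_on_timeout": True,           # retry commands that time out
    "reinitialize_steps": 5,            # refresh the slot cache more often
    "socket_timeout": 3,
}
```

These would then be passed through, e.g., REDIS_CLIENT_KWARGS or directly to the client constructor, depending on your setup.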

Without understanding all of what's going on inside of redis-py I wouldn't know what to suggest WRT adding failover on top at this layer.


I don't think we can support different REDIS_CLIENT_CLASS per node, as Redis is hardcoded initialized here https://github.com/redis/redis-py/blob/6c708c2e0511364c2c3f21fa1259de05e590632d/redis/cluster.py#L1426.

@nicootto True, yes. But! I think that from the perspective of django-redis the REDIS_CLIENT_CLASS is redis.cluster.RedisCluster.

The fact that inside of redis.cluster.RedisCluster there is a dependency on using the redis.client.Redis is a bit of an implementation detail of RedisCluster, IMO. If we wanted flexibility there, that's probably a feature-request to the redis-py folks.

joeyorlando commented 10 months ago

edit: it is quite possible this could be an issue on GCP's side as the internal server error (code -1) is occurring in read_response immediately after an AUTH command here (GCP docs on this)

we're running this in production over in https://github.com/grafana/oncall against a GCP Memorystore managed Redis Cluster.

We've taken inspiration from this GitHub issue, as well as some conversations over in redis-py. Our setup mostly works; however, lately we've been seeing some rather cryptic/unexplainable exceptions popping up that I thought would be worthwhile posting here in case anyone else is seeing the same thing. The exception we see from time to time is:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/django_redis/client/default.py", line 258, in get
    value = client.get(key)
            ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/redis/commands/core.py", line 1829, in get
    return self.execute_command("GET", name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/redis/cluster.py", line 1115, in execute_command
    raise e
  File "/usr/local/lib/python3.11/site-packages/redis/cluster.py", line 1101, in execute_command
    res[node.name] = self._execute_command(node, *args, **kwargs)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/redis/cluster.py", line 1144, in _execute_command
    connection = get_connection(redis_node, *args, **kwargs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/redis/cluster.py", line 51, in get_connection
    return redis_node.connection or redis_node.connection_pool.get_connection(
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/redis/connection.py", line 1086, in get_connection
    connection.connect()
  File "/usr/local/lib/python3.11/site-packages/redis/connection.py", line 279, in connect
    self.redis_connect_func(self)
  File "/usr/local/lib/python3.11/site-packages/redis/cluster.py", line 672, in on_connect
    connection.on_connect()
  File "/usr/local/lib/python3.11/site-packages/redis/connection.py", line 342, in on_connect
    auth_response = self.read_response()
                    ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/redis/connection.py", line 524, in read_response
    raise response
redis.exceptions.ResponseError: Internal server error (code -1)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/django_redis/cache.py", line 29, in _decorator
    return method(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/django_redis/cache.py", line 99, in _get
    return self.client.get(key, default=default, version=version, client=client)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/django_redis/client/default.py", line 260, in get
    raise ConnectionInterrupted(connection=client) from e
django_redis.exceptions.ConnectionInterrupted: Redis ResponseError: Internal server error (code -1)

Our (simplified/abridged) setup is as follows:

django_redis_client.py

import threading
import typing
from copy import deepcopy

from django.core.exceptions import ImproperlyConfigured
from django_redis.client.default import DefaultClient
from django_redis.pool import ConnectionFactory
from redis.cluster import RedisCluster

class ClusterConnectionFactory(ConnectionFactory):
    """
    A connection factory compatible with `redis.cluster.RedisCluster`
    The cluster client manages connection pools internally, so we don't want to do it at this level like the base
    `ConnectionFactory` does.
    """

    # A global cache of URL->client so that within a process, we will reuse a
    # single client, and therefore a single set of connection pools.
    _clients: typing.Dict[str, RedisCluster] = {}
    _clients_lock = threading.Lock()

    def connect(self, url: str) -> RedisCluster:
        """Given a connection url, return a client instance.
        Prefer to return from our cache but if we don't yet have one build it
        to populate the cache.
        """
        if url not in self._clients:
            with self._clients_lock:
                if url not in self._clients:
                    self._clients[url] = self._connect(url)
        return self._clients[url]

    def _connect(self, url: str) -> RedisCluster:
        """
        Given a connection url, return a new client instance.
        Basic `django-redis` `ConnectionFactory` manages a cache of connection
        pools and builds a fresh client each time. Because the cluster client
        manages its own connection pools, we will instead merge the
        "connection" and "client" kwargs and throw them all at the client to
        sort out.
        If we find conflicting client and connection kwargs, we'll raise an
        error.
        """
        # Get connection and client kwargs...
        connection_params = self.make_connection_params(url)
        client_cls_kwargs = deepcopy(self.redis_client_cls_kwargs)

        # ... and smash 'em together (crashing if there's conflicts)...
        for key, value in connection_params.items():
            if key in client_cls_kwargs:
                raise ImproperlyConfigured(f"Found '{key}' in both the connection and the client kwargs")
            client_cls_kwargs[key] = value

        # ... and then build and return the client
        return self.redis_client_cls(**client_cls_kwargs)

    def disconnect(self, connection: RedisCluster):
        connection.disconnect_connection_pools()

class RedisClient(DefaultClient):
    """
    A `django-redis` client compatible with `redis.cluster.RedisCluster`.
    We don't do much different here, except for using our own `ClusterConnectionFactory`
    (the base class would instead use the value of the
    `DJANGO_REDIS_CONNECTION_FACTORY` setting, but we don't care about that setting here)

    For non-clustered Redis, `django-redis`'s `ConnectionFactory` does some management of
    connection pools shared between client instances.
    The cluster client in `redis-py` doesn't accept a connection pool from outside, they're
    managed internally. To support that, we won't be caching connection pools and passing
    them into clients, we will instead be caching client instances.

    https://github.com/jazzband/django-redis/issues/606#issuecomment-1505615249
    https://gist.github.com/jonprindiville/f97084ca8f91501c17175a7b7a9578af
    """

    def __init__(self, server, params, backend) -> None:
        super().__init__(server, params, backend)
        self.connection_factory = ClusterConnectionFactory(options=self._options)

redis_cluster_client.py

import base64
import json
import logging
import typing

import cachetools.func
import redis.cluster
import redis.credentials
from django.conf import settings
from google.cloud import iam_credentials_v1

logger = logging.getLogger(__name__)

class GCPMemoryStoreCredentialsProvider(redis.credentials.CredentialProvider):
    """
    Credentials Provider to fetch an IAM Auth token to be able to authenticate w/ GCP Memorystore.

    Inspired by the following:
    https://redis-py.readthedocs.io/en/stable/examples/connection_examples.html#Connecting-to-a-redis-instance-with-AWS-Secrets-Manager-credential-provider
    https://cloud.google.com/memorystore/docs/cluster/client-library-connection#iam_authentication_client_library_code_sample
    """

    def __init__(self) -> None:
        service_account_json = json.loads(base64.b64decode(settings.GOOGLE_APPLICATION_CREDENTIALS_JSON_BASE64))
        self.iam_client = iam_credentials_v1.IAMCredentialsClient.from_service_account_info(service_account_json)
        self.service_account_email = self.iam_client._transport._credentials.service_account_email

    @cachetools.func.ttl_cache(maxsize=128, ttl=45 * 60)  # 45mins
    def _get_iam_access_token(self) -> str:
        """
        NOTE: Access tokens expire in one hour (by default), hence the 45min cache TTL

        https://cloud.google.com/memorystore/docs/cluster/about-iam-auth#iam_access_token_time_frame
        https://cloud.google.com/memorystore/docs/cluster/manage-iam-auth#connect_to_an_instance_that_uses_iam_authentication
        """
        logger.info(
            "GCPMemoryStoreCredentialsProvider.get_credentials - "
            f"Generating IAM access token for {self.service_account_email}"
        )

        request = iam_credentials_v1.GenerateAccessTokenRequest(
            name=f"projects/-/serviceAccounts/{self.service_account_email}",
            # https://developers.google.com/identity/protocols/oauth2/scopes#redis
            scope=["https://www.googleapis.com/auth/cloud-platform"],
        )

        return self.iam_client.generate_access_token(request=request).access_token

    def get_credentials(self) -> typing.Tuple[str]:
        # The ttl_cache is attached to the method itself; decorating a helper
        # defined inside this function would create a fresh (empty) cache on
        # every call and never produce a hit.
        return (self._get_iam_access_token(),)

# NOTE: this is a global singleton
credential_provider = GCPMemoryStoreCredentialsProvider()

class RedisClusterShim(redis.cluster.RedisCluster):
    def __init__(self, *args, **kwargs):
        kwargs["credential_provider"] = credential_provider
        super().__init__(*args, **kwargs)

django settings.py

REDIS_URI = "<uri_to_our_memorystore_discovery_ip>"

CACHES = {
    "default": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": REDIS_URI,
        "OPTIONS": {
            "PARSER_CLASS": "redis.connection._HiredisParser",
            "CONNECTION_POOL_CLASS": "redis.BlockingConnectionPool",
            "CONNECTION_POOL_CLASS_KWARGS": {
                "max_connections": 50,
                "timeout": 20,
            },
            "MAX_CONNECTIONS": 1000,
            "PICKLE_VERSION": -1,
            "SOCKET_CONNECT_TIMEOUT": 3,
            "SOCKET_TIMEOUT": 3,
            # Custom redis client to handle redis downtimes and reconnections
            "CLIENT_CLASS": "django_redis_client.RedisClient",
            # NOTE: REDIS_CLIENT_CLASS is different from the CLIENT_CLASS config.. from the docs:
            #
            # django-redis uses the Redis client redis.client.StrictClient by default. It is possible to
            # use an alternative client. You can customize the client used by setting REDIS_CLIENT_CLASS in
            # the CACHES setting. Optionally, you can provide arguments to this class by setting REDIS_CLIENT_KWARGS
            #
            # https://github.com/jazzband/django-redis#pluggable-redis-client
            "REDIS_CLIENT_CLASS": "redis_cluster_client.RedisClusterShim",
            "REDIS_CLIENT_KWARGS": {
                "read_from_replicas": True,
                # the following setting is required by GCP Memorystore (and AWS ElasticCache AFAIK)
                #
                # from the docs
                # https://cloud.google.com/memorystore/docs/cluster/connect-cluster-instance#redis-py_client_best_practice
                #
                # To connect to your Memorystore for Redis Cluster instance using the redis-py Python client
                # you must add the skip_full_coverage_check=True when declaring a Redis Cluster:
                "skip_full_coverage_check": True,
                # The path to a file of concatenated CA certificates in PEM format
                # NOTE: needed because we have in-transit encryption enabled in GCP Memorystore
                # https://cloud.google.com/memorystore/docs/cluster/about-in-transit-encryption
                "ssl_ca_certs": os.environ.get("REDIS_SSL_CA_CERTS"),
            },
        },
    },
}
jonprindiville commented 10 months ago

we're running this in production over in https://github.com/grafana/oncall against a GCP Memorystore managed Redis Cluster. [...] however lately we've been seeing some rather cryptic/unexplainable exceptions popping up that I thought would be worthwhile posting here in the event that anyone else is seeing the same thing. [...] it is quite possible this could be an issue on GCP's side as the internal server error (code -1) is occurring in read_response immediately after an AUTH command

@joeyorlando Hm, I'm afraid that I don't have any relevant experiences like that to share at the moment, sorry.

I have given that code a bit of run in a staging environment, but I have not yet given it extended time in production against real traffic.

Also: In my context I'm not interacting with GCP, I'm in AWS with an ElastiCache cluster (currently compatible with Redis 5.0.6, but I think soon we will move to a version compatible with Redis 7.)

I think I may have to spend some time on this soon, though 😅 I think I'm blocked on something else until I can drop redis-py-cluster and update redis-py + django-redis.

I think it'd be relatively easy to get a PR together for this existing gist/example stuff, but the largest questions in my mind in the immediate future for django-redis to support redis-py's cluster client are WRT the django-redis test suite...

We've taken inspiration from this GitHub issue, as well as some conversations over in redis-py

Oh, can you point at any relevant redis-py conversations? I'm not actively tracking anything on that side at the moment.

joeyorlando commented 10 months ago

  • how to incorporate a Redis Cluster into the django-redis CI process?
  • do we need additional cluster-specific tests, or do we just run the existing non-cluster tests minus the multi-key or otherwise-incompatible commands?

if I can be of any help with a PR here, let me know!

On a side note, it turns out my cryptic Internal server error (code -1) exceptions mentioned above were related to Memorystore's IAM authentication. GCP support's recommendation was to "add a retry" for AUTH commands, as the failures I am seeing "fall within their 99.9% SLA" (but this happens fairly deep in the redis-py internals, so adding a retry here would likely involve forking redis-py 🙄).

Just figured I'd throw this ☝️ here in case anyone else in the community runs into the same issue.

WisdomPill commented 10 months ago

I would love to see a PR adding support for Redis Cluster. I would use the same tactic we're currently using for Sentinel (there's a script in tests/start_redis.sh), although I was trying to set everything up like it is done in redis-py, with no success, in #583.