jazzband / django-redis

Full featured redis cache backend for Django.
Other
2.83k stars 428 forks source link

Using sentinel config hangs for many seconds if one or more sentinels are down #540

Open neumantm opened 3 years ago

neumantm commented 3 years ago

Due to a poorly configured socket_timeout for the connection to the sentinels, the connection may hang for many seconds or even minutes when one ore more of the configured sentinels are down. Then the django cache hangs for the same time causing http timeouts for users. This is bad, as the purpose of using sentinels is that something can go down without the user noticing.

A workaround is setting OPTIONS["SENTINEL_KWARGS"] = {"socket_timeout": 1}.

I think we should set a better default. ( It seems the current default is somewhere around 30 seconds, which means with two offline sentinels it would be around 60 seconds of hang. According to redis-py the default is gotten from the connection_kwargs, but I do not know where that is set)

It should at least be documented.

WisdomPill commented 2 years ago

@terencehonles since you are the one more familiar with sentinel, can you give your opinion here?

Is there only a need for more documentation?

terencehonles commented 2 years ago

@neumantm do you mind creating a PR for updating the documentation if you think this is unclear?

I'm not compeletely sure if you're saying this should not be set via OPTIONS["SENTINEL_KWARGS"] = {"socket_timeout": 1} since that seems fine to me.

If you want to set the connection_kwargs that's documented in the section about the connection pool: https://github.com/jazzband/django-redis#configure-default-connection-pool (the example is "max_connections" and "retry_on_timeout", but the list is not exhaustive).

You can find all of this in the code that sets up the sentinels: https://github.com/jazzband/django-redis/blob/b1a68e05dfccf462cb5538e239ea78ba26067811/django_redis/pool.py#L148-L157

where pool_cls_kwargs comes from the superclass and is documented in the README as I linked above.

terencehonles commented 2 years ago

If you're interested it might make sense to update the code to ask a single sentinel for the info and if it hasn't responded in 500ms (or whatever makes sense for a failover) that another is asked. If the sentinel is just delayed and does return in 505ms the first response is used, but otherwise successive sentinels are tried instead of making them be queried in series (and also not in parallel, though that could also make sense if people don't care about the extra connections)

neumantm commented 2 years ago

If you're interested it might make sense to update the code to ask a single sentinel for the info and if it hasn't responded in 500ms (or whatever makes sense for a failover) that another is asked. If the sentinel is just delayed and does return in 505ms the first response is used, but otherwise successive sentinels are tried instead of making them be queried in series (and also not in parallel, though that could also make sense if people don't care about the extra connections)

That is what I think should happen. And it also does happen already, but instead of the 500ms it takes about 30 seconds to failover, which is way to long. When doing the workaround as I described it is only 1 second, which is ok. However, I cannot seem to find the code location where the about 30 seconds come from.

What is the default socket_timeout when using a minimal config (e.g. the "minimal" one from https://github.com/jazzband/django-redis#use-the-sentinel-connection-factory)?

terencehonles commented 2 years ago

What you're describing/suggesting exists is not what I am suggesting. I'm suggesting a lagged parallelism which allows overlapping connections.

You may also be interested in the "short hand" arguments and quite possibly SOCKET_CONNECT_TIMEOUT over SOCKET_TIMEOUT, but you may want both to be short (the first defaults to the other): https://github.com/jazzband/django-redis/blob/827eb4e2607189fab8f6e22c67be1068c45ca72f/django_redis/pool.py#L50-L62

As far as defaults, that's not provided by this library and doesn't seem to be provided by redis-py:redis/sentinel.py or redis-py:redis/connection.py either, so I would assume the socket timeouts are the default from Python/your OS.

phandc commented 4 days ago

@neumantm Did you find the solution for this issue? I got the same problem the connection may hang for minutes which causing 504 Gateway timeout. I am trying to find the timeout value for socket timeout with 1,5,10 OPTIONS["SENTINEL_KWARGS"] = {"socket_timeout": os.environ['socket_timeout']}.

neumantm commented 4 days ago

I think I just kept using the workaround given in the issue description with 1 second.

phandc commented 3 days ago

@neumantm thanks for your reply. Your current solution is 1 second for the socket_timeout value right? For my system, I think setting 1 second the socket_timeout is a temporary solution to avoid the service hang. I cannot find the root cause which breaks the redis connections after amount of time.