Open neumantm opened 3 years ago
@terencehonles since you are the one more familiar with sentinel, can you give your opinion here?
Is there only a need for more documentation?
@neumantm do you mind creating a PR for updating the documentation if you think this is unclear?
I'm not compeletely sure if you're saying this should not be set via OPTIONS["SENTINEL_KWARGS"] = {"socket_timeout": 1}
since that seems fine to me.
If you want to set the connection_kwargs
that's documented in the section about the connection pool: https://github.com/jazzband/django-redis#configure-default-connection-pool (the example is "max_connections" and "retry_on_timeout", but the list is not exhaustive).
You can find all of this in the code that sets up the sentinels: https://github.com/jazzband/django-redis/blob/b1a68e05dfccf462cb5538e239ea78ba26067811/django_redis/pool.py#L148-L157
where pool_cls_kwargs
comes from the superclass and is documented in the README as I linked above.
If you're interested it might make sense to update the code to ask a single sentinel for the info and if it hasn't responded in 500ms (or whatever makes sense for a failover) that another is asked. If the sentinel is just delayed and does return in 505ms the first response is used, but otherwise successive sentinels are tried instead of making them be queried in series (and also not in parallel, though that could also make sense if people don't care about the extra connections)
If you're interested it might make sense to update the code to ask a single sentinel for the info and if it hasn't responded in 500ms (or whatever makes sense for a failover) that another is asked. If the sentinel is just delayed and does return in 505ms the first response is used, but otherwise successive sentinels are tried instead of making them be queried in series (and also not in parallel, though that could also make sense if people don't care about the extra connections)
That is what I think should happen. And it also does happen already, but instead of the 500ms it takes about 30 seconds to failover, which is way to long. When doing the workaround as I described it is only 1 second, which is ok. However, I cannot seem to find the code location where the about 30 seconds come from.
What is the default socket_timeout
when using a minimal config (e.g. the "minimal"
one from https://github.com/jazzband/django-redis#use-the-sentinel-connection-factory)?
What you're describing/suggesting exists is not what I am suggesting. I'm suggesting a lagged parallelism which allows overlapping connections.
You may also be interested in the "short hand" arguments and quite possibly SOCKET_CONNECT_TIMEOUT
over SOCKET_TIMEOUT
, but you may want both to be short (the first defaults to the other):
https://github.com/jazzband/django-redis/blob/827eb4e2607189fab8f6e22c67be1068c45ca72f/django_redis/pool.py#L50-L62
As far as defaults, that's not provided by this library and doesn't seem to be provided by redis-py:redis/sentinel.py or redis-py:redis/connection.py either, so I would assume the socket timeouts are the default from Python/your OS.
@neumantm Did you find the solution for this issue? I got the same problem the connection may hang for minutes which causing 504 Gateway timeout. I am trying to find the timeout value for socket timeout with 1,5,10 OPTIONS["SENTINEL_KWARGS"] = {"socket_timeout": os.environ['socket_timeout']}.
I think I just kept using the workaround given in the issue description with 1 second.
@neumantm thanks for your reply. Your current solution is 1 second for the socket_timeout value right? For my system, I think setting 1 second the socket_timeout is a temporary solution to avoid the service hang. I cannot find the root cause which breaks the redis connections after amount of time.
Due to a poorly configured socket_timeout for the connection to the sentinels, the connection may hang for many seconds or even minutes when one ore more of the configured sentinels are down. Then the django cache hangs for the same time causing http timeouts for users. This is bad, as the purpose of using sentinels is that something can go down without the user noticing.
A workaround is setting
OPTIONS["SENTINEL_KWARGS"] = {"socket_timeout": 1}
.I think we should set a better default. ( It seems the current default is somewhere around 30 seconds, which means with two offline sentinels it would be around 60 seconds of hang. According to redis-py the default is gotten from the connection_kwargs, but I do not know where that is set)
It should at least be documented.