lamby opened this issue 7 years ago
How do you think that could be solved? If you run something like uwsgi you could use a global variable, which I think is best handled with https://docs.python.org/3/library/functools.html#functools.lru_cache
Interesting. I guess the locking should really be cross "instance".. I mean, if you have a single computer/worker then a global variable would work (despite being naturally icky..) but if you have a large number of machines/workers this would still result in one's database being hammered.
Oh, note that the naive solution of pushing it off to other projects, i.e.:

```python
@cache_memoize(100)
def expensive():
    with redis.Lock('key', …):
        return 999999 ** 999999999
```

.. won't actually DTRT :)
The whole point of `django-cache-memoize` is to use a central memoization. It already does that.
What I think you're talking about is potentially locking the function for other callers whilst processing the first caller, i.e. you could do something like this:
```python
import time

from django.core.cache import cache


def memoize_decorator(function):
    def inner(*args, **kwargs):
        cache_key = make_cache_key(*args, **kwargs)
        # Wait while some other caller is busy computing this result.
        while cache.get(cache_key + ':processing'):
            time.sleep(1)
        result = cache.get(cache_key)
        if result is None:
            cache.set(cache_key + ':processing', True, 60)
            try:
                result = function(*args, **kwargs)
            finally:
                cache.delete(cache_key + ':processing')
            cache.set(cache_key, result, TTL)
        return result
    return inner
```
Was that what you had in mind?
> is to use a central memoization
Indeed, and that is (usually) implied to be global by one's Django cache backend. However, the Django cache primitives unfortunately do not provide a locking API call. :)
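(For what it's worth, the closest generic primitive is probably `cache.add()`, which can be abused as a crude test-and-set. A minimal sketch under that assumption; the key names are illustrative, there is no ownership token, and blocking is just a busy-wait:)

```python
# Sketch only: approximating a lock with the generic Django cache API.
# cache.add() succeeds only if the key does not already exist, so it can act
# as a crude, backend-dependent test-and-set.
import time

from django.core.cache import cache


def crude_lock(lock_key, func, *args, timeout=60, **kwargs):
    # Busy-wait until we manage to create the lock key.
    while not cache.add(lock_key, "locked", timeout):
        time.sleep(0.1)
    try:
        return func(*args, **kwargs)
    finally:
        cache.delete(lock_key)
```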
> you could do something like this:
Something like that. But, alas, the `time.sleep(1)` is rather sub-optimal given you are trying to increase throughput (!), there is a race condition between the `cache.get(cache_key)` and `cache.set(cache_key + ':processing', True, 60)`, and this is somewhat problematic if `function` takes more than 60 seconds to execute :)
It's not easy. Especially when you're distributed across Python processes, but the only way it can be solved is with a central cache.
One possible solution is to be aware of how many Python nodes you have and change a setting accordingly. If you know you only have 1 node (e.g. 1 EC2 instance) you can use Python's builtin lock, and if you know you have multiple, you opt for the use of a global cache. But still, even with that there is a race-condition risk if two independent, near-simultaneous, cache-as-a-lock requests come in at the same time.
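A rough sketch of that idea, with a made-up setting name deciding between the two strategies. Note that Python's builtin lock only covers threads inside a single process, and the cache-as-a-lock branch deliberately keeps the get/set race just described:

```python
# Sketch only: choose a lock strategy based on how many nodes you believe you
# have. CACHE_MEMOIZE_SINGLE_NODE is a hypothetical setting name.
import time
import threading

from django.conf import settings
from django.core.cache import cache

_process_lock = threading.Lock()


def run_exclusively(lock_key, func, *args, **kwargs):
    if getattr(settings, "CACHE_MEMOIZE_SINGLE_NODE", False):
        # One node, one process: an in-memory lock is enough.
        with _process_lock:
            return func(*args, **kwargs)
    # Many nodes: use the shared cache as a lock. Two near-simultaneous
    # callers can both pass the get() before either set() lands; that is
    # the race-condition risk mentioned above.
    while cache.get(lock_key):
        time.sleep(0.1)
    cache.set(lock_key, True, 60)
    try:
        return func(*args, **kwargs)
    finally:
        cache.delete(lock_key)
```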
> the only way it can be solved is with a central cache.
Hm, I think we are agreeing on a topic that is orthogonal to this issue. :) I guess we need a `redis.Lock` that's a "real" semaphore.. ;)
That's interesting actually. I always/only use Redis for my Django caching so we could make it configurable. I.e. if you know you have Redis and you worry about executing a function (on cold caches) simultaneously you enable this. Wanna take a stab?
I won't be able to commit bandwidth to this soon but I think it would be fair to simply assume Redis when one wants this dog-piling prevention. Do note that — as mentioned in https://github.com/peterbe/django-cache-memoize/issues/5#issuecomment-412115632 — the naive solution does not do the right thing: a caller would need to "re-check" that the cache is warm after waiting on the lock, but we don't want to invert the logic and serialise checking of the cache in the common/initial case!
Turns out, I needed this. And since I use Redis for my Django cache backend, I could use `cache.lock()`. See this blog post: https://www.peterbe.com/plog/django-lock-decorator-with-django-redis
Basically, we could add an option to `django-cache-memoize` that does this stuff but we'll put in the docs "You have to use django-redis for this to work". Or, for people who don't use `django-redis` but have a Redis connection available, they could pass in an instance of it. E.g.
```python
@cache_memoize(100, lock=True)
def expensive_function(start, end):
    return random.randint(start, end)
```
Or
```python
import redis.client

_redis = redis.client.Redis(config)

@cache_memoize(100, lock=True, redis=_redis)
def expensive_function(start, end):
    return random.randint(start, end)
```
What do you think?
Oh. @lamby you were the one who RT'ed this on Twitter. Hi!
Hello @peterbe. :) However, as mentioned above this does not do the Right Thing, specifically in that it either serialises all access to the cache (icky, slow, etc...), or, if only the underlying calls to `expensive_function` are serialised, the naive solution will result in it being called each time by each client who has to queue, instead of the clients who had to wait being served the result generated by the first client who had a cache miss.
Good point. A "solution" is to use both `django-cache-memoize` as always and use something like that `@lock_decorator` from the blog post. Right?
```python
@lock_decorator()
@cache_memoize(100)
def expensive_function(start, end):
    return random.randint(start, end)
```
Then, if two concurrent requests come in, the one that is a microsecond later should get the cached result. In theory.
Perhaps, if my logic is sound, the resolution is to mention this pattern in the README.
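For reference, a `@lock_decorator` along the lines of the blog post could look roughly like this (a sketch, not the exact code from the post; it assumes django-redis so that `cache.lock()` is available, and the key scheme is illustrative):

```python
# Sketch only: a lock decorator in the spirit of the blog post, relying on
# django-redis exposing cache.lock().
import functools
import hashlib

from django.core.cache import cache


def lock_decorator(key_maker=None):
    def decorator(func):
        @functools.wraps(func)
        def inner(*args, **kwargs):
            if key_maker:
                key = key_maker(*args, **kwargs)
            else:
                key = str(args) + str(kwargs)
            lock_key = "lock-" + hashlib.md5(key.encode("utf-8")).hexdigest()
            # Only one caller across all processes/web heads holds this lock
            # at a time; everyone else blocks here until it is released.
            with cache.lock(lock_key):
                return func(*args, **kwargs)
        return inner
    return decorator
```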
Doesn't that still serialise all access to the cache as the `@lock_decorator()` is outside the memoize?
I don't know if I understand. Perhaps we're talking about different things.
The example just above works very well in theory. Two highly concurrent requests (even on different web heads) are heading towards the `@lock_decorator`. One of them is a smidge ahead and starts the lock and, with the lock held, it proceeds to execute the `expensive_function` and when it's done it populates the cache. Meanwhile, the second request that was a smidge behind is stuck in the `redis.Lock`, but as soon as that's been released, it continues into the `@cache_memoize` and there it can read from the cache and thus not execute `expensive_function`.
Isn't that ideal? Highly concurrent callers and only one global execution of the expensive function.
> Highly concurrent callers
Not quite, as their access to the cache is serialised via the lock; they should be able to read the cache in parallel. We just need a variation on https://en.wikipedia.org/wiki/Readers%E2%80%93writers_problem etc.
Can you demonstrate that problem?
If I have...:
```python
@lock_decorator()
@cache_memoize(100)
def expensive_function(start, end):
    print("DOING FANCY WORK")
    return random.randint(start, end)
```
And if I run two threads/processes against this (with a timer), I'd expect the output to be:

```
Thread 1: Took 2.1 seconds
Thread 2: Took 2.11 seconds
```

and on stdout I would only get one:

```
DOING FANCY WORK
```
No?
I'm not sure what your testcase is trying to show, sorry. All access to the cache is now serialised, instead of being trivially parallelisable. I.e. only one request can /check/ the cache at a time whilst "before" this was perfectly safe and legitimate.
Locking should be initiated only if the cache returns nothing, just before calculation. That way, most of the time, cache access is not serialized. The best way would be to combine this with refreshing cached data just before the cache has expired.
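A minimal sketch of that suggestion (essentially double-checked locking), assuming django-redis so that `cache.lock()` exists; the key scheme, sentinel and timeouts are illustrative:

```python
# Sketch only: check the cache without any lock, take the lock only on a miss,
# and re-check once the lock is held so waiters reuse the first caller's result.
import functools

from django.core.cache import cache

_MISS = object()


def memoize_with_lock(timeout):
    def decorator(func):
        @functools.wraps(func)
        def inner(*args, **kwargs):
            cache_key = "memoize-%s-%r-%r" % (func.__name__, args, kwargs)
            result = cache.get(cache_key, _MISS)  # fast path, never serialised
            if result is not _MISS:
                return result
            with cache.lock(cache_key + ":lock", timeout=60):
                # Another caller may have filled the cache while we waited.
                result = cache.get(cache_key, _MISS)
                if result is _MISS:
                    result = func(*args, **kwargs)
                    cache.set(cache_key, result, timeout)
            return result
        return inner
    return decorator
```

Most reads take the lock-free fast path; only concurrent cold-cache callers queue, and all but the first are then served from the cache by the re-check.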
Under very heavy load, when the cached version of that page expires or is missing, there may be sufficient concurrency that multiple workers will all attempt to evaluate the target function simultaneously. This could be "problematic".
Not sure about the best solution here, perhaps some very loose coupling with various locking mechanisms via a `path.to.lock.method`?
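One way such loose coupling could look: a dotted-path setting resolved with Django's `import_string`, where whatever it points at just has to return something usable as a context manager for a given key. Entirely hypothetical; neither the setting name nor the call signature exist in `django-cache-memoize`.

```python
# Sketch only: resolve a configurable "path.to.lock.method" at runtime.
# CACHE_MEMOIZE_LOCK is a made-up setting name.
import contextlib

from django.conf import settings
from django.utils.module_loading import import_string


def get_lock(lock_key):
    lock_factory_path = getattr(settings, "CACHE_MEMOIZE_LOCK", None)
    if lock_factory_path is None:
        # No locking configured: a do-nothing context manager.
        return contextlib.nullcontext()
    lock_factory = import_string(lock_factory_path)
    return lock_factory(lock_key)
```

A project could then point the setting at, say, a function returning a redis-py lock, while projects that don't opt in pay nothing.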