peterbe / django-cache-memoize

Django utility for a memoization decorator that uses the Django cache framework.
https://django-cache-memoize.readthedocs.io/
Mozilla Public License 2.0

"Dog-piling" prevention support #5

Open lamby opened 7 years ago

lamby commented 7 years ago

Under very heavy load, when the cached version of that page expires or is missing, there may be sufficient concurrency that multiple workers all attempt to evaluate the target function simultaneously. This could be "problematic".

Not sure about the best solution here, perhaps some very loose coupling with various locking mechanisms via a path.to.lock.method?

peterbe commented 6 years ago

How do you think that could be solved? If you run something like uwsgi you could use a global variable, which I think is best handled with https://docs.python.org/3/library/functools.html#functools.lru_cache
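
For illustration, a per-process sketch using functools.lru_cache (this only memoizes within one Python process; other workers or machines would each compute and cache their own copy, which is the limitation discussed next):

import functools

@functools.lru_cache(maxsize=None)
def expensive(n):
    # Cached only in this process's memory; a second uwsgi worker or another
    # machine would still run the computation itself.
    return n ** n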

lamby commented 6 years ago

Interesting. I guess the locking should really be cross-"instance". I mean, if you have a single computer/worker then a global variable would work (despite being naturally icky), but if you have a large number of machines/workers this would still result in one's database being hammered.

Oh, note that the naive solution of pushing it off to other projects, i.e.:

@cache_memoize(100)
def expensive():
    with redis.Lock('key', …):
        return 999999 ** 999999999

.. won't actually DTRT (do the right thing) :)

peterbe commented 6 years ago

The whole point of django-cache-memoize is to use a central memoization. It already does that.

What I think you're talking about is potentially locking the function for other callers whilst processing the first caller. Was that what you had in mind? I.e. you could do something like this:

import time

from django.core.cache import cache


def memoize_decorator(function):
    def inner(*args, **kwargs):
        cache_key = make_cache_key(*args, **kwargs)  # however the cache key gets built
        # Wait while some other caller is busy computing the value.
        while cache.get(cache_key + ':processing'):
            time.sleep(1)
        result = cache.get(cache_key)
        if result is None:
            # Flag that this caller is computing, so others wait (60s safety timeout).
            cache.set(cache_key + ':processing', True, 60)
            try:
                result = function(*args, **kwargs)
            finally:
                cache.delete(cache_key + ':processing')
            cache.set(cache_key, result, TTL)  # TTL: whatever result expiry is wanted
        return result
    return inner

Was that what you had in mind?

lamby commented 6 years ago

is to use a central memoization

Indeed, which is (usually) implied to be global by one's Django cache backend. However, the Django cache primitives unfortunately do not provide a locking API. :)

you could do something like this:

Something like that. But, alas, the time.sleep(1) is rather sub-optimal given you are trying to increase throughput (!), there is a race condition between the cache.get(cache_key) and the cache.set(cache_key + ':processing', True, 60), and it is somewhat problematic if function takes more than 60 seconds to execute :)
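
A minimal sketch of how that check-then-set race could be narrowed with Django's cache.add(), which is atomic on backends such as memcached and django-redis; make_cache_key and TTL are placeholders and this is not part of the library:

import time

from django.core.cache import cache


def memoize_decorator_with_add(function):
    def inner(*args, **kwargs):
        cache_key = make_cache_key(*args, **kwargs)  # placeholder key builder
        processing_key = cache_key + ':processing'
        while True:
            result = cache.get(cache_key)
            if result is not None:
                return result
            # cache.add() only succeeds for one caller, so there is no window
            # between checking the "processing" flag and setting it.
            # (The 60-second flag timeout caveat still applies.)
            if cache.add(processing_key, True, 60):
                break
            time.sleep(0.1)
        try:
            result = function(*args, **kwargs)
            cache.set(cache_key, result, TTL)  # TTL: placeholder result timeout
        finally:
            cache.delete(processing_key)
        return result
    return inner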

peterbe commented 6 years ago

It's not easy, especially when you're distributed across Python processes, but the only way it can be solved is with a central cache.

One possible solution is to be aware of how many Python nodes you have and change a setting accordingly. If you know you only have 1 node (e.g. 1 EC2 instance) you can use Python's builtin lock, and if you know you have multiple, you opt for a global cache. But still, even with that there is a race-condition risk if two independent, near-simultaneous cache-as-a-lock requests come in at the same time.
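
For the single-node case, a minimal sketch of using Python's builtin lock (this only guards callers within one process; separate worker processes or other machines are unaffected):

import threading

_local_lock = threading.Lock()

def guarded(function):
    def inner(*args, **kwargs):
        # Serialises callers in this process only; other uwsgi workers or
        # machines can still execute the function concurrently.
        with _local_lock:
            return function(*args, **kwargs)
    return inner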

lamby commented 6 years ago

the only way it can be solved is with a central cache.

Hm, I think we are agreeing on a topic that is orthogonal to this issue. :) I guess we need a redis.Lock that's a "real" semaphore.. ;)

peterbe commented 6 years ago

That's interesting, actually. I always/only use Redis for my Django caching, so we could make it configurable. I.e. if you know you have Redis and you worry about executing a function simultaneously (on cold caches), you enable this. Wanna take a stab?

lamby commented 6 years ago

I won't be able to commit bandwidth to this anytime soon, but I think it would be fair to simply assume Redis when one wants this dog-piling prevention. Do note that — as mentioned in https://github.com/peterbe/django-cache-memoize/issues/5#issuecomment-412115632 — the naive solution does not do the right thing because it never "rechecks" that the cache is warm after waiting on the lock.. but we don't want to invert the logic and serialise checking of the cache in the common/initial case either!

peterbe commented 6 years ago

Turns out, I needed this. And since I use Redis for my Django cache backend, I could use cache.lock(). See this blog post: https://www.peterbe.com/plog/django-lock-decorator-with-django-redis

Basically, we could add an option to django-cache-memoize that does this stuff but we'll put in the docs "You have to use django-redis for this to work".

Or, people who don't use django-redis but have a Redis connection available could pass in an instance of it. E.g.

@cache_memoize(100, lock=True)
def expensive_function(start, end):
    return random.randint(start, end)

Or

import redis

_redis = redis.Redis(**config)  # config: placeholder for your connection settings

@cache_memoize(100, lock=True, redis=_redis)
def expensive_function(start, end):
    return random.randint(start, end)

What do you think?

peterbe commented 6 years ago

Oh. @lamby you were the one who RT'ed this on Twitter. Hi!

lamby commented 6 years ago

Hello @peterbe. :) However, as mentioned above, this does not do the Right Thing. Specifically, it either serialises all access to the cache (icky, slow, etc...), or, if only the underlying calls to expensive_function are serialised, the naive solution results in the function being called once by each client that had to queue, instead of the waiting clients being served the result generated by the first client that had a cache miss.

peterbe commented 6 years ago

Good point. A "solution" is to use django-cache-memoize as always and combine it with something like the @lock_decorator from the blog post. Right?

@lock_decorator()
@cache_memoize(100)
def expensive_function(start, end):
    return random.randint(start, end)

Then, if two concurrent requests come in, the one that is a microsecond later should get the cached result. In theory.

Perhaps, if my logic is sound, the resolution is to mention this pattern in the README.
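
For context, a rough reconstruction of what such a @lock_decorator looks like on top of django-redis's cache.lock() (the actual implementation is in the blog post linked above; the key-building here is illustrative):

import functools
import hashlib

from django.core.cache import cache  # assumed to be a django-redis backend


def lock_decorator(key_maker=None):
    """Serialise concurrent calls to the decorated function with a Redis lock."""
    def decorator(func):
        @functools.wraps(func)
        def inner(*args, **kwargs):
            key = key_maker(*args, **kwargs) if key_maker else f"{args!r}{kwargs!r}"
            lock_key = 'lock:' + hashlib.md5(key.encode('utf-8')).hexdigest()
            # cache.lock() is provided by django-redis; it blocks until acquired.
            with cache.lock(lock_key):
                return func(*args, **kwargs)
        return inner
    return decorator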

lamby commented 6 years ago

Doesn't that still serialise all access to the cache as the @lock_decorator() is outside the memoize?

peterbe commented 6 years ago

I don't know if I understand. Perhaps we're talking about different things.

The example just above works very well in theory. Two highly concurrent requests (even on different web heads) are heading towards the @lock_decorator. One of them is a smidge ahead; it takes the lock and, with the lock held, proceeds to execute the expensive_function and, when it's done, populates the cache. Meanwhile, the second request that was a smidge behind is stuck in the redis.Lock, but as soon as that's been released, it continues into the @cache_memoize and there it can read from the cache and thus not execute expensive_function.

Isn't that ideal? Highly concurrent callers and only one global execution of the expensive function.

lamby commented 6 years ago

Highly concurrent callers

Not quite, as access to the cache is serialised via the lock; callers should be able to read the cache in parallel. We just need a variation on https://en.wikipedia.org/wiki/Readers%E2%80%93writers_problem etc.

peterbe commented 6 years ago

Can you demonstrate that problem?

If I have...:

@lock_decorator()
@cache_memoize(100)
def expensive_function(start, end):
    print("DOING FANCY WORK")
    return random.randint(start, end)

And run two threads/processes against this (with a timer), I'd expect the output to be:

Thread 1: Took 2.1 seconds
Thread 2: Took 2.11 seconds

and on stdout I would only get one:

DOING FANCY WORK

No?
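
A quick sketch of such a two-thread test (names as in the snippet above; timings will obviously vary):

import threading
import time

def timed_call(name):
    t0 = time.time()
    expensive_function(1, 100)
    print(f"{name}: Took {time.time() - t0:.2f} seconds")

threads = [
    threading.Thread(target=timed_call, args=(f"Thread {i}",)) for i in (1, 2)
]
for t in threads:
    t.start()
for t in threads:
    t.join()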

lamby commented 6 years ago

I'm not sure what your test case is trying to show, sorry. All access to the cache is now serialised, instead of being trivially parallelisable. I.e. only one request can /check/ the cache at a time, whilst "before" this was perfectly safe and legitimate.

atocyo commented 7 months ago

Locking should be initiated only if the cache returns nothing, just before the calculation. That way, most of the time, cache access is not serialized. The best approach would be to combine this with refreshing the cached data just before the cache has expired.
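
A minimal sketch of that idea, i.e. double-checked locking: take the lock only after a cache miss and recheck the cache inside it (assuming django-redis's cache.lock(); make_cache_key and TTL are placeholders):

from django.core.cache import cache  # assumed django-redis backend


def memoize_with_miss_lock(function):
    def inner(*args, **kwargs):
        cache_key = make_cache_key(*args, **kwargs)  # placeholder key builder
        # Fast path: while the cache is warm, it is read without any lock.
        result = cache.get(cache_key)
        if result is not None:
            return result
        # Slow path: only callers that missed compete for the lock.
        with cache.lock('lock:' + cache_key, timeout=60):
            # Recheck: another caller may have filled the cache while we waited.
            result = cache.get(cache_key)
            if result is None:
                result = function(*args, **kwargs)
                cache.set(cache_key, result, TTL)  # TTL: placeholder result timeout
        return result
    return inner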