tkem / cachetools

Extensible memoizing collections and decorators
MIT License

cachedmethod not working as expected with locks #279

Closed · npezolano closed this 1 year ago

npezolano commented 1 year ago

The example below runs, but it doesn't behave as expected: calc gets called multiple times for the same n, when it should be locked by one thread and then served from the cache. I tried both Lock and RLock and got the same result. The cache itself does seem to work: if I rerun res = obj.multithreaded_run(), everything comes back cached. The problem appears when the function being cached takes up to a few seconds to run and is needed by several threads at once; it won't lock properly, and it gets worse with more threads. This is a follow-up on the previous suggestion in #274.

from multiprocessing.pool import ThreadPool as Pool
from threading import Lock, RLock
import cachetools
import operator
import time
import numpy as np

cores = 4

class DataObj(object):
    def __init__(self, enable_cache=True, **kwargs):
        self.enable_cache = enable_cache
        self.lock = RLock()
        self.cache = cachetools.LRUCache(maxsize=200)

    @cachetools.cachedmethod(operator.attrgetter('cache'), lock=operator.attrgetter('lock'))
    def calc(self, n): # Should only be called once for each n
        print(f'running n : {n}')
        sleeptime = np.random.uniform(0, 1) 
        time.sleep(sleeptime) # Sleep timer to simulate loading some data
        return n

class RunnerObj(object):
    def __init__(self, data_obj=None, window=60, size=100, **kwargs):
        self.data_obj = data_obj
        self.window = window
        self.size = size

    def run(self, data):
        n = 0
        for i in data:
            n += self.data_obj.calc(i)
        return n

    def multithreaded_run(self):
        seq = list(range(self.size))
        window_size = self.window
        data = []
        # Create example windows of data to run
        for i in range(len(seq) - window_size + 1):
            data.append(seq[i: i + window_size])

        with Pool(cores) as pool:  # close the pool when done
            res = pool.map(self.run, data)
        return res

obj = RunnerObj(data_obj=DataObj())
res = obj.multithreaded_run()

See the screenshot for an example; each n should only print once: [screenshot: the same n printed multiple times]
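
For context, the lock in cachedmethod guards only the cache lookup and the store, not the call itself; it is released while the decorated method runs, so concurrent misses for the same n all execute the method. Below is a simplified, self-contained sketch of a decorator with the same locking structure (illustrative only, not the exact cachetools source):

def locked_cached(cache_attr, lock_attr):
    # Simplified decorator mimicking cachedmethod's locking structure
    def decorator(method):
        def wrapper(self, *args):
            c = getattr(self, cache_attr)
            k = args
            with getattr(self, lock_attr):  # lock held only for the lookup...
                try:
                    return c[k]             # hit: return the cached value
                except KeyError:
                    pass                    # miss: fall through
            v = method(self, *args)         # ...NOT held while the method runs,
                                            # so concurrent misses all reach this line
            with getattr(self, lock_attr):  # lock held again only for the store
                c[k] = v
            return v
        return wrapper
    return decorator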

npezolano commented 1 year ago

https://github.com/tkem/cachetools/pull/277/commits/7677e5076578b6eb4a3c1a7caf89a79809035d8b looks to fix the issue I'm facing above, but at a small performance cost.

tkem commented 1 year ago

@npezolano: The title of this issue, "not working as expected with locks", is something I have to admit is probably true, sorry. However, AFAIK this is the same behavior the standard library's @lru_cache decorator shows, and any PRs trying to change it have been rejected due to added complexity and runtime overhead. See #224 for a somewhat thorough discussion. So, sorry, unless someone comes up with some truly ingenious idea, this will be left as it is.
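
The same behavior is easy to reproduce with the standard library alone; a minimal sketch (the function name slow is illustrative):

import functools
import threading
import time

calls = []  # records every real invocation of the wrapped function

@functools.lru_cache(maxsize=None)
def slow(n):
    calls.append(n)  # list.append is atomic under the GIL
    time.sleep(0.1)  # simulate the slow work from the report above
    return n

threads = [threading.Thread(target=slow, args=(1,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(calls)  # typically [1, 1, 1, 1]: every thread missed and recomputed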

tkem commented 1 year ago

On a final note: yes, it's an issue, and probably one that should at least be stated in the docs. However, if this is a real issue for your individual use case, it's probably also an issue in your back-end service, i.e. it would arise even without caching (probably more so), and maybe should be handled there...
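
If the duplicated work has to be avoided on the caller's side, one option is double-checked locking with a per-key lock around the expensive computation. A sketch along those lines, as a variant of the DataObj above (the key_locks attribute and _compute helper are hypothetical, not part of cachetools):

import threading
from collections import defaultdict

import cachetools

class DataObj:
    def __init__(self):
        self.lock = threading.RLock()                 # guards cache and key_locks
        self.cache = cachetools.LRUCache(maxsize=200)
        self.key_locks = defaultdict(threading.Lock)  # one lock per key

    def calc(self, n):
        with self.lock:
            if n in self.cache:
                return self.cache[n]      # fast path: already cached
            key_lock = self.key_locks[n]  # fetch/create this key's lock
        with key_lock:                    # serialize computation per key
            with self.lock:
                if n in self.cache:       # re-check: another thread may have won
                    return self.cache[n]
            value = self._compute(n)      # the slow work, done once per key
            with self.lock:
                self.cache[n] = value
            return value

    def _compute(self, n):
        raise NotImplementedError  # the expensive load from the report above

This serializes only callers of the same key, so different keys still run in parallel; the per-key locks are never evicted here, which is exactly the kind of added complexity and overhead mentioned above.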

npezolano commented 1 year ago

I'm sure removing "thread safety" everywhere in the documentation will fix the issue @tkem