jamadden / mrab-regex-hg

Automatically exported from code.google.com/p/mrab-regex-hg

regex module is not thread safe because of _cache #98

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Calling regex.search() internally calls _shrink_cache(_cache, _named_args, _MAXCACHE).
2. This method modifies the module-level variable called _cache.
3. When regex.search(), or any method that calls regex._compile(), is called from multiple threads, it results in the following error:

RuntimeError: dictionary changed size during iteration

What is the expected output? What do you see instead?
The expected output is that the method works correctly when called from multiple threads; instead, the RuntimeError above is raised.

What version of the product are you using? On what operating system?
OS: Mac OS X 10.8.5
Version: regex==2013-08-04 (output of pip freeze | grep regex)

Please provide any additional information below.
The easiest way to reproduce the problem is to add the following two lines of code to the for loop at the end of the _shrink_cache() method in _regex_core.py:

        import time, random
        # Sleep for 1-2 seconds per iteration to widen the race window.
        time.sleep(random.randint(1, 2))

Then run the attached code (regex_bug.py) to reproduce the error.
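(The regex_bug.py attachment itself is not reproduced here; below is a minimal sketch of such a reproducer, with illustrative names, assuming many distinct patterns are used to overflow the cache.)

import threading
import regex

def worker(thread_id):
    # Many distinct patterns push the cache past _MAXCACHE, so
    # _shrink_cache() iterates _cache while other threads keep inserting.
    for i in range(1000):
        regex.search(r"pattern_%d_%d" % (thread_id, i), "haystack")

threads = [threading.Thread(target=worker, args=(n,)) for n in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()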

There are several approaches to fixing the problem. We are currently disabling the internal cache altogether (since we don't want to modify the source) by monkey-patching the regex module as follows:

class EmptyDict(dict):
    """Dictionary that never stores any values."""
    def __setitem__(self, *args, **kwargs):
        return None

import regex
regex._cache = EmptyDict()
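With the patch applied, regex keeps working; patterns are simply recompiled on every call. A quick sanity check (illustrative, not from the original report):

# Every store into the cache is silently dropped, so each call recompiles.
m = regex.search(r"\d+", "abc 123")
print(m.group())  # prints '123'; behaviour is unchanged, only caching is lost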

To fix the bug in the source, we can simply wrap the call to _shrink_cache() in the regex module's _compile() function with a re-entrant lock as follows:

# Add these two lines below the _cache declaration:

import threading
_cache_lock = threading.RLock()

# Replace the body of `if len(_cache) >= _MAXCACHE:` in _compile() with
# the following:

        with _cache_lock:
            _shrink_cache(_cache, _named_args, _MAXCACHE)

Original issue reported on code.google.com by sha...@glossi.com on 25 Oct 2013 at 1:07

Attachments:

regex_bug.py
GoogleCodeExporter commented 9 years ago
The Python version is 2.7.2

Original comment by sha...@glossi.com on 25 Oct 2013 at 1:11

GoogleCodeExporter commented 9 years ago
Fixed in source code.

I'm going to delay releasing it on PyPI just in case yet another issue arrives. This issue arrived only an hour after the last release, and I prefer not to release more than once a day!

Original comment by re...@mrabarnett.plus.com on 25 Oct 2013 at 3:25

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
functools.lru_cache may be a better solution (instead of random.sample).

It's Python 3 only, but it can be backported to Python 2 since it's pure Python.
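A rough sketch of the idea (compile_pattern here is a hypothetical wrapper around the public API, not the module's real internals):

import functools
import regex

@functools.lru_cache(maxsize=500)  # plays the role of _MAXCACHE
def compile_pattern(pattern, flags=0):
    # lru_cache synchronizes its own bookkeeping, so no explicit RLock is
    # needed, and eviction drops the least recently used entry rather
    # than a random sample.
    return regex.compile(pattern, flags)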

Original comment by Lyricconch on 25 Oct 2013 at 11:20

GoogleCodeExporter commented 9 years ago
The code was suggested as a better way of handling it than just clearing the 
entire cache if it's full.

I think I'll keep it as-is unless you can prove that some other way is definitely better.

Original comment by re...@mrabarnett.plus.com on 25 Oct 2013 at 5:26

GoogleCodeExporter commented 9 years ago
Fixed in regex 2013-10-26.

Original comment by re...@mrabarnett.plus.com on 25 Oct 2013 at 11:13

GoogleCodeExporter commented 9 years ago
functools.lru_cache (from docs.python.org):

LRU (least recently used) cache works best when the most recent calls are the 
best predictors of upcoming calls (for example, the most popular articles on a 
news server tend to change each day). The cache’s size limit assures that the 
cache does not grow without bound on long-running processes such as web servers.

It does not clear all entries: a cache flush evicts the least recently used entries instead of a random sample.
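For illustration (a toy demo, not part of the original comment), the eviction order is easy to observe:

from functools import lru_cache

@lru_cache(maxsize=2)
def compile_stub(pattern):
    print("compiling", pattern)
    return pattern

compile_stub("a")
compile_stub("b")
compile_stub("a")  # cache hit: "a" becomes the most recently used entry
compile_stub("c")  # cache full: evicts "b", the least recently used entry
compile_stub("a")  # still cached; no "compiling" line is printed
compile_stub("b")  # was evicted, so it is compiled again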

Original comment by Lyricconch on 1 Nov 2013 at 12:18