Closed dougjc closed 6 years ago
I've seen this issue before but there's a workaround that works for me.
I tried your repro case and the code passed without issue:
(env36) 2018-10-23 21:29:14 grantj @ Grants-MacBook-Pro : ~/repos/python-diskcache
$ ipython
Python 3.6.5 (default, Mar 29 2018, 15:37:32)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.1.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import os
In [2]: os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'tests.settings')
Out[2]: 'tests.settings'
In [3]: import django
In [4]: django.setup()
In [5]: from django.core.cache import cache
In [6]: def f():
...: cache.get('key')
...:
In [7]: import threading
In [8]: for i in range(50):
...: threading.Thread(target=f).start()
...:
In [9]:
But I've seen failures like yours in CI tests and not been able to reproduce them.
The workaround is at https://github.com/grantjenks/python-diskcache/blob/master/diskcache/core.py#L377. The high timeout (60 seconds) means that BEGIN IMMEDIATE should not raise an exception for 60 seconds.
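For illustration, a minimal sketch of that behavior (not the diskcache code itself; the path is made up):

import sqlite3

# The sqlite3 "timeout" argument sets the SQLite busy timeout, so with
# timeout=60 a conflicting BEGIN IMMEDIATE blocks for up to 60 seconds
# before raising "database is locked" instead of failing right away.
con = sqlite3.connect('/tmp/demo.db', timeout=60, isolation_level=None)
con.execute('BEGIN IMMEDIATE')
con.execute('COMMIT')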
I've seen issues with this and PRAGMAs but not with regular inserts. You'll find this comment documenting the workaround for PRAGMA statements: https://github.com/grantjenks/python-diskcache/blob/master/diskcache/core.py#L1871
How long does your snippet run before throwing the exception?
Multi-threaded and multi-process programming is tricky so I may be missing something obvious. I didn't understand your comment about "each django thread using a private cache object." Could you explain more?
My snippet fails immediately which makes me think the DATABASE_TIMEOUT is not being respected.
On the "each django thread using a private cache object.", note here https://github.com/django/django/blob/master/django/core/cache/__init__.py#L61 that the CacheHandler uses thread local storage for the self._caches variable so that each thread gets its own instance of the cache object returned from _cache_create(). You can see this path in the backtrace in the initial comments.
Interesting. I hadn’t seen that code in Django before. But it shouldn’t matter. Disk Cache is intended to be entirely thread and process safe.
What’s your platform like? I’m running locally on a Mac with CPython 3.6 built from macports.
You might also try inserting the retry logic that I pointed to for pragmas around the insert/update code for settings. It shouldn't be necessary, but I'd be willing to accept it as a workaround. I wonder if updating pragmas could cause issues for other threads. Unfortunately, the SQLite docs don't indicate what kind of locks are acquired for each pragma (but I should maybe ask).
My platform is a MacBook Pro with an SSD, and you can see the Python build info in the output at the start of the first comment.
But I don't think that is the issue...more likely a timing issue. Could you share the cache configuration you are using?
I think you are correct that the Django cache object per thread should not matter -- it just makes it much more likely to have multiple threads running through the initialization in parallel, doing all the setup commands, and thus hitting the SQLite 'database is locked' error.
Hunting around, I think this is the same generic sqlite issue as described here.
I updated the _sql property to wrap a retry on sqlite3.OperationalError and that takes care of the issue. By adding a print statement I can see the retry is getting executed.
From
return con.execute
to
def sqlr(statement, *args, **kwargs):
    start = None
    while True:
        try:
            return con.execute(statement, *args, **kwargs)
        except sqlite3.OperationalError:
            start = start or time.time()
            if time.time() - start > self._timeout:
                raise
            time.sleep(0.01)

return sqlr
Clever! The sqlr function simulated the expected timeout/retry behavior.
Could you refactor that to a separate _sql_retry property that calls _sql and wraps the function? Your current changes make core.Cache into a FanoutCache (no Timeout exceptions). I’d also like to verify that the retry behavior is only needed inside the initializer method.
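Something along these lines is what I have in mind -- a rough sketch only, with made-up names, not the actual diskcache code:

import sqlite3
import time

class RetrySketch:
    def __init__(self, connection, timeout=60):
        self._con = connection
        self._timeout = timeout

    @property
    def _sql(self):
        # Plain passthrough: callers that want Timeout semantics use this.
        return self._con.execute

    @property
    def _sql_retry(self):
        # Wraps _sql with the timeout/retry loop, for use only where needed.
        sql = self._sql

        def _execute_with_retry(statement, *args, **kwargs):
            start = time.time()
            while True:
                try:
                    return sql(statement, *args, **kwargs)
                except sqlite3.OperationalError:
                    if time.time() - start > self._timeout:
                        raise
                    time.sleep(0.001)

        return _execute_with_retry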
I just saw the SQLite pragma for busy timeout: https://www.sqlite.org/pragma.html#pragma_busy_timeout could you query that on your machine?
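For reference, the pragma can be queried directly from Python; with connect(timeout=60) it should report 60000 (milliseconds):

import sqlite3

con = sqlite3.connect('/tmp/demo.db', timeout=60)
print(con.execute('PRAGMA busy_timeout').fetchall())  # e.g. [(60000,)]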
I had already tried forcing the busy-timeout pragma after the initial connection, but it made no difference. That, plus the immediate exception, is what leads me to believe that the busy timeout is not the issue; rather, as described in the 'sqlite database locked' link, the error is unavoidable when using multiple connections.
Weird that you are not able to reproduce the issue. How many shards do you have in your test configuration? I think increasing this number would make it more likely to hit the error. Also, what sort of disk are the shards using? This would also affect the timing.
The only other difference I can think of is the version/configuration of sqlite itself. I have I believe a standard install:
SQLite version 3.19.3 2017-06-27 16:48:08
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> pragma compile_options;
BUG_COMPATIBLE_20160819
COMPILER=clang-9.0.0
DEFAULT_CACHE_SIZE=2000
DEFAULT_CKPTFULLFSYNC
DEFAULT_JOURNAL_SIZE_LIMIT=32768
DEFAULT_PAGE_SIZE=4096
DEFAULT_SYNCHRONOUS=2
DEFAULT_WAL_SYNCHRONOUS=1
ENABLE_API_ARMOR
ENABLE_COLUMN_METADATA
ENABLE_DBSTAT_VTAB
ENABLE_FTS3
ENABLE_FTS3_PARENTHESIS
ENABLE_FTS3_TOKENIZER
ENABLE_FTS4
ENABLE_FTS5
ENABLE_JSON1
ENABLE_LOCKING_STYLE=1
ENABLE_PREUPDATE_HOOK
ENABLE_RTREE
ENABLE_SESSION
ENABLE_SNAPSHOT
ENABLE_SQLLOG
ENABLE_UNKNOWN_SQL_FUNCTION
ENABLE_UPDATE_DELETE_LIMIT
HAVE_ISNAN
MAX_LENGTH=2147483645
MAX_MMAP_SIZE=1073741824
MAX_VARIABLE_NUMBER=500000
OMIT_AUTORESET
OMIT_LOAD_EXTENSION
STMTJRNL_SPILL=131072
THREADSAFE=2
USE_URI
sqlite>
I can also see that the errors occur in the normal cache.set() operations, but they are hidden/handled by the FanoutCache Timeout/retry mechanism.
On the _sql_retry request, are you asking that it be used only in the __init__() routine? If we used it within the reset() routine as well, we could remove the pragma retry logic.
Yes, I think "database is locked" may be unavoidable but I'm having difficulty pinning down the issue. I thought I had a reproducible version of the issue last night but it has disappeared this morning :( Oddly, it disappeared after I ran the benchmarks to assess any performance penalty of your retry-function wrapper.
In my local testing, I am using the defaults: 8 shards, 10ms timeout, 1GB size limit, standard core.Disk type.
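For reference, a Django configuration along those lines looks roughly like this (values match the defaults above; consult the diskcache DjangoCache docs for the exact keys):

CACHES = {
    'default': {
        'BACKEND': 'diskcache.DjangoCache',
        'LOCATION': '/tmp/django-cache',       # made-up path
        'SHARDS': 8,
        'DATABASE_TIMEOUT': 0.010,             # 10 milliseconds
        'OPTIONS': {'size_limit': 2 ** 30},    # 1 gigabyte
    },
}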
Because SQLite is embedded, the compile options of your SQLite executable and the one embedded in Python are likely different. I've written a small script to exercise the topics we've discussed in this issue and pushed it to the master branch. Would you mind cloning the repo and running the following:
$ git clone git@github.com:grantjenks/python-diskcache.git diskcache-test
$ cd diskcache-test/
$ python3.6 -m venv /tmp/env36
$ source /tmp/env36/bin/activate
$ python -m pip install "Django<2"
$ export PYTHONPATH=`pwd`
$ python tests/issue_85.py
For that last command, what's your output? Mine looks like:
REMOVING CACHE DIRECTORY
INITIALIZING DJANGO
RUNNING MULTI-THREADING INIT TEST
SQLITE COMPILE OPTIONS
COMPILER=clang-9.1.0
DISABLE_INTRINSIC
ENABLE_COLUMN_METADATA
ENABLE_FTS3
ENABLE_FTS3_PARENTHESIS
ENABLE_FTS4
ENABLE_FTS5
ENABLE_JSON1
ENABLE_RTREE
ENABLE_STAT4
ENABLE_UNLOCK_NOTIFY
SECURE_DELETE
SOUNDEX
THREADSAFE=1
CREATING DATA TABLE
RUNNING read/write
Error count: 0
RUNNING write/read
Error count: 0
RUNNING begin/read/write
Error count: 85
RUNNING begin/write/read
Error count: 0
RUNNING begin immediate/read/write
Error count: 0
RUNNING begin immediate/write/read
Error count: 0
RUNNING begin exclusive/read/write
Error count: 0
RUNNING begin exclusive/write/read
Error count: 0
If you can't pass the "RUNNING MULTI-THREADING INIT TEST" section then would you comment out that code and include the "SQLITE COMPILE OPTIONS" along with everything under "CREATING DATA TABLE". Thanks so much for your help!
What I see in the output matches my understanding of SQLite. The only errors occur from transactions that use the deferred lock and so get stuck between reads/writes. All the other scenarios work without issue.
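A minimal illustration of that deferred-lock failure mode (assuming a Settings table like the one diskcache creates):

import sqlite3

con = sqlite3.connect('/tmp/demo.db', isolation_level=None)
con.execute('CREATE TABLE IF NOT EXISTS Settings (key TEXT NOT NULL UNIQUE, value)')
con.execute('BEGIN')  # deferred: only a read lock is taken at the SELECT
con.execute('SELECT value FROM Settings WHERE key = ?', ('count',))
# If another connection commits a write here, the write below cannot upgrade
# the read lock and SQLite may fail with "database is locked" immediately
# instead of waiting for the busy timeout.
con.execute("INSERT OR REPLACE INTO Settings VALUES ('count', 1)")
con.execute('COMMIT')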
Assuming you have different error counts, could I ask you to run two more scripts:
$ python tests/benchmark_core.py -p 1
$ python tests/benchmark_core.py -p 8
$ python tests/issue_85.py
I'm just baffled that running the benchmarks could impact the transactions but that's the behavior I observed this morning!?!?@!#%$@#
I also had to modify your retry-wrapper. Now, inside Cache._sql I've experimented with:
con = self._con

def execute_with_retry(statement, *args, **kwargs):
    start = time.time()
    while True:
        try:
            return con.execute(statement, *args, **kwargs)
        except sqlite3.OperationalError as exc:
            if exc.args[0] != 'database is locked':
                raise
            diff = time.time() - start
            if diff > self._timeout:
                raise
            time.sleep(0.001)

return execute_with_retry
I had to add the exc.args[0] != 'database is locked' check. Without that check, it gets stuck waiting for 60 seconds on the missing Settings table during initialization. Did you not have that issue? (Note that I do not have the above patch applied for any of the script output shown above.)
Thanks for the detailed test instructions!
On the 60 second delay, I likely didn't see this as I was always testing with an existing setup.
Running the test, I get lots of 'database is locked' errors. Here's the output with the test commented out. Note the THREADSAFE=2, which is different from what you show. From here, I think your value of 1 is 'Serialized' while my setting of 2 is 'Multi-thread'.
(env36) doug@Dougs-MacBook-Pro:~/diskcache-test$ python tests/issue_85.py
REMOVING CACHE DIRECTORY
INITIALIZING DJANGO
RUNNING MULTI-THREADING INIT TEST
SQLITE COMPILE OPTIONS
BUG_COMPATIBLE_20160819
COMPILER=clang-9.0.0
DEFAULT_CACHE_SIZE=2000
DEFAULT_CKPTFULLFSYNC
DEFAULT_JOURNAL_SIZE_LIMIT=32768
DEFAULT_PAGE_SIZE=4096
DEFAULT_SYNCHRONOUS=2
DEFAULT_WAL_SYNCHRONOUS=1
ENABLE_API_ARMOR
ENABLE_COLUMN_METADATA
ENABLE_DBSTAT_VTAB
ENABLE_FTS3
ENABLE_FTS3_PARENTHESIS
ENABLE_FTS3_TOKENIZER
ENABLE_FTS4
ENABLE_FTS5
ENABLE_JSON1
ENABLE_LOCKING_STYLE=1
ENABLE_PREUPDATE_HOOK
ENABLE_RTREE
ENABLE_SESSION
ENABLE_SNAPSHOT
ENABLE_SQLLOG
ENABLE_UNKNOWN_SQL_FUNCTION
ENABLE_UPDATE_DELETE_LIMIT
HAS_CODEC_RESTRICTED
HAVE_ISNAN
MAX_LENGTH=2147483645
MAX_MMAP_SIZE=1073741824
MAX_VARIABLE_NUMBER=500000
OMIT_AUTORESET
OMIT_LOAD_EXTENSION
STMTJRNL_SPILL=131072
THREADSAFE=2
USE_URI
CREATING DATA TABLE
RUNNING read/write
Error count: 0
RUNNING write/read
Error count: 0
RUNNING begin/read/write
Error count: 91
RUNNING begin/write/read
Error count: 43
RUNNING begin immediate/read/write
Error count: 8
RUNNING begin immediate/write/read
Error count: 7
RUNNING begin exclusive/read/write
Error count: 36
RUNNING begin exclusive/write/read
Error count: 16
(env36) doug@Dougs-MacBook-Pro:~/diskcache-test$
Finally, here's the output from the last three commands you requested:
(env36) doug@Dougs-MacBook-Pro:~/diskcache-test$ python tests/benchmark_core.py -p 1
tests/benchmark_core.py:71: UserWarning: skipping pylibmc
warnings.warn('skipping pylibmc')
tests/benchmark_core.py:88: UserWarning: skipping redis
warnings.warn('skipping redis')
========= ========= ========= ========= ========= ========= ========= =========
Timings for diskcache.Cache
-------------------------------------------------------------------------------
Action Count Miss Median P90 P99 Max Total
========= ========= ========= ========= ========= ========= ========= =========
get 89115 8714 26.226us 35.048us 50.068us 276.089us 2.463s
set 8941 0 138.044us 184.059us 316.858us 21.123ms 2.116s
delete 943 111 108.004us 163.078us 5.378ms 17.564ms 260.050ms
Total 98999 4.839s
========= ========= ========= ========= ========= ========= ========= =========
========= ========= ========= ========= ========= ========= ========= =========
Timings for diskcache.FanoutCache(shards=4, timeout=1.0)
-------------------------------------------------------------------------------
Action Count Miss Median P90 P99 Max Total
========= ========= ========= ========= ========= ========= ========= =========
get 89115 8714 27.180us 36.240us 54.121us 250.816us 2.619s
set 8941 0 133.038us 181.913us 755.072us 26.630ms 2.107s
delete 943 111 103.712us 160.933us 787.973us 33.084ms 235.808ms
Total 98999 4.962s
========= ========= ========= ========= ========= ========= ========= =========
========= ========= ========= ========= ========= ========= ========= =========
Timings for diskcache.FanoutCache(shards=8, timeout=0.010)
-------------------------------------------------------------------------------
Action Count Miss Median P90 P99 Max Total
========= ========= ========= ========= ========= ========= ========= =========
get 89115 8714 27.895us 36.955us 58.174us 5.004ms 2.674s
set 8941 0 137.806us 187.159us 760.078us 43.319ms 2.140s
delete 943 111 105.858us 158.072us 256.062us 19.788ms 184.315ms
Total 98999 4.998s
========= ========= ========= ========= ========= ========= ========= =========
(env36) doug@Dougs-MacBook-Pro:~/diskcache-test$ python tests/benchmark_core.py -p 8
tests/benchmark_core.py:71: UserWarning: skipping pylibmc
warnings.warn('skipping pylibmc')
tests/benchmark_core.py:88: UserWarning: skipping redis
warnings.warn('skipping redis')
========= ========= ========= ========= ========= ========= ========= =========
Timings for diskcache.Cache
-------------------------------------------------------------------------------
Action Count Miss Median P90 P99 Max Total
========= ========= ========= ========= ========= ========= ========= =========
get 712612 70849 46.968us 106.096us 3.064ms 22.244ms 75.879s
set 71464 0 1.893ms 12.988ms 28.938ms 219.795ms 352.205s
delete 7916 789 1.787ms 12.651ms 28.946ms 127.790ms 38.114s
Total 791992 466.199s
========= ========= ========= ========= ========= ========= ========= =========
Process Process-13:
Traceback (most recent call last):
File "/Users/doug/.pyenv/versions/3.6.2/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/Users/doug/.pyenv/versions/3.6.2/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "tests/benchmark_core.py", line 96, in worker
obj = kind(*args, **kwargs)
File "/Users/doug/diskcache-test/diskcache/fanout.py", line 38, in __init__
for num in range(shards)
File "/Users/doug/diskcache-test/diskcache/fanout.py", line 38, in <genexpr>
for num in range(shards)
File "/Users/doug/diskcache-test/diskcache/core.py", line 434, in __init__
sql(query, (key, value))
sqlite3.OperationalError: database is locked
Traceback (most recent call last):
File "tests/benchmark_core.py", line 206, in <module>
dispatch()
File "tests/benchmark_core.py", line 165, in dispatch
with open(filename, 'rb') as reader:
FileNotFoundError: [Errno 2] No such file or directory: 'output-4.pkl'
(env36) doug@Dougs-MacBook-Pro:~/diskcache-test$ python tests/issue_85.py
REMOVING CACHE DIRECTORY
INITIALIZING DJANGO
RUNNING MULTI-THREADING INIT TEST
SQLITE COMPILE OPTIONS
BUG_COMPATIBLE_20160819
COMPILER=clang-9.0.0
DEFAULT_CACHE_SIZE=2000
DEFAULT_CKPTFULLFSYNC
DEFAULT_JOURNAL_SIZE_LIMIT=32768
DEFAULT_PAGE_SIZE=4096
DEFAULT_SYNCHRONOUS=2
DEFAULT_WAL_SYNCHRONOUS=1
ENABLE_API_ARMOR
ENABLE_COLUMN_METADATA
ENABLE_DBSTAT_VTAB
ENABLE_FTS3
ENABLE_FTS3_PARENTHESIS
ENABLE_FTS3_TOKENIZER
ENABLE_FTS4
ENABLE_FTS5
ENABLE_JSON1
ENABLE_LOCKING_STYLE=1
ENABLE_PREUPDATE_HOOK
ENABLE_RTREE
ENABLE_SESSION
ENABLE_SNAPSHOT
ENABLE_SQLLOG
ENABLE_UNKNOWN_SQL_FUNCTION
ENABLE_UPDATE_DELETE_LIMIT
HAS_CODEC_RESTRICTED
HAVE_ISNAN
MAX_LENGTH=2147483645
MAX_MMAP_SIZE=1073741824
MAX_VARIABLE_NUMBER=500000
OMIT_AUTORESET
OMIT_LOAD_EXTENSION
STMTJRNL_SPILL=131072
THREADSAFE=2
USE_URI
CREATING DATA TABLE
RUNNING read/write
Error count: 0
RUNNING write/read
Error count: 0
RUNNING begin/read/write
Error count: 91
RUNNING begin/write/read
Error count: 46
RUNNING begin immediate/read/write
Error count: 54
RUNNING begin immediate/write/read
Error count: 12
RUNNING begin exclusive/read/write
Error count: 18
RUNNING begin exclusive/write/read
Error count: 27
Thanks for running the script and the quick reply. It's fascinating to see the different error counts on different machines!
The threadsafe setting difference is interesting but not exactly useful. At least, I'm not sure what it means.
I think the best way forward is to work around the issue. Seems like your retry logic is needed. So some open questions: which clock should the retry use: time()? monotonic()? perf_counter()? And should the loop test elapsed time or just retry a fixed number of times (for _ in range(1000): pass)? Does it matter? (There's an interesting chart of sleep() accuracy.) I think the goal is for the currently running thread to yield to others. CPython's GIL logic may complicate the answer here.
I noticed tonight that the inserts in __init__ do not occur inside a transaction. I wonder if that's the issue. I didn't use a transaction for the whole of the SQL statements in __init__ because some statements, like CREATE TABLE, do not work inside transactions. But I'm very careful in the rest of the code to put every INSERT/UPDATE/DELETE inside a transaction; everywhere but there in __init__, I use a transaction. My memory is a bit foggy, but I recall learning that a transaction was needed around all write operations when managing them myself (using isolation_level=None in the code).
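A sketch of that discipline with isolation_level=None (illustration only, not the diskcache source):

import sqlite3

# With isolation_level=None the sqlite3 module never opens transactions on
# its own, so every write is wrapped explicitly in BEGIN IMMEDIATE/COMMIT,
# which takes the write lock up front instead of deferring it.
con = sqlite3.connect('/tmp/demo.db', isolation_level=None)
con.execute('CREATE TABLE IF NOT EXISTS Settings (key TEXT NOT NULL UNIQUE, value)')
con.execute('BEGIN IMMEDIATE')
con.execute('INSERT OR REPLACE INTO Settings VALUES (?, ?)', ('count', 0))
con.execute('COMMIT')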
I updated tests/issue_85.py and refactored everything into functions. Do you think you could re-run that and paste the output? The new script indicates whether my assumptions about serializability using BEGIN IMMEDIATE and COMMIT are true on your machine. Here are the steps again to run the script:
$ git clone git@github.com:grantjenks/python-diskcache.git diskcache-test
$ cd diskcache-test/
$ python3.6 -m venv /tmp/env36
$ source /tmp/env36/bin/activate
$ python -m pip install "Django<2"
$ export PYTHONPATH=`pwd`
$ python tests/issue_85.py
As before, you'll probably have to comment out multi_threading_init_test() because it will fail on your machine.
Sorry for the delay ...
I pulled in the changes and had to comment out multi_threading_init_test() as you expected. Here's the output:
(arcviz_3.6.2) doug@dougs-mbp:~/diskcache-test$ python tests/issue_85.py
REMOVING CACHE DIRECTORY
INITIALIZING DJANGO
SQLITE COMPILE OPTIONS
BUG_COMPATIBLE_20160819
COMPILER=clang-9.0.0
DEFAULT_CACHE_SIZE=2000
DEFAULT_CKPTFULLFSYNC
DEFAULT_JOURNAL_SIZE_LIMIT=32768
DEFAULT_PAGE_SIZE=4096
DEFAULT_SYNCHRONOUS=2
DEFAULT_WAL_SYNCHRONOUS=1
ENABLE_API_ARMOR
ENABLE_COLUMN_METADATA
ENABLE_DBSTAT_VTAB
ENABLE_FTS3
ENABLE_FTS3_PARENTHESIS
ENABLE_FTS3_TOKENIZER
ENABLE_FTS4
ENABLE_FTS5
ENABLE_JSON1
ENABLE_LOCKING_STYLE=1
ENABLE_PREUPDATE_HOOK
ENABLE_RTREE
ENABLE_SESSION
ENABLE_SNAPSHOT
ENABLE_SQLLOG
ENABLE_UNKNOWN_SQL_FUNCTION
ENABLE_UPDATE_DELETE_LIMIT
HAS_CODEC_RESTRICTED
HAVE_ISNAN
MAX_LENGTH=2147483645
MAX_MMAP_SIZE=1073741824
MAX_VARIABLE_NUMBER=500000
OMIT_AUTORESET
OMIT_LOAD_EXTENSION
STMTJRNL_SPILL=131072
THREADSAFE=2
USE_URI
CREATING DATA TABLE
RUNNING begin/read/write
Error count: 98
Begin count: 100
Commit count: 2
Serialized: False
RUNNING begin/write/read
Error count: 97
Begin count: 100
Commit count: 3
Serialized: False
RUNNING begin immediate/read/write
Error count: 95
Begin count: 5
Commit count: 5
Serialized: True
RUNNING begin immediate/write/read
Error count: 92
Begin count: 8
Commit count: 8
Serialized: True
RUNNING begin exclusive/read/write
Error count: 96
Begin count: 4
Commit count: 4
Serialized: True
RUNNING begin exclusive/write/read
Error count: 96
Begin count: 4
Commit count: 4
Serialized: True
On the sleep/retry question, my thinking is that the time/sleep method is best. A 1ms sleep is fast but reasonable on current machines. I also prefer testing for elapsed time rather than a loop count, as this gives an accurate timeout regardless of how accurate the sleep is or how long each attempt takes.
I continue to believe that the THREADSAFE setting is why we are seeing such different results. I spent a bunch of time trying to figure out how to pass the 'flags' parameter to the connect() call and switch the compiled code to use sqlite3_open_v2(), as was done here (for, I think, the pysqlite module for Python 2), but getting the whole build environment set up for Python 3 was taking more time than I had available.
While I'd love to confirm the above (as I see lots of folks hitting this 'database is locked' error in many contexts), I think for diskcache it doesn't really matter, since we want it to work with folks' current Python environments.
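For what it's worth, the compile options of the SQLite library that Python embeds (the one diskcache actually uses) can be checked without rebuilding anything:

import sqlite3

con = sqlite3.connect(':memory:')
print(sqlite3.sqlite_version)
for (option,) in con.execute('PRAGMA compile_options'):
    print(option)  # includes the THREADSAFE=... setting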
Ok, having thought about it and played with the code, here are my answers: the retry logic is only needed in __init__ and inside reset; these are both special cases, and trying to use the same logic everywhere caused other problems. And time.time() is good enough.
I have a potential fix implemented in branch issue-85. Would you mind pulling that branch and running the tests? Something like:
$ git clone git@github.com:grantjenks/python-diskcache.git diskcache-test
$ cd diskcache-test/
$ git checkout issue-85
$ python3.6 -m venv /tmp/env36
$ source /tmp/env36/bin/activate
$ python -m pip install -r requirements.txt
$ nosetests
All the tests should pass.
Could you also try it on your DjangoCache test case with multiple threads?
Assuming all the tests pass, I'll merge the pull request (#86) and deploy a new version.
Thank you for the sql-retry code.
Thanks for the update. I'm still getting a lot of errors that I need to dig into.
In one case, I'm getting a lot of 'sqlite3.OperationalError: locking protocol' failures on the stress_test_deque_mp.py test. The output is messy due to multiple processes all printing tracebacks at the same time, but starts like:
(env36) doug@Dougs-MacBook-Pro:~/diskcache-test$ python tests/stress_test_deque_mp.py
Process Process-4:
Traceback (most recent call last):
File "/Users/doug/.pyenv/versions/3.6.2/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/Users/doug/.pyenv/versions/3.6.2/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "tests/stress_test_deque_mp.py", line 106, in stress
if len(deque) > 100:
File "/Users/doug/diskcache-test/diskcache/persistent.py", line 296, in __len__
return len(self._cache)
File "/Users/doug/diskcache-test/diskcache/core.py", line 1850, in __len__
return self.reset('count')
File "/Users/doug/diskcache-test/diskcache/core.py", line 1893, in reset
(value,), = sql(select, (key,)).fetchall()
File "/Users/doug/diskcache-test/diskcache/core.py", line 588, in _execute_with_retry
return con.execute(statement, *args, **kwargs)
MemoryError
Process Process-1:
Process Process-3:
Process Process-2:
Traceback (most recent call last):
Traceback (most recent call last):
File "/Users/doug/.pyenv/versions/3.6.2/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/Users/doug/.pyenv/versions/3.6.2/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "tests/stress_test_deque_mp.py", line 106, in stress
if len(deque) > 100:
File "/Users/doug/diskcache-test/diskcache/persistent.py", line 296, in __len__
return len(self._cache)
File "/Users/doug/diskcache-test/diskcache/core.py", line 1850, in __len__
return self.reset('count')
File "/Users/doug/diskcache-test/diskcache/core.py", line 1893, in reset
(value,), = sql(select, (key,)).fetchall()
File "/Users/doug/diskcache-test/diskcache/core.py", line 588, in _execute_with_retry
return con.execute(statement, *args, **kwargs)
File "/Users/doug/.pyenv/versions/3.6.2/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/Users/doug/.pyenv/versions/3.6.2/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "tests/stress_test_deque_mp.py", line 106, in stress
if len(deque) > 100:
sqlite3.OperationalError: locking protocol
I found this link https://stackoverflow.com/questions/46331178/causes-of-sqlite3-operationalerror-locking-protocol-exception which seems to apply.
I then added a close() at the start of the target stress function so that each process would open a new connection:
def stress(seed, deque):
    deque._cache.close()
    random.seed(seed)
And with this change, the test ran clean. I think other mp tests might need this same type of change but haven't gone through them yet.
I got to a set of changes so that 30 iterations of the nosetests run clean (the changes are reflected in the diff below).
With changes 1-4, I could get to about 5-10 iterations of nosetests before a failure, but I needed change 5 to get to (at least) 30 iterations without errors. What led me there was adding some instrumentation and finding that occasionally a single PRAGMA operation was taking 50 or more seconds (when multiple operations were ongoing) before failing with 'database is locked' -- specifically the 'PRAGMA journal_mode = wal' and 'PRAGMA auto_vacuum = 1' operations.
This seems like a reasonable set of changes. Here's the diff that you can consider:
diff --git a/diskcache/core.py b/diskcache/core.py
index d7de9a2..e42abef 100644
--- a/diskcache/core.py
+++ b/diskcache/core.py
@@ -114,6 +114,7 @@ EVICTION_POLICY = {
     },
 }
+PROCESS_LOCK = threading.Lock()
 class Disk(object):
     "Cache key and value serialization for SQLite database and files."
@@ -407,12 +408,12 @@ class Cache(object):
             sets.pop(key, None)
         # Chance to set pragmas before any tables are created.
-        for key, value in sorted(sets.items()):
-            if not key.startswith('sqlite_'):
-                continue
-
-            self.reset(key, value, update=False)
+        with PROCESS_LOCK:
+            for key, value in sorted(sets.items()):
+                if not key.startswith('sqlite_'):
+                    continue
+                self.reset(key, value, update=False)
         sql('CREATE TABLE IF NOT EXISTS Settings ('
             ' key TEXT NOT NULL UNIQUE,'
@@ -538,9 +539,14 @@ class Cache(object):
     @property
     def _con(self):
+        pid = getattr(self._local, 'pid', None)
         con = getattr(self._local, 'con', None)
+        if pid != os.getpid() and con is not None:
+            self.close()
+            con = None
         if con is None:
+            pid = self._local.pid = os.getpid()
             con = self._local.con = sqlite3.connect(
                 op.join(self._directory, DBNAME),
                 timeout=self._timeout,
@@ -587,7 +593,7 @@ class Cache(object):
                 try:
                     return con.execute(statement, *args, **kwargs)
                 except sqlite3.OperationalError as exc:
-                    if exc.args[0] != 'database is locked':
+                    if not 'database is locked' in str(exc):
                         raise
                     diff = time.time() - start
                     if diff > 60:
@@ -600,11 +606,12 @@ class Cache(object):
     @cl.contextmanager
     def _transact(self, filename=None):
         sql = self._sql
+        sql_retry = self._sql_retry
         filenames = []
         _disk_remove = self._disk.remove
         try:
-            sql('BEGIN IMMEDIATE')
+            sql_retry('BEGIN IMMEDIATE')
         except sqlite3.OperationalError:
             if filename is not None:
                 _disk_remove(filename)
@@ -1912,7 +1919,9 @@ class Cache(object):
             # ``_execute_with_retry`` in ``self._sql_retry``.
             pragma = key[7:]
-            sql('PRAGMA %s = %s' % (pragma, value)).fetchall()
+            curval = sql('PRAGMA %s' % (pragma)).fetchall()[0][0]
+            if curval != value:
+                sql('PRAGMA %s = %s' % (pragma, value)).fetchall()
         elif key.startswith('disk_'):
             attr = key[5:]
             setattr(self._disk, attr, value)
diff --git a/tests/test_core.py b/tests/test_core.py
index d58e013..58d6fc5 100644
--- a/tests/test_core.py
+++ b/tests/test_core.py
@@ -187,6 +187,7 @@ def test_pragma_error(cache):
     fetchall = mock.Mock()
     local.con = con
+    local.pid = os.getpid()
     con.execute = execute
     execute.return_value = cursor
     cursor.fetchall = fetchall
@@ -336,6 +337,7 @@ def test_set_timeout(cache):
     execute = mock.Mock()
     local.con = con
+    local.pid = os.getpid()
     con.execute = execute
     execute.side_effect = sqlite3.OperationalError
@@ -954,6 +956,7 @@ def test_add_timeout(cache):
     execute = mock.Mock()
     local.con = con
+    local.pid = os.getpid()
     con.execute = execute
     execute.side_effect = sqlite3.OperationalError
It looks to me that stress_test_deque_mp.py is simply broken. I don't know why it works on my machine. I expected multiprocessing to pickle all the arguments to the process, but it does not. After some print-debugging and careful reading of the docs, multiprocessing does not pickle the arguments for the "fork" start method. I should probably fix that and see where else that assumption is invalid.
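A small demonstration of that point (Unix-only, since the "fork" start method is not available on Windows):

import multiprocessing as mp
import sqlite3

def worker(con):
    # With "fork" the child simply inherits the parent's memory, so an
    # unpicklable argument such as an open SQLite connection gets through;
    # with "spawn" the arguments must be pickled and this would fail.
    print('child received', con)

if __name__ == '__main__':
    con = sqlite3.connect(':memory:')
    ctx = mp.get_context('fork')
    p = ctx.Process(target=worker, args=(con,))
    p.start()
    p.join()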
Your changes are reasonable but let me ask a few questions:
1. Ok, good idea.
2. If I use multiprocessing correctly, can you get rid of the pid check? That may be a worthwhile feature, but it's not part of the current design. In the docs, I explicitly state that diskcache objects do not survive forking. But this would solve that issue, I think. Looking up the pid takes 91ns on my machine, so I think we can afford it.
3. Why only set the pragma if it's changed? Just an optimization, or something more?
4. Why retry on BEGIN IMMEDIATE? I think it should be ok for the API to just Timeout early. Is this just to make the timeout more accurate?
5. The PROCESS_LOCK only seems to guard setting the pragmas before any tables are created. Pragmas are kind of quirky directives, so I can see how this (5) combined with (3) would reduce a lot of pragma-setting conflicts. The 50 second delay that you saw, that wasn't simply _sql_retry repeating for 60 seconds during initialization, was it? Also, should this be a multiprocessing.Lock instead? Can there be pragma conflicts from different processes?
As I look at them, your changes are growing on me. I want diskcache to be bullet proof but I also need the source to be maintainable.
I'm sorry I can't reproduce your issues locally. That would make things much simpler. Out of curiosity, where does your Python executable come from? Is it CPython from python.org? Macports? Homebrew? Custom build?
Definitely would be good if you could repro these issues. I continue to believe the THREADSAFE setting is the difference. Here are the instructions I used to set up my env:
## Download and install Python 3.6.2 using pyenv
$ brew update && brew upgrade pyenv
$ brew update && brew upgrade # Optional
## Note: You might need to start a new terminal shell after this step
$ pyenv install 3.6.2
Installing Python-3.6.2...
Installed Python-3.6.2 to /Users/----/.pyenv/versions/3.6.2
## Verify 3.6.2 is installed
$ pyenv versions
system
2.6.6
2.7
3.6.2
For (2), yes if the tests are fixed and the docs say that cache objects don't survive forks, then dropping this change seems ok. Although it's not an unreasonable protection--either as is or switched to an assert to enforce users don't attempt to use cache objects across a fork.
For (3), since a lot of the errors were on these operations, I was guessing that these calls trigger long-running operations and was trying to reduce the need for making them. Likely this change doesn't matter.
For (4), I didn't look into the context, so I don't know the answers to your questions at this time, but without this change I get errors like:
(env36) doug@dougs-mbp:~/diskcache-test$ for i in `seq 1 30` ; do echo $i; date; nosetests; done
1
Sun Nov 4 08:14:48 PST 2018
....Exception in thread Thread-12:
Traceback (most recent call last):
File "/Users/doug/diskcache-test/diskcache/core.py", line 614, in _transact
sql('BEGIN IMMEDIATE')
sqlite3.OperationalError: database is locked
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/doug/.pyenv/versions/3.6.2/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/Users/doug/.pyenv/versions/3.6.2/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/Users/doug/diskcache-test/tests/stress_test_core.py", line 134, in worker
cache.set(key, value, expire=EXPIRE)
File "/Users/doug/diskcache-test/diskcache/core.py", line 675, in set
with self._transact(filename) as (sql, cleanup):
File "/Users/doug/.pyenv/versions/3.6.2/lib/python3.6/contextlib.py", line 81, in __enter__
return next(self.gen)
File "/Users/doug/diskcache-test/diskcache/core.py", line 618, in _transact
raise Timeout
diskcache.core.Timeout
Exception in thread Thread-9:
Traceback (most recent call last):
File "/Users/doug/diskcache-test/diskcache/core.py", line 614, in _transact
sql('BEGIN IMMEDIATE')
sqlite3.OperationalError: database is locked
During handling of the above exception, another exception occurred:
For (5), correct: the 50 second (and longer) delay was not the _sql_retry delay but the time for a single execute operation within the _sql_retry loop before it failed with 'database is locked'. Note that given this timing, the previous pragma retry loop of 60000 cycles is essentially 'retry forever until this passes'. I believe you are also correct that different processes could conflict, so a multiprocessing lock would in some sense be safer, but those are also more complex and seem suited to single multiprocessing applications -- for example, what if multiple completely separate applications are using diskcache with the same cache location? If we really want single access during init, then I think adding a file system lock, perhaps like https://pypi.org/project/filelock/, makes more sense.
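A hypothetical sketch of that file-lock idea, using the third-party filelock package (initialize_settings is a stand-in for the pragma/Settings setup):

from filelock import FileLock  # pip install filelock

def initialize_settings():
    # Stand-in for the pragma and Settings-table initialization.
    pass

# Only one process (or thread) at a time runs the initialization, even for
# completely separate applications sharing the same cache directory.
with FileLock('/tmp/cache-directory/init.lock'):
    initialize_settings()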
Finally, got one error on the nosetests which I don't think is related to this issue but want to capture it here:
Sun Nov 4 08:29:10 PST 2018
...........E.......................................................................................................................................................................................................................................................
======================================================================
ERROR: Stress test multiple threads and processes.
----------------------------------------------------------------------
Traceback (most recent call last):
File "/tmp/env36/lib/python3.6/site-packages/nose/case.py", line 198, in runTest
self.test(*self.arg)
File "/Users/doug/diskcache-test/tests/stress_test_fanout.py", line 281, in stress_test_mp
stress_test(processes=4, threads=4)
File "/Users/doug/diskcache-test/tests/stress_test_fanout.py", line 249, in stress_test
with open('output-%s.pkl' % num, 'rb') as reader:
FileNotFoundError: [Errno 2] No such file or directory: 'output-1.pkl'
----------------------------------------------------------------------
Ran 259 tests in 273.576s
FAILED (errors=1)
With your brew and pyenv instructions, I was able to build Python with the same compilation flags for SQLite and reproduce the bug! I also think I've now fixed all the bugs! I've combined the changes in the issue-85 branch and updated the pull request. Travis and AppVeyor are running now.
I was able to avoid a file-lock or any kind of lock around the pragma-setting code during initialization. The trick was to set the timeout to zero and handle the retries manually while also doing the check for the old value of the pragma. That takes care of (5).
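In rough outline, the approach is something like this (illustration only, not the actual diskcache code):

import sqlite3
import time

# Zero busy timeout: SQLite fails fast and the retries are handled manually.
con = sqlite3.connect('/tmp/demo.db', timeout=0, isolation_level=None)

def set_pragma(pragma, value, retry_timeout=60):
    ((current,),) = con.execute('PRAGMA %s' % pragma).fetchall()
    if current == value:
        return  # already set: nothing to do, and no lock contention
    start = time.time()
    while True:
        try:
            con.execute('PRAGMA %s = %s' % (pragma, value)).fetchall()
            return
        except sqlite3.OperationalError as exc:
            if 'database is locked' not in str(exc):
                raise
            if time.time() - start > retry_timeout:
                raise
            time.sleep(0.001)

set_pragma('journal_mode', 'wal')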
I integrated (1), (2), and (3) from your changes. Thank you for troubleshooting on your end. I particularly like that the code is now fork-safe.
Some of the tests were bogus and have been fixed. That takes care of (4).
Concurrency bugs are hard but I ran the tests repeatedly (thousands of times) overnight without issue. I think we can declare victory. It should also be more robust to CI testing.
Before I merge the issue-85 branch and deploy the changes, would you test that it's all working on your end? If you can run the test suite repeatedly then I think we can count it fixed :)
Super that with the brew install you were able to reproduce the issues I was seeing.
I ran 30 nosetests cycles with the new code without issues.
Great to avoid any sort of lock on the initialization PRAGMA commands -- I think checking the setting value before each update attempt is key.
Thanks for working through this!
Released version 3.1.0
The current Django cache creates a cache instance per thread, so each thread's initial cache access results in the cache initialization running.
This has code to store various settings into the SQL table (from diskcache/core.py, Cache.__init__()):
If multiple threads are started at the same time, this first cache access can hit a 'database is locked' error during these writes. This is easy to demonstrate:
This results in a bunch of errors from the threads:
This is with a very generic cache configuration:
We could put the initialization inside a lock, but I'm thinking maybe for diskcache, we don't want each Django thread using a private cache object.