Closed dougjc closed 6 years ago
I've seen this issue before but there's a workaround that works for me.
I tried your repro case and the code passed without issue:
(env36) 2018-10-23 21:29:14 grantj @ Grants-MacBook-Pro : ~/repos/python-diskcache
$ ipython
Python 3.6.5 (default, Mar 29 2018, 15:37:32)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.1.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import os
In [2]: os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'tests.settings')
Out[2]: 'tests.settings'
In [3]: import django
In [4]: django.setup()
In [5]: from django.core.cache import cache
In [6]: def f():
...: cache.get('key')
...:
In [7]: import threading
In [8]: for i in range(50):
...: threading.Thread(target=f).start()
...:
In [9]:
But I've seen failures like yours in CI tests and not been able to reproduce them.
The workaround is at https://github.com/grantjenks/python-diskcache/blob/master/diskcache/core.py#L377. The high timeout (60 seconds) means that BEGIN IMMEDIATE should not raise an exception for 60 seconds.
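For illustration, a minimal sketch of that behavior (not the diskcache code itself; the path is made up):

import sqlite3

# The sqlite3 "timeout" argument sets the SQLite busy timeout, so with
# timeout=60 a conflicting BEGIN IMMEDIATE blocks for up to 60 seconds
# before raising "database is locked" instead of failing right away.
con = sqlite3.connect('/tmp/demo.db', timeout=60, isolation_level=None)
con.execute('BEGIN IMMEDIATE')
con.execute('COMMIT')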
I've seen issues with this and PRAGMAs but not with regular inserts. You'll find this comment documenting the workaround for PRAGMA statements: https://github.com/grantjenks/python-diskcache/blob/master/diskcache/core.py#L1871
How long does your snippet run before throwing the exception?
Multi-threaded and multi-process programming is tricky so I may be missing something obvious. I didn't understand your comment about "each django thread using a private cache object." Could you explain more?
My snippet fails immediately which makes me think the DATABASE_TIMEOUT is not being respected.
On the "each django thread using a private cache object.", note here https://github.com/django/django/blob/master/django/core/cache/__init__.py#L61 that the CacheHandler uses thread local storage for the self._caches variable so that each thread gets its own instance of the cache object returned from _cache_create(). You can see this path in the backtrace in the initial comments.
Interesting. I hadn’t seen that code in Django before. But it shouldn’t matter. Disk Cache is intended to be entirely thread and process safe.
What’s your platform like? I’m running locally on a Mac with CPython 3.6 built from macports.
You might also try inserting the retry logic that I pointed to for pragmas around the insert/update code for settings. It shouldn't be necessary, but I'd be willing to accept it as a workaround. I wonder if updating pragmas could cause issues for other threads. Unfortunately, the SQLite docs don't indicate what kind of locks are acquired for each pragma (but I should maybe ask).
My platform is a MacBook Pro with an SSD, and you can see the Python build info in the output at the start of the first comment.
But I don't think that is the issue...more likely a timing issue. Could you share the cache configuration you are using?
I think you are correct that the Django cache object per thread should not matter -- it just makes it much more likely to have multiple threads running through the initialization in parallel, doing all the setup commands, and thus hitting the SQLite 'database is locked' error.
Hunting around, I think this is the same generic sqlite issue as described here.
I updated the _sql property to wrap a retry on sqlite3.OperationalError and that takes care of the issue. By adding a print statement I can see the retry is getting executed.
From
return con.execute
to
def sqlr(statement, *args, **kwargs):
    start = None
    while True:
        try:
            return con.execute(statement, *args, **kwargs)
        except sqlite3.OperationalError:
            start = start or time.time()
            if time.time() - start > self._timeout:
                raise
            time.sleep(0.01)

return sqlr
Clever! The sqlr function simulated the expected timeout/retry behavior.
Could you refactor that to a separate _sql_retry property that calls _sql and wraps the function? Your current changes make core.Cache into a FanoutCache (no Timeout exceptions). I’d also like to verify that the retry behavior is only needed inside the initializer method.
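Something along these lines is what I have in mind -- a rough sketch only, with made-up names, not the actual diskcache code:

import sqlite3
import time

class RetrySketch:
    def __init__(self, connection, timeout=60):
        self._con = connection
        self._timeout = timeout

    @property
    def _sql(self):
        # Plain passthrough: callers that want Timeout semantics use this.
        return self._con.execute

    @property
    def _sql_retry(self):
        # Wraps _sql with the timeout/retry loop, for use only where needed.
        sql = self._sql

        def _execute_with_retry(statement, *args, **kwargs):
            start = time.time()
            while True:
                try:
                    return sql(statement, *args, **kwargs)
                except sqlite3.OperationalError:
                    if time.time() - start > self._timeout:
                        raise
                    time.sleep(0.001)

        return _execute_with_retry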
I just saw the SQLite pragma for busy timeout: https://www.sqlite.org/pragma.html#pragma_busy_timeout could you query that on your machine?
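For reference, the pragma can be queried directly from Python; with connect(timeout=60) it should report 60000 (milliseconds):

import sqlite3

con = sqlite3.connect('/tmp/demo.db', timeout=60)
print(con.execute('PRAGMA busy_timeout').fetchall())  # e.g. [(60000,)]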
I had already tried forcing the busy-timeout pragma after the initial connection, but it made no difference. That, plus the immediate exception, is what leads me to believe that the busy timeout is not the issue; rather, as described in the 'sqlite database locked' link, the error is unavoidable when using multiple connections.
Weird that you are not able to reproduce the issue. How many shards do you have in your test configuration? I think increasing this number would make it more likely to hit the error. Also, what sort of disk are the shards using? This would also affect the timing.
The only other difference I can think of is the version/configuration of sqlite itself. I have I believe a standard install:
SQLite version 3.19.3 2017-06-27 16:48:08
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> pragma compile_options;
BUG_COMPATIBLE_20160819
COMPILER=clang-9.0.0
DEFAULT_CACHE_SIZE=2000
DEFAULT_CKPTFULLFSYNC
DEFAULT_JOURNAL_SIZE_LIMIT=32768
DEFAULT_PAGE_SIZE=4096
DEFAULT_SYNCHRONOUS=2
DEFAULT_WAL_SYNCHRONOUS=1
ENABLE_API_ARMOR
ENABLE_COLUMN_METADATA
ENABLE_DBSTAT_VTAB
ENABLE_FTS3
ENABLE_FTS3_PARENTHESIS
ENABLE_FTS3_TOKENIZER
ENABLE_FTS4
ENABLE_FTS5
ENABLE_JSON1
ENABLE_LOCKING_STYLE=1
ENABLE_PREUPDATE_HOOK
ENABLE_RTREE
ENABLE_SESSION
ENABLE_SNAPSHOT
ENABLE_SQLLOG
ENABLE_UNKNOWN_SQL_FUNCTION
ENABLE_UPDATE_DELETE_LIMIT
HAVE_ISNAN
MAX_LENGTH=2147483645
MAX_MMAP_SIZE=1073741824
MAX_VARIABLE_NUMBER=500000
OMIT_AUTORESET
OMIT_LOAD_EXTENSION
STMTJRNL_SPILL=131072
THREADSAFE=2
USE_URI
sqlite>
I can also see that the errors occur in the normal cache.set() operations, but they are hidden/handled by the FanoutCache Timeout/retry mechanism.
On the _sql_retry request, are you asking that it be used only in the __init__() routine? If we used it within the reset() routine as well, we could remove the pragma retry logic.
Yes, I think "database is locked" may be unavoidable but I'm having difficulty pinning down the issue. I thought I had a reproducible version of the issue last night but it has disappeared this morning :( Oddly, it disappeared after I ran the benchmarks to assess any performance penalty of your retry-function wrapper.
In my local testing, I am using the defaults: 8 shards, 10ms timeout, 1GB size limit, standard core.Disk type.
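For reference, a Django configuration along those lines looks roughly like this (values match the defaults above; consult the diskcache DjangoCache docs for the exact keys):

CACHES = {
    'default': {
        'BACKEND': 'diskcache.DjangoCache',
        'LOCATION': '/tmp/django-cache',       # made-up path
        'SHARDS': 8,
        'DATABASE_TIMEOUT': 0.010,             # 10 milliseconds
        'OPTIONS': {'size_limit': 2 ** 30},    # 1 gigabyte
    },
}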
Because SQLite is embedded, the compile options of your SQLite executable and the one embedded in Python are likely different. I've written a small script to exercise the topics we've discussed in this issue and pushed it to the master branch. Would you mind cloning the repo and running the following:
$ git clone git@github.com:grantjenks/python-diskcache.git diskcache-test
$ cd diskcache-test/
$ python3.6 -m venv /tmp/env36
$ source /tmp/env36/bin/activate
$ python -m pip install "Django<2"
$ export PYTHONPATH=`pwd`
$ python tests/issue_85.py
For that last command, what's your output? Mine looks like:
REMOVING CACHE DIRECTORY
INITIALIZING DJANGO
RUNNING MULTI-THREADING INIT TEST
SQLITE COMPILE OPTIONS
COMPILER=clang-9.1.0
DISABLE_INTRINSIC
ENABLE_COLUMN_METADATA
ENABLE_FTS3
ENABLE_FTS3_PARENTHESIS
ENABLE_FTS4
ENABLE_FTS5
ENABLE_JSON1
ENABLE_RTREE
ENABLE_STAT4
ENABLE_UNLOCK_NOTIFY
SECURE_DELETE
SOUNDEX
THREADSAFE=1
CREATING DATA TABLE
RUNNING read/write
Error count: 0
RUNNING write/read
Error count: 0
RUNNING begin/read/write
Error count: 85
RUNNING begin/write/read
Error count: 0
RUNNING begin immediate/read/write
Error count: 0
RUNNING begin immediate/write/read
Error count: 0
RUNNING begin exclusive/read/write
Error count: 0
RUNNING begin exclusive/write/read
Error count: 0
If you can't pass the "RUNNING MULTI-THREADING INIT TEST" section then would you comment out that code and include the "SQLITE COMPILE OPTIONS" along with everything under "CREATING DATA TABLE". Thanks so much for your help!
What I see in the output matches my understanding of SQLite. The only errors occur from transactions that use the deferred lock and so get stuck between reads/writes. All the other scenarios work without issue.
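A minimal illustration of that deferred-lock failure mode (assuming a Settings table like the one diskcache creates):

import sqlite3

con = sqlite3.connect('/tmp/demo.db', isolation_level=None)
con.execute('CREATE TABLE IF NOT EXISTS Settings (key TEXT NOT NULL UNIQUE, value)')
con.execute('BEGIN')  # deferred: only a read lock is taken at the SELECT
con.execute('SELECT value FROM Settings WHERE key = ?', ('count',))
# If another connection commits a write here, the write below cannot upgrade
# the read lock and SQLite may fail with "database is locked" immediately
# instead of waiting for the busy timeout.
con.execute("INSERT OR REPLACE INTO Settings VALUES ('count', 1)")
con.execute('COMMIT')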
Assuming you have different error counts, could I ask you to run two more scripts:
$ python tests/benchmark_core.py -p 1
$ python tests/benchmark_core.py -p 8
$ python tests/issue_85.py
I'm just baffled that running the benchmarks could impact the transactions but that's the behavior I observed this morning!?!?@!#%$@#
I also had to modify your retry-wrapper. Now, inside Cache._sql I've experimented with:
con = self._con

def execute_with_retry(statement, *args, **kwargs):
    start = time.time()
    while True:
        try:
            return con.execute(statement, *args, **kwargs)
        except sqlite3.OperationalError as exc:
            if exc.args[0] != 'database is locked':
                raise
            diff = time.time() - start
            if diff > self._timeout:
                raise
            time.sleep(0.001)

return execute_with_retry
I had to add the exc.args[0] != 'database is locked' check. Without that check, it gets stuck waiting for 60 seconds on the missing Settings table during initialization. Did you not have that issue? (Note that I do not have the above patch applied for any of the script output shown above.)
Thanks for the detailed test instructions!
On the 60 second delay, I likely didn't see this as I was always testing with an existing setup.
Running the test, I get lots of 'database is locked' errors. Here's the output with the test commented out. Note the THREADSAFE=2, which is different from what you show. From here, I think your value of 1 is 'Serialized' while my setting of 2 is 'Multi-thread'.
(env36) doug@Dougs-MacBook-Pro:~/diskcache-test$ python tests/issue_85.py
REMOVING CACHE DIRECTORY
INITIALIZING DJANGO
RUNNING MULTI-THREADING INIT TEST
SQLITE COMPILE OPTIONS
BUG_COMPATIBLE_20160819
COMPILER=clang-9.0.0
DEFAULT_CACHE_SIZE=2000
DEFAULT_CKPTFULLFSYNC
DEFAULT_JOURNAL_SIZE_LIMIT=32768
DEFAULT_PAGE_SIZE=4096
DEFAULT_SYNCHRONOUS=2
DEFAULT_WAL_SYNCHRONOUS=1
ENABLE_API_ARMOR
ENABLE_COLUMN_METADATA
ENABLE_DBSTAT_VTAB
ENABLE_FTS3
ENABLE_FTS3_PARENTHESIS
ENABLE_FTS3_TOKENIZER
ENABLE_FTS4
ENABLE_FTS5
ENABLE_JSON1
ENABLE_LOCKING_STYLE=1
ENABLE_PREUPDATE_HOOK
ENABLE_RTREE
ENABLE_SESSION
ENABLE_SNAPSHOT
ENABLE_SQLLOG
ENABLE_UNKNOWN_SQL_FUNCTION
ENABLE_UPDATE_DELETE_LIMIT
HAS_CODEC_RESTRICTED
HAVE_ISNAN
MAX_LENGTH=2147483645
MAX_MMAP_SIZE=1073741824
MAX_VARIABLE_NUMBER=500000
OMIT_AUTORESET
OMIT_LOAD_EXTENSION
STMTJRNL_SPILL=131072
THREADSAFE=2
USE_URI
CREATING DATA TABLE
RUNNING read/write
Error count: 0
RUNNING write/read
Error count: 0
RUNNING begin/read/write
Error count: 91
RUNNING begin/write/read
Error count: 43
RUNNING begin immediate/read/write
Error count: 8
RUNNING begin immediate/write/read
Error count: 7
RUNNING begin exclusive/read/write
Error count: 36
RUNNING begin exclusive/write/read
Error count: 16
(env36) doug@Dougs-MacBook-Pro:~/diskcache-test$
Finally, here's the output from the last three commands you requested:
(env36) doug@Dougs-MacBook-Pro:~/diskcache-test$ python tests/benchmark_core.py -p 1
tests/benchmark_core.py:71: UserWarning: skipping pylibmc
warnings.warn('skipping pylibmc')
tests/benchmark_core.py:88: UserWarning: skipping redis
warnings.warn('skipping redis')
========= ========= ========= ========= ========= ========= ========= =========
Timings for diskcache.Cache
-------------------------------------------------------------------------------
Action Count Miss Median P90 P99 Max Total
========= ========= ========= ========= ========= ========= ========= =========
get 89115 8714 26.226us 35.048us 50.068us 276.089us 2.463s
set 8941 0 138.044us 184.059us 316.858us 21.123ms 2.116s
delete 943 111 108.004us 163.078us 5.378ms 17.564ms 260.050ms
Total 98999 4.839s
========= ========= ========= ========= ========= ========= ========= =========
========= ========= ========= ========= ========= ========= ========= =========
Timings for diskcache.FanoutCache(shards=4, timeout=1.0)
-------------------------------------------------------------------------------
Action Count Miss Median P90 P99 Max Total
========= ========= ========= ========= ========= ========= ========= =========
get 89115 8714 27.180us 36.240us 54.121us 250.816us 2.619s
set 8941 0 133.038us 181.913us 755.072us 26.630ms 2.107s
delete 943 111 103.712us 160.933us 787.973us 33.084ms 235.808ms
Total 98999 4.962s
========= ========= ========= ========= ========= ========= ========= =========
========= ========= ========= ========= ========= ========= ========= =========
Timings for diskcache.FanoutCache(shards=8, timeout=0.010)
-------------------------------------------------------------------------------
Action Count Miss Median P90 P99 Max Total
========= ========= ========= ========= ========= ========= ========= =========
get 89115 8714 27.895us 36.955us 58.174us 5.004ms 2.674s
set 8941 0 137.806us 187.159us 760.078us 43.319ms 2.140s
delete 943 111 105.858us 158.072us 256.062us 19.788ms 184.315ms
Total 98999 4.998s
========= ========= ========= ========= ========= ========= ========= =========
(env36) doug@Dougs-MacBook-Pro:~/diskcache-test$ python tests/benchmark_core.py -p 8
tests/benchmark_core.py:71: UserWarning: skipping pylibmc
warnings.warn('skipping pylibmc')
tests/benchmark_core.py:88: UserWarning: skipping redis
warnings.warn('skipping redis')
========= ========= ========= ========= ========= ========= ========= =========
Timings for diskcache.Cache
-------------------------------------------------------------------------------
Action Count Miss Median P90 P99 Max Total
========= ========= ========= ========= ========= ========= ========= =========
get 712612 70849 46.968us 106.096us 3.064ms 22.244ms 75.879s
set 71464 0 1.893ms 12.988ms 28.938ms 219.795ms 352.205s
delete 7916 789 1.787ms 12.651ms 28.946ms 127.790ms 38.114s
Total 791992 466.199s
========= ========= ========= ========= ========= ========= ========= =========
Process Process-13:
Traceback (most recent call last):
File "/Users/doug/.pyenv/versions/3.6.2/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/Users/doug/.pyenv/versions/3.6.2/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "tests/benchmark_core.py", line 96, in worker
obj = kind(*args, **kwargs)
File "/Users/doug/diskcache-test/diskcache/fanout.py", line 38, in __init__
for num in range(shards)
File "/Users/doug/diskcache-test/diskcache/fanout.py", line 38, in <genexpr>
for num in range(shards)
File "/Users/doug/diskcache-test/diskcache/core.py", line 434, in __init__
sql(query, (key, value))
sqlite3.OperationalError: database is locked
Traceback (most recent call last):
File "tests/benchmark_core.py", line 206, in <module>
dispatch()
File "tests/benchmark_core.py", line 165, in dispatch
with open(filename, 'rb') as reader:
FileNotFoundError: [Errno 2] No such file or directory: 'output-4.pkl'
(env36) doug@Dougs-MacBook-Pro:~/diskcache-test$ python tests/issue_85.py
REMOVING CACHE DIRECTORY
INITIALIZING DJANGO
RUNNING MULTI-THREADING INIT TEST
SQLITE COMPILE OPTIONS
BUG_COMPATIBLE_20160819
COMPILER=clang-9.0.0
DEFAULT_CACHE_SIZE=2000
DEFAULT_CKPTFULLFSYNC
DEFAULT_JOURNAL_SIZE_LIMIT=32768
DEFAULT_PAGE_SIZE=4096
DEFAULT_SYNCHRONOUS=2
DEFAULT_WAL_SYNCHRONOUS=1
ENABLE_API_ARMOR
ENABLE_COLUMN_METADATA
ENABLE_DBSTAT_VTAB
ENABLE_FTS3
ENABLE_FTS3_PARENTHESIS
ENABLE_FTS3_TOKENIZER
ENABLE_FTS4
ENABLE_FTS5
ENABLE_JSON1
ENABLE_LOCKING_STYLE=1
ENABLE_PREUPDATE_HOOK
ENABLE_RTREE
ENABLE_SESSION
ENABLE_SNAPSHOT
ENABLE_SQLLOG
ENABLE_UNKNOWN_SQL_FUNCTION
ENABLE_UPDATE_DELETE_LIMIT
HAS_CODEC_RESTRICTED
HAVE_ISNAN
MAX_LENGTH=2147483645
MAX_MMAP_SIZE=1073741824
MAX_VARIABLE_NUMBER=500000
OMIT_AUTORESET
OMIT_LOAD_EXTENSION
STMTJRNL_SPILL=131072
THREADSAFE=2
USE_URI
CREATING DATA TABLE
RUNNING read/write
Error count: 0
RUNNING write/read
Error count: 0
RUNNING begin/read/write
Error count: 91
RUNNING begin/write/read
Error count: 46
RUNNING begin immediate/read/write
Error count: 54
RUNNING begin immediate/write/read
Error count: 12
RUNNING begin exclusive/read/write
Error count: 18
RUNNING begin exclusive/write/read
Error count: 27
Thanks for running the script and the quick reply. It's fascinating to see the different error counts on different machines!
The threadsafe setting difference is interesting but not exactly useful. At least, I'm not sure what it means.
I think the best way forward is to work around the issue. Seems like your retry logic is needed. So some open questions: which clock should the retry use: time()? monotonic()? perf_counter()? And should the loop test elapsed time or just retry a fixed number of times (for _ in range(1000): pass)? Does it matter? (There's an interesting chart of sleep() accuracy.) I think the goal is for the currently running thread to yield to others. CPython's GIL logic may complicate the answer here.
I noticed tonight that the inserts in __init__ do not occur inside a transaction. I wonder if that's the issue. I didn't use a transaction for the whole of the SQL statements in __init__ because some statements, like CREATE TABLE, do not work inside transactions. But I'm very careful in the rest of the code to put every INSERT/UPDATE/DELETE inside a transaction; everywhere but there in __init__, I use a transaction. My memory is a bit foggy, but I recall learning that a transaction was needed around all write operations when managing them myself (using isolation_level=None in the code).
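A sketch of that discipline with isolation_level=None (illustration only, not the diskcache source):

import sqlite3

# With isolation_level=None the sqlite3 module never opens transactions on
# its own, so every write is wrapped explicitly in BEGIN IMMEDIATE/COMMIT,
# which takes the write lock up front instead of deferring it.
con = sqlite3.connect('/tmp/demo.db', isolation_level=None)
con.execute('CREATE TABLE IF NOT EXISTS Settings (key TEXT NOT NULL UNIQUE, value)')
con.execute('BEGIN IMMEDIATE')
con.execute('INSERT OR REPLACE INTO Settings VALUES (?, ?)', ('count', 0))
con.execute('COMMIT')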
I updated tests/issue_85.py and refactored everything into functions. Do you think you could re-run that and paste the output? The new script indicates whether my assumptions about serializability using BEGIN IMMEDIATE and COMMIT are true on your machine. Here are the steps again to run the script:
$ git clone git@github.com:grantjenks/python-diskcache.git diskcache-test
$ cd diskcache-test/
$ python3.6 -m venv /tmp/env36
$ source /tmp/env36/bin/activate
$ python -m pip install "Django<2"
$ export PYTHONPATH=`pwd`
$ python tests/issue_85.py
As before, you'll probably have to comment out multi_threading_init_test() because it will fail on your machine.
Sorry for the delay ...
I pulled in the changes and had to comment out multi_threading_init_test() as you expected. Here's the output:
(arcviz_3.6.2) doug@dougs-mbp:~/diskcache-test$ python tests/issue_85.py
REMOVING CACHE DIRECTORY
INITIALIZING DJANGO
SQLITE COMPILE OPTIONS
BUG_COMPATIBLE_20160819
COMPILER=clang-9.0.0
DEFAULT_CACHE_SIZE=2000
DEFAULT_CKPTFULLFSYNC
DEFAULT_JOURNAL_SIZE_LIMIT=32768
DEFAULT_PAGE_SIZE=4096
DEFAULT_SYNCHRONOUS=2
DEFAULT_WAL_SYNCHRONOUS=1
ENABLE_API_ARMOR
ENABLE_COLUMN_METADATA
ENABLE_DBSTAT_VTAB
ENABLE_FTS3
ENABLE_FTS3_PARENTHESIS
ENABLE_FTS3_TOKENIZER
ENABLE_FTS4
ENABLE_FTS5
ENABLE_JSON1
ENABLE_LOCKING_STYLE=1
ENABLE_PREUPDATE_HOOK
ENABLE_RTREE
ENABLE_SESSION
ENABLE_SNAPSHOT
ENABLE_SQLLOG
ENABLE_UNKNOWN_SQL_FUNCTION
ENABLE_UPDATE_DELETE_LIMIT
HAS_CODEC_RESTRICTED
HAVE_ISNAN
MAX_LENGTH=2147483645
MAX_MMAP_SIZE=1073741824
MAX_VARIABLE_NUMBER=500000
OMIT_AUTORESET
OMIT_LOAD_EXTENSION
STMTJRNL_SPILL=131072
THREADSAFE=2
USE_URI
CREATING DATA TABLE
RUNNING begin/read/write
Error count: 98
Begin count: 100
Commit count: 2
Serialized: False
RUNNING begin/write/read
Error count: 97
Begin count: 100
Commit count: 3
Serialized: False
RUNNING begin immediate/read/write
Error count: 95
Begin count: 5
Commit count: 5
Serialized: True
RUNNING begin immediate/write/read
Error count: 92
Begin count: 8
Commit count: 8
Serialized: True
RUNNING begin exclusive/read/write
Error count: 96
Begin count: 4
Commit count: 4
Serialized: True
RUNNING begin exclusive/write/read
Error count: 96
Begin count: 4
Commit count: 4
Serialized: True
On the sleep/retry question, my thinking is that the time/sleep method is best. A 1ms sleep is fast but reasonable on current machines. I also prefer testing for elapsed time rather than a loop count, as this gives an accurate timeout regardless of how accurate the sleep is or how long each attempt takes.
I continue to believe that the THREADSAFE setting is why we are seeing such different results. I spent a bunch of time trying to figure out how to pass the 'flags' parameter to the connect() call and switch the compiled code to use sqlite3_open_v2(), as was done here (for, I think, the pysqlite module for Python 2), but getting the whole build environment set up for Python 3 was taking more time than I had available.
While I'd love to confirm the above (as I see lots of folks hitting this 'database is locked' error in many contexts), I think for diskcache it doesn't really matter, since we want it to work with folks' current Python environments.
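For what it's worth, the compile options of the SQLite library that Python embeds (the one diskcache actually uses) can be checked without rebuilding anything:

import sqlite3

con = sqlite3.connect(':memory:')
print(sqlite3.sqlite_version)
for (option,) in con.execute('PRAGMA compile_options'):
    print(option)  # includes the THREADSAFE=... setting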
Ok, having thought about it and played with the code, here are my answers: the retry logic is only needed in __init__ and inside reset; these are both special cases, and trying to use the same logic everywhere caused other problems. And time.time() is good enough.
I have a potential fix implemented in branch issue-85. Would you mind pulling that branch and running the tests? Something like:
$ git clone git@github.com:grantjenks/python-diskcache.git diskcache-test
$ cd diskcache-test/
$ git checkout issue-85
$ python3.6 -m venv /tmp/env36
$ source /tmp/env36/bin/activate
$ python -m pip install -r requirements.txt
$ nosetests
All the tests should pass.
Could you also try it on your DjangoCache test case with multiple threads?
Assuming all the tests pass, I'll merge the pull request (#86) and deploy a new version.
Thank you for the sql-retry code.
Thanks for the update. I'm still getting a lot of errors that I need to dig into.
In one case, I'm getting a lot of 'sqlite3.OperationalError: locking protocol' failures on the stress_test_deque_mp.py test. The output is messy due to multiple processes all printing tracebacks at the same time, but starts like:
(env36) doug@Dougs-MacBook-Pro:~/diskcache-test$ python tests/stress_test_deque_mp.py
Process Process-4:
Traceback (most recent call last):
File "/Users/doug/.pyenv/versions/3.6.2/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/Users/doug/.pyenv/versions/3.6.2/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "tests/stress_test_deque_mp.py", line 106, in stress
if len(deque) > 100:
File "/Users/doug/diskcache-test/diskcache/persistent.py", line 296, in __len__
return len(self._cache)
File "/Users/doug/diskcache-test/diskcache/core.py", line 1850, in __len__
return self.reset('count')
File "/Users/doug/diskcache-test/diskcache/core.py", line 1893, in reset
(value,), = sql(select, (key,)).fetchall()
File "/Users/doug/diskcache-test/diskcache/core.py", line 588, in _execute_with_retry
return con.execute(statement, *args, **kwargs)
MemoryError
Process Process-1:
Process Process-3:
Process Process-2:
Traceback (most recent call last):
Traceback (most recent call last):
File "/Users/doug/.pyenv/versions/3.6.2/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/Users/doug/.pyenv/versions/3.6.2/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "tests/stress_test_deque_mp.py", line 106, in stress
if len(deque) > 100:
File "/Users/doug/diskcache-test/diskcache/persistent.py", line 296, in __len__
return len(self._cache)
File "/Users/doug/diskcache-test/diskcache/core.py", line 1850, in __len__
return self.reset('count')
File "/Users/doug/diskcache-test/diskcache/core.py", line 1893, in reset
(value,), = sql(select, (key,)).fetchall()
File "/Users/doug/diskcache-test/diskcache/core.py", line 588, in _execute_with_retry
return con.execute(statement, *args, **kwargs)
File "/Users/doug/.pyenv/versions/3.6.2/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/Users/doug/.pyenv/versions/3.6.2/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "tests/stress_test_deque_mp.py", line 106, in stress
if len(deque) > 100:
sqlite3.OperationalError: locking protocol
I found this link https://stackoverflow.com/questions/46331178/causes-of-sqlite3-operationalerror-locking-protocol-exception which seems to apply.
I then added a close() at the start of the target stress function so that each process would open a new connection:
def stress(seed, deque):
    deque._cache.close()
    random.seed(seed)
And with this change, the test ran clean. I think other mp tests might need this same type of change but haven't gone through them yet.
I got to a set of changes so that 30 iterations of the nosetests run clean (the changes are reflected in the diff below).
With changes 1-4, I could get to about 5-10 iterations of nosetests before a failure, but I needed change 5 to get to (at least) 30 iterations without errors. What led me there was adding some instrumentation and finding that occasionally a single PRAGMA operation was taking 50 or more seconds (when multiple operations were ongoing) before failing with 'database is locked' -- specifically the 'PRAGMA journal_mode = wal' and 'PRAGMA auto_vacuum = 1' operations.
This seems like a reasonable set of changes. Here's the diff that you can consider:
diff --git a/diskcache/core.py b/diskcache/core.py
index d7de9a2..e42abef 100644
--- a/diskcache/core.py
+++ b/diskcache/core.py
@@ -114,6 +114,7 @@ EVICTION_POLICY = {
     },
 }
+PROCESS_LOCK = threading.Lock()
 class Disk(object):
     "Cache key and value serialization for SQLite database and files."
@@ -407,12 +408,12 @@ class Cache(object):
             sets.pop(key, None)
         # Chance to set pragmas before any tables are created.
-        for key, value in sorted(sets.items()):
-            if not key.startswith('sqlite_'):
-                continue
-
-            self.reset(key, value, update=False)
+        with PROCESS_LOCK:
+            for key, value in sorted(sets.items()):
+                if not key.startswith('sqlite_'):
+                    continue
+                self.reset(key, value, update=False)
         sql('CREATE TABLE IF NOT EXISTS Settings ('
             ' key TEXT NOT NULL UNIQUE,'
@@ -538,9 +539,14 @@ class Cache(object):
     @property
     def _con(self):
+        pid = getattr(self._local, 'pid', None)
         con = getattr(self._local, 'con', None)
+        if pid != os.getpid() and con is not None:
+            self.close()
+            con = None
         if con is None:
+            pid = self._local.pid = os.getpid()
             con = self._local.con = sqlite3.connect(
                 op.join(self._directory, DBNAME),
                 timeout=self._timeout,
@@ -587,7 +593,7 @@ class Cache(object):
                 try:
                     return con.execute(statement, *args, **kwargs)
                 except sqlite3.OperationalError as exc:
-                    if exc.args[0] != 'database is locked':
+                    if not 'database is locked' in str(exc):
                         raise
                     diff = time.time() - start
                     if diff > 60:
@@ -600,11 +606,12 @@ class Cache(object):
     @cl.contextmanager
     def _transact(self, filename=None):
         sql = self._sql
+        sql_retry = self._sql_retry
         filenames = []
         _disk_remove = self._disk.remove
         try:
-            sql('BEGIN IMMEDIATE')
+            sql_retry('BEGIN IMMEDIATE')
         except sqlite3.OperationalError:
             if filename is not None:
                 _disk_remove(filename)
@@ -1912,7 +1919,9 @@ class Cache(object):
             # ``_execute_with_retry`` in ``self._sql_retry``.
             pragma = key[7:]
-            sql('PRAGMA %s = %s' % (pragma, value)).fetchall()
+            curval = sql('PRAGMA %s' % (pragma)).fetchall()[0][0]
+            if curval != value:
+                sql('PRAGMA %s = %s' % (pragma, value)).fetchall()
         elif key.startswith('disk_'):
             attr = key[5:]
             setattr(self._disk, attr, value)
diff --git a/tests/test_core.py b/tests/test_core.py
index d58e013..58d6fc5 100644
--- a/tests/test_core.py
+++ b/tests/test_core.py
@@ -187,6 +187,7 @@ def test_pragma_error(cache):
     fetchall = mock.Mock()
     local.con = con
+    local.pid = os.getpid()
     con.execute = execute
     execute.return_value = cursor
     cursor.fetchall = fetchall
@@ -336,6 +337,7 @@ def test_set_timeout(cache):
     execute = mock.Mock()
     local.con = con
+    local.pid = os.getpid()
     con.execute = execute
     execute.side_effect = sqlite3.OperationalError
@@ -954,6 +956,7 @@ def test_add_timeout(cache):
     execute = mock.Mock()
     local.con = con
+    local.pid = os.getpid()
     con.execute = execute
     execute.side_effect = sqlite3.OperationalError
It looks to me that stress_test_deque_mp.py is simply broken. I don't know why it works on my machine. I expected multiprocessing to pickle all the arguments to the process, but it does not. After some print-debugging and careful reading of the docs, multiprocessing does not pickle the arguments for the "fork" start method. I should probably fix that and see where else that assumption is invalid.
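A small demonstration of that point (Unix-only, since the "fork" start method is not available on Windows):

import multiprocessing as mp
import sqlite3

def worker(con):
    # With "fork" the child simply inherits the parent's memory, so an
    # unpicklable argument such as an open SQLite connection gets through;
    # with "spawn" the arguments must be pickled and this would fail.
    print('child received', con)

if __name__ == '__main__':
    con = sqlite3.connect(':memory:')
    ctx = mp.get_context('fork')
    p = ctx.Process(target=worker, args=(con,))
    p.start()
    p.join()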
Your changes are reasonable but let me ask a few questions:
1. Ok, good idea.
2. If I use multiprocessing correctly, can you get rid of the pid check? That may be a worthwhile feature, but it's not part of the current design. In the docs, I explicitly state that diskcache objects do not survive forking. But this would solve that issue, I think. Looking up the pid takes 91ns on my machine, so I think we can afford it.
3. Why only set the pragma if it's changed? Just an optimization, or something more?
4. Why retry on BEGIN IMMEDIATE? I think it should be ok for the API to just Timeout early. Is this just to make the timeout more accurate?
5. The PROCESS_LOCK only seems to guard setting the pragmas before any tables are created. Pragmas are kind of quirky directives, so I can see how this (5) combined with (3) would reduce a lot of pragma-setting conflicts. The 50 second delay that you saw, that wasn't simply _sql_retry repeating for 60 seconds during initialization, was it? Also, should this be a multiprocessing.Lock instead? Can there be pragma conflicts from different processes?
As I look at them, your changes are growing on me. I want diskcache to be bullet proof but I also need the source to be maintainable.
I'm sorry I can't reproduce your issues locally. That would make things much simpler. Out of curiosity, where does your Python executable come from? Is it CPython from python.org? Macports? Homebrew? Custom build?
Definitely would be good if you could repro these issues. I continue to believe the THREADSAFE setting is the difference. Here are the instructions I used to set up my env:
## Download and install Python 3.6.2 using pyenv
$ brew update && brew upgrade pyenv
$ brew update && brew upgrade # Optional
## Note: You might need to start a new terminal shell after this step
$ pyenv install 3.6.2
Installing Python-3.6.2...
Installed Python-3.6.2 to /Users/----/.pyenv/versions/3.6.2
## Verify 3.6.2 is installed
$ pyenv versions
system
2.6.6
2.7
3.6.2
For (2), yes if the tests are fixed and the docs say that cache objects don't survive forks, then dropping this change seems ok. Although it's not an unreasonable protection--either as is or switched to an assert to enforce users don't attempt to use cache objects across a fork.
For (3), since a lot of the errors were on these operations, I was guessing that these calls trigger long-running operations and was trying to reduce the need for making them. Likely this change doesn't matter.
For (4), I didn't look into the context, so I don't know the answers to your questions at this time, but without this change I get errors like:
(env36) doug@dougs-mbp:~/diskcache-test$ for i in `seq 1 30` ; do echo $i; date; nosetests; done
1
Sun Nov 4 08:14:48 PST 2018
....Exception in thread Thread-12:
Traceback (most recent call last):
File "/Users/doug/diskcache-test/diskcache/core.py", line 614, in _transact
sql('BEGIN IMMEDIATE')
sqlite3.OperationalError: database is locked
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/doug/.pyenv/versions/3.6.2/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/Users/doug/.pyenv/versions/3.6.2/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/Users/doug/diskcache-test/tests/stress_test_core.py", line 134, in worker
cache.set(key, value, expire=EXPIRE)
File "/Users/doug/diskcache-test/diskcache/core.py", line 675, in set
with self._transact(filename) as (sql, cleanup):
File "/Users/doug/.pyenv/versions/3.6.2/lib/python3.6/contextlib.py", line 81, in __enter__
return next(self.gen)
File "/Users/doug/diskcache-test/diskcache/core.py", line 618, in _transact
raise Timeout
diskcache.core.Timeout
Exception in thread Thread-9:
Traceback (most recent call last):
File "/Users/doug/diskcache-test/diskcache/core.py", line 614, in _transact
sql('BEGIN IMMEDIATE')
sqlite3.OperationalError: database is locked
During handling of the above exception, another exception occurred:
For (5), correct: the 50 second (and longer) delay was not the _sql_retry delay but the time for a single execute operation within the _sql_retry loop before it failed with 'database is locked'. Note that given this timing, the previous pragma retry loop of 60000 cycles is essentially 'retry forever until this passes'. I believe you are also correct that different processes could conflict, so a multiprocessing lock would in some sense be safer, but those are also more complex and seem suited to single multiprocessing applications -- for example, what if multiple completely separate applications are using diskcache with the same cache location? If we really want single access during init, then I think adding a file system lock, perhaps like https://pypi.org/project/filelock/, makes more sense.
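A hypothetical sketch of that file-lock idea, using the third-party filelock package (initialize_settings is a stand-in for the pragma/Settings setup):

from filelock import FileLock  # pip install filelock

def initialize_settings():
    # Stand-in for the pragma and Settings-table initialization.
    pass

# Only one process (or thread) at a time runs the initialization, even for
# completely separate applications sharing the same cache directory.
with FileLock('/tmp/cache-directory/init.lock'):
    initialize_settings()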
Finally, got one error on the nosetests which I don't think is related to this issue but want to capture it here:
Sun Nov 4 08:29:10 PST 2018
...........E.......................................................................................................................................................................................................................................................
======================================================================
ERROR: Stress test multiple threads and processes.
----------------------------------------------------------------------
Traceback (most recent call last):
File "/tmp/env36/lib/python3.6/site-packages/nose/case.py", line 198, in runTest
self.test(*self.arg)
File "/Users/doug/diskcache-test/tests/stress_test_fanout.py", line 281, in stress_test_mp
stress_test(processes=4, threads=4)
File "/Users/doug/diskcache-test/tests/stress_test_fanout.py", line 249, in stress_test
with open('output-%s.pkl' % num, 'rb') as reader:
FileNotFoundError: [Errno 2] No such file or directory: 'output-1.pkl'
----------------------------------------------------------------------
Ran 259 tests in 273.576s
FAILED (errors=1)
With your brew and pyenv instructions, I was able to build Python with the same compilation flags for SQLite and reproduce the bug! I also think I've now fixed all the bugs! I've combined the changes in the issue-85 branch and updated the pull request. Travis and AppVeyor are running now.
I was able to avoid a file-lock or any kind of lock around the pragma-setting code during initialization. The trick was to set the timeout to zero and handle the retries manually while also doing the check for the old value of the pragma. That takes care of (5).
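In rough outline, the approach is something like this (illustration only, not the actual diskcache code):

import sqlite3
import time

# Zero busy timeout: SQLite fails fast and the retries are handled manually.
con = sqlite3.connect('/tmp/demo.db', timeout=0, isolation_level=None)

def set_pragma(pragma, value, retry_timeout=60):
    ((current,),) = con.execute('PRAGMA %s' % pragma).fetchall()
    if current == value:
        return  # already set: nothing to do, and no lock contention
    start = time.time()
    while True:
        try:
            con.execute('PRAGMA %s = %s' % (pragma, value)).fetchall()
            return
        except sqlite3.OperationalError as exc:
            if 'database is locked' not in str(exc):
                raise
            if time.time() - start > retry_timeout:
                raise
            time.sleep(0.001)

set_pragma('journal_mode', 'wal')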
I integrated (1), (2), and (3) from your changes. Thank you for troubleshooting on your end. I particularly like that the code is now fork-safe.
Some of the tests were bogus and have been fixed. That takes care of (4).
Concurrency bugs are hard but I ran the tests repeatedly (thousands of times) overnight without issue. I think we can declare victory. It should also be more robust to CI testing.
Before I merge the issue-85 branch and deploy the changes, would you test that it's all working on your end? If you can run the test suite repeatedly then I think we can count it fixed :)
Super that with the brew install you were able to reproduce the issues I was seeing.
I ran 30 nosetests cycles with the new code without issues.
Great to avoid any sort of lock on the initialization PRAGMA commands -- I think checking the setting value before each update attempt is key.
Thanks for working through this!
Released version 3.1.0
The current Django cache creates a cache instance per thread, so each thread's initial cache access results in the cache initialization running.
This has code to store various settings into the SQL table (from diskcache/core.py, Cache.__init__()):
If multiple threads are started at the same time, this first cache access can hit a 'database is locked' error during these writes. This is easy to demonstrate:
This results in a bunch of errors from the threads:
This is with a very generic cache configuration:
We could put the initialization inside a lock, but I'm thinking maybe for diskcache, we don't want each Django thread using a private cache object.