piskvorky / sqlitedict

Persistent dict, backed by sqlite3 and pickle, multithread-safe.
Apache License 2.0
1.17k stars 131 forks source link

Version 2.0.0 significantly slower #152

Closed psinger closed 1 year ago

psinger commented 2 years ago

Took me some time to find the culprit, but after upgrading to sqlitedict 2.0.0 the writing is significantly slower.

I am writing with:

with SqliteDict("tmp.db") as tmp:
    tmp["tmp"] = 1

Is this expected due to some new functionality?

piskvorky commented 2 years ago

Definitely not expected. Thanks for reporting.

IIRC we merged some contributions around nested exception handling, which could be the culprit. I can't think of anything else.

@psinger are you able to pinpoint the issue please? Or at least provide a minimal reproducible example.

psinger commented 2 years ago

@piskvorky Yes, sure, so you can run for example:

from sqlitedict import SqliteDict
def insert():
    for j in range(100):
        with SqliteDict("tmp.db") as d:
            d["tmp"] = j
            d.commit()

%time insert()

Runtime for version 1.7.0:

CPU times: user 113 ms, sys: 32.3 ms, total: 145 ms
Wall time: 115 ms

Runtime for version 2.0.0:

CPU times: user 147 ms, sys: 54.4 ms, total: 202 ms
Wall time: 10.2 s
piskvorky commented 2 years ago

Oh wow, that's a huge regression. The CPU times look similar but the wall time exploded. Thanks again for reporting.

Are you able to look into it? The surface of changes between 1.7 and 2.0 is pretty small, so hopefully that's not too hard. We might have to revert some commits and make a bugfix release.

psinger commented 2 years ago

Sorry, I will not have time to look into it, I migrated back to 1.7.0 for the time being.

mpenkov commented 2 years ago

OK, reproduced on MacOS Python 3.9:

sergeyich:sqlitedict misha$ cat repro.py
import tempfile
from sqlitedict import SqliteDict
def insert():
    with tempfile.NamedTemporaryFile() as tmp:
        for j in range(100):
            with SqliteDict(tmp.name) as d:
                d["tmp"] = j
                d.commit()
insert()
sergeyich:sqlitedict misha$ time python3.9 repro.py

real    0m10.033s
user    0m0.237s
sys     0m0.246s
$ git co 1.7.0
Note: switching to '1.7.0'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at ca6758f update author_email to work with PyPI
sergeyich:sqlitedict misha$ time python3.9 repro.py

real    0m0.377s
user    0m0.189s
sys     0m0.179s

I'll run a bisection and see what commit caused this.

mpenkov commented 2 years ago

I think I've found the commit that caused the problem, now trying to work out why something so trivial is causing such a performance regression:

$ git bisect log
git bisect start
# bad: [fdaf06167eebbd22ecd6932ac54487d4f3db1713] update changelog
git bisect bad fdaf06167eebbd22ecd6932ac54487d4f3db1713
# good: [ca6758fadd55e4ee7f36b26e3a39558954e91bc6] update author_email to work with PyPI
git bisect good ca6758fadd55e4ee7f36b26e3a39558954e91bc6
# good: [d5db9e67ecfd9908c06efa9d7a4a9e3e9145550c] More cleanup for flake8
git bisect good d5db9e67ecfd9908c06efa9d7a4a9e3e9145550c
# good: [e6a81f0558b6aa0e4bf4984466c6373d3aee2b52] README.rst: update performance notes.
git bisect good e6a81f0558b6aa0e4bf4984466c6373d3aee2b52
# bad: [12f023652674cd9e061c806fd1d50e3c1958f68d] minimum supported Py version is 3.7
git bisect bad 12f023652674cd9e061c806fd1d50e3c1958f68d
# bad: [3f2df9bb557e3351bc16843612e9090d7f5ba784] update python-package.yml
git bisect bad 3f2df9bb557e3351bc16843612e9090d7f5ba784
# bad: [bb8087ceb4408c0e10e3591912a7596b9448358d] Update sqlitedict.py
git bisect bad bb8087ceb4408c0e10e3591912a7596b9448358d
# bad: [90c16aeaaa69c831c7f71370d560e880c5d630ae] Rename initialized variable to not collide with superclass'es internal one
git bisect bad 90c16aeaaa69c831c7f71370d560e880c5d630ae
# first bad commit: [90c16aeaaa69c831c7f71370d560e880c5d630ae] Rename initialized variable to not collide with superclass'es internal one

https://github.com/RaRe-Technologies/sqlitedict/commit/90c16aeaaa69c831c7f71370d560e880c5d630ae

psinger commented 2 years ago

That's just variable renaming....sounds unlikely it could cause any issues.

mpenkov commented 2 years ago

Nah, look at how that variable is being used. There's a sleep loop that depends on it.

psinger commented 2 years ago

Right, but before it was called self._initialized with the same functionality.

mpenkov commented 2 years ago

It's not quite the same functionality, because the superclass defines that variable and changes its value as it sees fit.

$ grep -B 10 _initialized /usr/local/Cellar/python@3.9/3.9.7/Frameworks/Python.framework/Versions/3.9/lib/python3.9/threading.py --max-count 1

class Thread:
    """A class that represents a thread of control.

    This class can be safely subclassed in a limited fashion. There are two ways
    to specify the activity: by passing a callable object to the constructor, or
    by overriding the run() method in a subclass.

    """

    _initialized = False
mpenkov commented 2 years ago

Basically, the problem itself isn't the commit in question. The problem is the way that sqlitedict is attempting to handle a race condition.

The commit merely unmasks the problem. I suspect that because that race condition handling has been effectively disabled prior to the commit, the race condition may not be such a big deal after all.

I'll have to think about it some more, but I think I see what's going on here.

psinger commented 2 years ago

Is the sleep necessary? Just removing it could fix the issue. But right, I guess the sleep tries to avoid the race condition.

mpenkov commented 1 year ago

Interestingly, cannot reproduce on Py3.9 Linux:


(testenv) misha@cabron:~/git/sqlitedict$ python --version
Python 3.9.14
(testenv) misha@cabron:~/git/sqlitedict$ time python repro.py 

real    0m0.107s
user    0m0.090s
sys     0m0.016s