python / cpython

The Python programming language
https://www.python.org

Add sqlite3 as another possible backing store for the dbm module #100414

Closed: rhettinger closed this 8 months ago

rhettinger commented 1 year ago

Right now we support dbm.ndbm and dbm.gnu, which might or might not be part of a given build. The fallback is the super slow dumbdbm. Now that sqlite3 is part of the standard build, we can do better.

The module docstring says:

Future versions may change the order in which implementations are tested for existence, and add interfaces to other dbm-like implementations.

The future is now. Let's provide a fast, stable, robust, always available alternative.

This can be done with pure Python calls to the existing sqlite3 module, or there can be a C extension that calls the SQLite C API directly.

This would automatically be available to the shelve module, giving us a high-quality, persistent key-value store.
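
Any backend reachable through dbm.open() is picked up by shelve automatically, so the shelve side of this already exists today; a minimal sketch of that API (the filename and stored value are made up for illustration):

import shelve

# Illustrative only: "example_store" is a made-up filename.
# shelve.open() delegates to dbm.open(), so whichever dbm backend is
# available (ndbm, gnu, dumb, or a future sqlite3 backend) is chosen
# automatically.
with shelve.open("example_store") as shelf:
    shelf["config"] = {"retries": 3, "timeout": 2.5}   # values are pickled
    print(shelf["config"]["retries"])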


erlend-aasland commented 1 year ago

See also #48033

erlend-aasland commented 1 year ago

IIRC, I've got a fairly up-to-date branch for this, based on your previous efforts, Raymond.

erlend-aasland commented 1 year ago

@rhettinger, would you like to continue the discussion here or on #48033? There is no use in keeping two identical issues open; let's close one of them.

presidento commented 1 year ago

Here are my 50 cents about this.

TL;DR

In detail

The biggest problem with dbm.dumb (at least for me) is that it is not fail-safe; see the example below. Moreover, this is not mentioned in the documentation. If you kill the Python process, the database will be corrupted every time. (This is because the key file is updated only when keys are added or deleted, and on close.)

import dbm.dumb
import shelve
import itertools

# Open (or create) a dumbdbm database and wrap it in a Shelf.
database = dbm.dumb.open("test", "c")
shelf = shelve.Shelf(database)

if "key" in shelf:
    print("The stored value is: " + shelf["key"])
    input("Press ENTER to continue")
    del shelf["key"]

try:
    # Keep overwriting the same key with ever larger values.
    for index in itertools.count():
        print(f"Persist value {index} to database")
        shelf["key"] = f"value {index} " + ("x" * index)
except KeyboardInterrupt:
    # Ctrl+C lands here; killing the process outright (e.g. SIGKILL)
    # leaves the on-disk files corrupted.
    print("Exiting gracefully. Try killing the Python process.")

Performance measurements

Sorry, on Windows I do not have dbm.gnu. Copied from https://github.com/Dobatymo/lmdb-python-dbm/pull/2#issuecomment-1345534594. The numbers are the time in seconds to process the given number of items; a rough sketch of this kind of timing loop is shown after the tables.

write

| items | lmdbm | vedis | unqlite | rocksdict | dbm.gnu | semidbm | pysos | dbm.dumb | sqlite-wal | sqlite-autocommit | sqlite-batch | dummypickle | dummyjson |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 0.015 | 0.000 | 0.000 | 0.031 | - | 0 | 0 | 0 | 0.015 | 0.015 | 0 | 0.015 | 0 |
| 100 | 0.141 | 0.000 | 0.000 | 0.047 | - | 0 | 0 | 0.031 | 0.031 | 0.109 | 0.094 | 0.156 | 0.172 |
| 1000 | 1.609 | 0.000 | 0.016 | 0.062 | - | 0.016 | 0.031 | 0.328 | 0.328 | 1.015 | 0.937 | 1.781 | 4.250 |
| 10000 | 15.359 | 0.062 | 0.062 | 0.172 | - | 0.140 | 0.250 | 3.219 | 3.125 | 17.922 | 19.453 | 95.188 | - |
| 100000 | - | 0.719 | 0.672 | 1.390 | - | 1.562 | 2.687 | - | 31.172 | - | - | - | - |
| 1000000 | - | 9.234 | 9.140 | 16.594 | - | 17.484 | 27.390 | - | - | - | - | - | - |

batch

| items | lmdbm | vedis | unqlite | rocksdict | dbm.gnu | semidbm | pysos | dbm.dumb | sqlite-wal | sqlite-autocommit | sqlite-batch | dummypickle | dummyjson |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 0.000 | 0.000 | 0.000 | - | - | - | - | 0 | 0.016 | 0 | 0 | 0 | 0 |
| 100 | 0.000 | 0.000 | 0.000 | - | - | - | - | 0.032 | 0.031 | 0.078 | 0 | 0 | 0 |
| 1000 | 0.000 | 0.015 | 0.000 | - | - | - | - | 0.312 | 0.109 | 0.719 | 0.109 | 0 | 0 |
| 10000 | 0.031 | 0.047 | 0.046 | - | - | - | - | 3.140 | 1.046 | 16.156 | 1.047 | 0 | 0.047 |
| 100000 | 0.422 | 0.609 | 0.594 | - | - | - | - | - | 10.875 | - | 10.985 | 0.375 | 2.859 |
| 1000000 | 4.188 | 8.500 | 8.406 | - | - | - | - | - | - | - | - | 41.760 | - |

read

| items | lmdbm | vedis | unqlite | rocksdict | dbm.gnu | semidbm | pysos | dbm.dumb | sqlite-wal | sqlite-autocommit | sqlite-batch | dummypickle | dummyjson |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 0.000 | 0.000 | 0.000 | 0.031 | - | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 100 | 0.000 | 0.000 | 0.000 | 0.031 | - | 0 | 0 | 0 | 0.031 | 0.016 | 0.031 | 0 | 0 |
| 1000 | 0.000 | 0.000 | 0.000 | 0.046 | - | 0.015 | 0 | 0.078 | 0.218 | 0.297 | 0.282 | 0 | 0 |
| 10000 | 0.047 | 0.047 | 0.047 | 0.093 | - | 0.109 | 0.141 | 0.765 | 2.079 | 2.890 | 3.013 | 0.031 | 0.015 |
| 100000 | 0.610 | 0.641 | 0.640 | 0.735 | - | 1.422 | 1.797 | - | 23.797 | - | 29.375 | 0.187 | 0.218 |
| 1000000 | 6.297 | 8.609 | 8.765 | 20.391 | - | 15.891 | 20.485 | - | - | - | - | 2.141 | - |

combined

| items | lmdbm | vedis | unqlite | rocksdict | dbm.gnu | semidbm | pysos | dbm.dumb | sqlite-wal | sqlite-autocommit | sqlite-batch | dummypickle | dummyjson |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 0.156 | 0.000 | 0.015 | 0.047 | - | 0 | 0 | 0.078 | 0.235 | 0.390 | 0.406 | 0.156 | 0.156 |
| 100 | 0.156 | 0.015 | 0.000 | 0.047 | - | 0.016 | 0 | 0.078 | 0.235 | 0.390 | 0.390 | 0.156 | 0.203 |
| 1000 | 0.156 | 0.000 | 0.000 | 0.047 | - | 0.015 | 0.015 | 0.094 | 0.235 | 0.406 | 0.390 | 0.188 | 0.672 |
| 10000 | 0.156 | 0.015 | 0.015 | 0.047 | - | 0.015 | 0.031 | 0.203 | 0.234 | 0.469 | 0.406 | 0.594 | 5.141 |
| 100000 | 0.171 | 0.000 | 0.000 | 0.047 | - | 0.140 | 0.250 | - | 0.266 | - | 0.484 | 5.547 | 49.750 |
| 1000000 | 0.172 | 0.015 | 0.016 | 0.047 | - | 1.297 | 2.407 | - | - | - | - | 90.718 | - |

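As a rough illustration of the kind of loop behind these timings (not the actual benchmark harness; the backend, filename, and item count are placeholders), a single write-then-read pass can be timed like this:

import dbm.dumb
import time

# Illustrative timing loop only -- not the harness used above.
N = 10_000                          # made-up item count
db = dbm.dumb.open("bench_demo", "n")

start = time.perf_counter()
for i in range(N):                  # "write": store one key/value pair per iteration
    db[f"key-{i}"] = b"x" * 100
print("write:", round(time.perf_counter() - start, 3), "seconds")

start = time.perf_counter()
for i in range(N):                  # "read": fetch every key back
    _ = db[f"key-{i}"]
print("read:", round(time.perf_counter() - start, 3), "seconds")

db.close()
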
rhettinger commented 1 year ago

@rhettinger, would you like to continue the discussion here

If you want to run with this, do whatever makes sense for you. The other issue is mostly stale and likely needs a fresh start. A new first draft is likely just an afternoon project. If you don't want to run with it, let me know and I'll work on it.

I don't know that there needs to be much discussion because we have collections.abc.MutableMapping to structure the implementation and because the APIs for shelve and dbm are already established. This is a clearly desirable buildout that addresses the known issues with dumbdbm. With Sqlite3, we get down to a single file in a standard format, we get ACID guarantees, concurrency and compaction, and we get a dbm that will be present in every build.

If you're interested, here's some code that can be used as a starting point:

import sqlite3
from contextlib import suppress
from collections.abc import MutableMapping

BUILD_TABLE = "CREATE TABLE Dict (key TEXT NOT NULL, value BLOB NOT NULL, PRIMARY KEY (key));"
GET_SIZE = "SELECT COUNT (*) FROM Dict;"
LOOKUP_KEY = "SELECT value FROM Dict WHERE key = ?;"
STORE_KV = "REPLACE INTO Dict (key, value) VALUES (?, ?);"
DELETE_KEY = "DELETE FROM Dict WHERE key = ?;"
ITER_KEYS = "SELECT key FROM Dict;"

class SQLiteDict(MutableMapping):
    "Dict-like key/value store backed by a single SQLite table."

    def __init__(self, dbname=':memory:'):
        # isolation_level=None puts the connection in autocommit mode.
        self.conn = sqlite3.connect(dbname, isolation_level=None)
        self.conn.execute('PRAGMA journal_mode = wal')
        # The table may already exist when reopening an existing file.
        with suppress(sqlite3.OperationalError):
            self.conn.execute(BUILD_TABLE)

    def __len__(self):
        return self.conn.execute(GET_SIZE).fetchone()[0]

    def __getitem__(self, key):
        value_tuple = self.conn.execute(LOOKUP_KEY, (key,)).fetchone()
        if value_tuple is None:
            raise KeyError(key)
        return value_tuple[0]

    def __setitem__(self, key, value):
        # REPLACE handles both insert and update.
        self.conn.execute(STORE_KV, (key, value))
        self.conn.commit()

    def __delitem__(self, key):
        if key not in self:
            raise KeyError(key)
        self.conn.execute(DELETE_KEY, (key,))
        self.conn.commit()

    def __iter__(self):
        curs = self.conn.execute(ITER_KEYS)
        while (value_tuple := curs.fetchone()):
            yield value_tuple[0]
        curs.close()

    def close(self):
        self.conn.close()
        self.conn = None

    def __enter__(self):
        return self

    def __exit__(self, *args):
        self.close()
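
The sketch above behaves like an ordinary mapping; a small, purely illustrative usage example (the filename is made up; keys are stored as TEXT and values as BLOBs per the schema above):

with SQLiteDict("demo.db") as d:        # made-up filename
    d["alpha"] = b"first value"         # REPLACE INTO -> insert
    d["alpha"] = b"updated value"       # REPLACE INTO -> update in place
    print(len(d), list(d))              # 1 ['alpha']
    del d["alpha"]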

Other thoughts:

erlend-aasland commented 1 year ago

If you want to run with this, do whatever makes sense for you. The other issue is mostly stale and likely needs a fresh start.

I prefer closing the old issue; we definitely need a fresh start here. I'll take care of it.

A new first draft is likely just an afternoon project. If you don't want to run with it, let me know and I'll work on it.

I'll give it a shot; it'll be a nice chance for me to get to know the shelve and dbm modules better.

If you're interested, here's some code that can be used as a starting point:

Thanks!

erlend-aasland commented 9 months ago

Finally got back to this; I added @rhettinger's patch, amended it so the generic test_dbm test suite passed, and created PR #114481. Go nuts.

erlend-aasland commented 9 months ago

For flag='n', if there is an existing database file, should we remove it (current behaviour) and create a new one, or should we try to re-use the existing file and only recreate the Dict table if there is one ("DROP TABLE Dict" then "CREATE TABLE ...")? The docs say to "always create a new database", but that may be understood as "always create a new [dbm] database", not necessarily "always create a new database [file]". Let's keep this simple and remove the file (the current behaviour of #114481); the added complexity is not worth it. A sketch of the two options follows.
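
Purely to illustrate the two options discussed above (these helpers are hypothetical and are not the code in #114481):

import os
import sqlite3

BUILD_TABLE = "CREATE TABLE Dict (key TEXT NOT NULL, value BLOB NOT NULL, PRIMARY KEY (key));"

def open_new_by_removing_file(path):
    # Option 1 (chosen): treat flag='n' as "start from a fresh file".
    if os.path.exists(path):
        os.remove(path)
    conn = sqlite3.connect(path)
    conn.execute(BUILD_TABLE)
    return conn

def open_new_by_dropping_table(path):
    # Option 2 (rejected): keep the file, recreate only the Dict table.
    conn = sqlite3.connect(path)
    conn.execute("DROP TABLE IF EXISTS Dict")
    conn.execute(BUILD_TABLE)
    return conn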

AdamWill commented 4 months ago

This seems to have broken the beaker library (https://github.com/bbangert/beaker/issues/242). Its test suite, which passes fine on Python 3.12, fails on Python 3.13 in the dbm-based tests, with errors about sqlite3 thread violations. Isn't it a bit bad if code using dbm that worked fine with the previous default backend fails with the new one?

AdamWill commented 4 months ago

Of course, it's also possible that beaker is simply not handling threads correctly, the other backends just weren't strict enough to notice, and beaker's tests aren't good enough to catch it either. I'm not sure which is the case.
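
For reference, the errors mentioned above are the sqlite3 module's own cross-thread guard; it can be reproduced without beaker or dbm at all (a minimal sketch):

import sqlite3
import threading

conn = sqlite3.connect(":memory:")   # check_same_thread defaults to True

def use_from_other_thread():
    try:
        conn.execute("SELECT 1")
    except sqlite3.ProgrammingError as exc:
        # e.g. "SQLite objects created in a thread can only be used in
        # that same thread."
        print("cross-thread use rejected:", exc)

t = threading.Thread(target=use_from_other_thread)
t.start()
t.join()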