python / cpython

The Python programming language
https://www.python.org

Add sqlite3 as another possible backing store for the dbm module #100414

Closed: rhettinger closed this 8 months ago

rhettinger commented 1 year ago

Right now we support dbm.ndbm and dbm.gnu, which might or might not be part of a given build. The fallback is the super slow dumbdbm. Now that sqlite3 is part of the standard build, we can do better.

The module docstring says:

Future versions may change the order in which implementations are tested for existence, and add interfaces to other dbm-like implementations.

The future is now. Let's provide a fast, stable, robust, always available alternative.

This can be done with pure Python calls to the existing sqlite3 module, or there can be a C extension that calls the SQLite C API directly.

This would automatically be available to the shelve module, giving us a high-quality, persistent key-value store.
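
Any backend reachable through dbm.open() is picked up by shelve automatically, so the shelve side of this already exists today; a minimal sketch of that API (the filename and stored value are made up for illustration):

import shelve

# Illustrative only: "example_store" is a made-up filename.
# shelve.open() delegates to dbm.open(), so whichever dbm backend is
# available (ndbm, gnu, dumb, or a future sqlite3 backend) is chosen
# automatically.
with shelve.open("example_store") as shelf:
    shelf["config"] = {"retries": 3, "timeout": 2.5}   # values are pickled
    print(shelf["config"]["retries"])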


erlend-aasland commented 1 year ago

See also #48033

erlend-aasland commented 1 year ago

IIRC, I've got a fairly up-to-date branch for this, based on your previous efforts, Raymond.

erlend-aasland commented 1 year ago

@rhettinger, would you like to continue the discussion here or on #48033? There is no use in keeping two identical issues open; let's close one of them.

presidento commented 1 year ago

Here are my 50 cents about this.

TL;DR

In detail

The biggest problem with dbm.dumb (at least for me) is that it is not fail-safe; see the example below. Moreover, this is not mentioned in the documentation. If you kill the Python process, the database will be corrupted every time. (This is because the key file is updated only when keys are added or deleted, and on close.)

import dbm.dumb
import shelve
import itertools

# Open (or create) a dumbdbm database and wrap it in a Shelf.
database = dbm.dumb.open("test", "c")
shelf = shelve.Shelf(database)

if "key" in shelf:
    print("The stored value is: " + shelf["key"])
    input("Press ENTER to continue")
    del shelf["key"]

try:
    # Keep overwriting the same key with ever larger values.
    for index in itertools.count():
        print(f"Persist value {index} to database")
        shelf["key"] = f"value {index} " + ("x" * index)
except KeyboardInterrupt:
    # Ctrl+C lands here; killing the process outright (e.g. SIGKILL)
    # leaves the on-disk files corrupted.
    print("Exiting gracefully. Try killing the Python process.")

Performance measurements

Sorry, on Windows I do not have dbm.gnu. Copied from https://github.com/Dobatymo/lmdb-python-dbm/pull/2#issuecomment-1345534594. The numbers are the time in seconds to process the given number of items; a rough sketch of this kind of timing loop is shown after the tables.

write

| items | lmdbm | vedis | unqlite | rocksdict | dbm.gnu | semidbm | pysos | dbm.dumb | sqlite-wal | sqlite-autocommit | sqlite-batch | dummypickle | dummyjson |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 0.015 | 0.000 | 0.000 | 0.031 | - | 0 | 0 | 0 | 0.015 | 0.015 | 0 | 0.015 | 0 |
| 100 | 0.141 | 0.000 | 0.000 | 0.047 | - | 0 | 0 | 0.031 | 0.031 | 0.109 | 0.094 | 0.156 | 0.172 |
| 1000 | 1.609 | 0.000 | 0.016 | 0.062 | - | 0.016 | 0.031 | 0.328 | 0.328 | 1.015 | 0.937 | 1.781 | 4.250 |
| 10000 | 15.359 | 0.062 | 0.062 | 0.172 | - | 0.140 | 0.250 | 3.219 | 3.125 | 17.922 | 19.453 | 95.188 | - |
| 100000 | - | 0.719 | 0.672 | 1.390 | - | 1.562 | 2.687 | - | 31.172 | - | - | - | - |
| 1000000 | - | 9.234 | 9.140 | 16.594 | - | 17.484 | 27.390 | - | - | - | - | - | - |

batch

| items | lmdbm | vedis | unqlite | rocksdict | dbm.gnu | semidbm | pysos | dbm.dumb | sqlite-wal | sqlite-autocommit | sqlite-batch | dummypickle | dummyjson |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 0.000 | 0.000 | 0.000 | - | - | - | - | 0 | 0.016 | 0 | 0 | 0 | 0 |
| 100 | 0.000 | 0.000 | 0.000 | - | - | - | - | 0.032 | 0.031 | 0.078 | 0 | 0 | 0 |
| 1000 | 0.000 | 0.015 | 0.000 | - | - | - | - | 0.312 | 0.109 | 0.719 | 0.109 | 0 | 0 |
| 10000 | 0.031 | 0.047 | 0.046 | - | - | - | - | 3.140 | 1.046 | 16.156 | 1.047 | 0 | 0.047 |
| 100000 | 0.422 | 0.609 | 0.594 | - | - | - | - | - | 10.875 | - | 10.985 | 0.375 | 2.859 |
| 1000000 | 4.188 | 8.500 | 8.406 | - | - | - | - | - | - | - | - | 41.760 | - |

read

| items | lmdbm | vedis | unqlite | rocksdict | dbm.gnu | semidbm | pysos | dbm.dumb | sqlite-wal | sqlite-autocommit | sqlite-batch | dummypickle | dummyjson |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 0.000 | 0.000 | 0.000 | 0.031 | - | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 100 | 0.000 | 0.000 | 0.000 | 0.031 | - | 0 | 0 | 0 | 0.031 | 0.016 | 0.031 | 0 | 0 |
| 1000 | 0.000 | 0.000 | 0.000 | 0.046 | - | 0.015 | 0 | 0.078 | 0.218 | 0.297 | 0.282 | 0 | 0 |
| 10000 | 0.047 | 0.047 | 0.047 | 0.093 | - | 0.109 | 0.141 | 0.765 | 2.079 | 2.890 | 3.013 | 0.031 | 0.015 |
| 100000 | 0.610 | 0.641 | 0.640 | 0.735 | - | 1.422 | 1.797 | - | 23.797 | - | 29.375 | 0.187 | 0.218 |
| 1000000 | 6.297 | 8.609 | 8.765 | 20.391 | - | 15.891 | 20.485 | - | - | - | - | 2.141 | - |

combined

| items | lmdbm | vedis | unqlite | rocksdict | dbm.gnu | semidbm | pysos | dbm.dumb | sqlite-wal | sqlite-autocommit | sqlite-batch | dummypickle | dummyjson |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 0.156 | 0.000 | 0.015 | 0.047 | - | 0 | 0 | 0.078 | 0.235 | 0.390 | 0.406 | 0.156 | 0.156 |
| 100 | 0.156 | 0.015 | 0.000 | 0.047 | - | 0.016 | 0 | 0.078 | 0.235 | 0.390 | 0.390 | 0.156 | 0.203 |
| 1000 | 0.156 | 0.000 | 0.000 | 0.047 | - | 0.015 | 0.015 | 0.094 | 0.235 | 0.406 | 0.390 | 0.188 | 0.672 |
| 10000 | 0.156 | 0.015 | 0.015 | 0.047 | - | 0.015 | 0.031 | 0.203 | 0.234 | 0.469 | 0.406 | 0.594 | 5.141 |
| 100000 | 0.171 | 0.000 | 0.000 | 0.047 | - | 0.140 | 0.250 | - | 0.266 | - | 0.484 | 5.547 | 49.750 |
| 1000000 | 0.172 | 0.015 | 0.016 | 0.047 | - | 1.297 | 2.407 | - | - | - | - | 90.718 | - |

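As a rough illustration of the kind of loop behind these timings (not the actual benchmark harness; the backend, filename, and item count are placeholders), a single write-then-read pass can be timed like this:

import dbm.dumb
import time

# Illustrative timing loop only -- not the harness used above.
N = 10_000                          # made-up item count
db = dbm.dumb.open("bench_demo", "n")

start = time.perf_counter()
for i in range(N):                  # "write": store one key/value pair per iteration
    db[f"key-{i}"] = b"x" * 100
print("write:", round(time.perf_counter() - start, 3), "seconds")

start = time.perf_counter()
for i in range(N):                  # "read": fetch every key back
    _ = db[f"key-{i}"]
print("read:", round(time.perf_counter() - start, 3), "seconds")

db.close()
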
rhettinger commented 1 year ago

@rhettinger, would you like to continue the discussion here

If you want to run with this, do whatever makes sense for you. The other issue is mostly stale and likely needs a fresh start. A new first draft is likely just an afternoon project. If you don't want to run with it, let me know and I'll work on it.

I don't know that there needs to be much discussion because we have collections.abc.MutableMapping to structure the implementation and because the APIs for shelve and dbm are already established. This is a clearly desirable buildout that addresses the known issues with dumbdbm. With Sqlite3, we get down to a single file in a standard format, we get ACID guarantees, concurrency and compaction, and we get a dbm that will be present in every build.

If you're interested, here's some code that can be used as a starting point:

import sqlite3
from contextlib import suppress
from collections.abc import MutableMapping

BUILD_TABLE = "CREATE TABLE Dict (key TEXT NOT NULL, value BLOB NOT NULL, PRIMARY KEY (key));"
GET_SIZE = "SELECT COUNT (*) FROM Dict;"
LOOKUP_KEY = "SELECT value FROM Dict WHERE key = ?;"
STORE_KV = "REPLACE INTO Dict (key, value) VALUES (?, ?);"
DELETE_KEY = "DELETE FROM Dict WHERE key = ?;"
ITER_KEYS = "SELECT key FROM Dict;"

class SQLiteDict(MutableMapping):
    "Dict-like key/value store backed by a single SQLite table."

    def __init__(self, dbname=':memory:'):
        # isolation_level=None puts the connection in autocommit mode.
        self.conn = sqlite3.connect(dbname, isolation_level=None)
        self.conn.execute('PRAGMA journal_mode = wal')
        # The table may already exist when reopening an existing file.
        with suppress(sqlite3.OperationalError):
            self.conn.execute(BUILD_TABLE)

    def __len__(self):
        return self.conn.execute(GET_SIZE).fetchone()[0]

    def __getitem__(self, key):
        value_tuple = self.conn.execute(LOOKUP_KEY, (key,)).fetchone()
        if value_tuple is None:
            raise KeyError(key)
        return value_tuple[0]

    def __setitem__(self, key, value):
        # REPLACE handles both insert and update.
        self.conn.execute(STORE_KV, (key, value))
        self.conn.commit()

    def __delitem__(self, key):
        if key not in self:
            raise KeyError(key)
        self.conn.execute(DELETE_KEY, (key,))
        self.conn.commit()

    def __iter__(self):
        curs = self.conn.execute(ITER_KEYS)
        while (value_tuple := curs.fetchone()):
            yield value_tuple[0]
        curs.close()

    def close(self):
        self.conn.close()
        self.conn = None

    def __enter__(self):
        return self

    def __exit__(self, *args):
        self.close()
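
The sketch above behaves like an ordinary mapping; a small, purely illustrative usage example (the filename is made up; keys are stored as TEXT and values as BLOBs per the schema above):

with SQLiteDict("demo.db") as d:        # made-up filename
    d["alpha"] = b"first value"         # REPLACE INTO -> insert
    d["alpha"] = b"updated value"       # REPLACE INTO -> update in place
    print(len(d), list(d))              # 1 ['alpha']
    del d["alpha"]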

Other thoughts:

erlend-aasland commented 1 year ago

If you want to run with this, do whatever makes sense for you. The other issue is mostly stale and likely needs a fresh start.

I prefer closing the old issue; we definitely need a fresh start here. I'll take care of it.

A new first draft is likely just an afternoon project. If you don't want to run with it, let me know and I'll work on it.

I'll give it a shot; it'll be a nice chance for me to get to know the shelve and dbm modules better.

If you're interested, here's some code that can be used as a starting point:

Thanks!

erlend-aasland commented 9 months ago

Finally got back to this; I added @rhettinger's patch, amended it so the generic test_dbm test suite passed, and created PR #114481. Go nuts.

erlend-aasland commented 9 months ago

For flag='n', if there is an existing database file, should we remove it (current behaviour) and create a new one, or should we try to re-use the existing file and only recreate the Dict table if there is one ("DROP TABLE Dict" then "CREATE TABLE ...")? The docs say to "always create a new database", but that may be understood as "always create a new [dbm] database", not necessarily "always create a new database [file]". Let's keep this simple and remove the file (the current behaviour of #114481); the added complexity is not worth it. A sketch of the two options follows.
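
Purely to illustrate the two options discussed above (these helpers are hypothetical and are not the code in #114481):

import os
import sqlite3

BUILD_TABLE = "CREATE TABLE Dict (key TEXT NOT NULL, value BLOB NOT NULL, PRIMARY KEY (key));"

def open_new_by_removing_file(path):
    # Option 1 (chosen): treat flag='n' as "start from a fresh file".
    if os.path.exists(path):
        os.remove(path)
    conn = sqlite3.connect(path)
    conn.execute(BUILD_TABLE)
    return conn

def open_new_by_dropping_table(path):
    # Option 2 (rejected): keep the file, recreate only the Dict table.
    conn = sqlite3.connect(path)
    conn.execute("DROP TABLE IF EXISTS Dict")
    conn.execute(BUILD_TABLE)
    return conn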

AdamWill commented 4 months ago

This seems to have broken the beaker library (https://github.com/bbangert/beaker/issues/242). Its test suite, which passes fine on Python 3.12, fails on Python 3.13 in the dbm-based tests, with errors about sqlite3 thread violations. Isn't it a bit bad if code using dbm that worked fine with the previous default backend fails with the new one?

AdamWill commented 4 months ago

Of course, it's also possible that beaker is simply not handling threads correctly, the other backends just weren't strict enough to notice, and beaker's tests aren't good enough to catch it either. I'm not sure which is the case.
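
For reference, the errors mentioned above are the sqlite3 module's own cross-thread guard; it can be reproduced without beaker or dbm at all (a minimal sketch):

import sqlite3
import threading

conn = sqlite3.connect(":memory:")   # check_same_thread defaults to True

def use_from_other_thread():
    try:
        conn.execute("SELECT 1")
    except sqlite3.ProgrammingError as exc:
        # e.g. "SQLite objects created in a thread can only be used in
        # that same thread."
        print("cross-thread use rejected:", exc)

t = threading.Thread(target=use_from_other_thread)
t.start()
t.join()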