msiemens / tinydb

TinyDB is a lightweight document-oriented database optimized for your happiness :)
https://tinydb.readthedocs.org
MIT License

Info on Concurrency #487

Closed · fny closed this issue 2 years ago

fny commented 2 years ago

It would be great to have info on whether TinyDB supports concurrent reads and writes. This is not clear from the README.

p-baum commented 2 years ago

I would like to know as well please.

VermiIIi0n commented 2 years ago

It would be great to have info on whether tiny supports concurrent reads and writes. This is not clear from the README.

Unfortunately, after reviewing the source code, I don't believe it does.

Concurrent writes/reads will almost certainly corrupt the data.

I also think this is already noted in the docs.

fny commented 2 years ago

Yeah, it's a shame.


msiemens commented 2 years ago

You always can add your own locks (e.g. using Python's locks) to ensure that concurrent writing/reading within the same Python process works correctly. If you have multiple programs, you can add some form of file locking (see e.g. https://stackoverflow.com/questions/489861/locking-a-file-in-python) to tell a process that the file is currently in use.
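To illustrate the first suggestion, here is a minimal sketch of serializing access with a single `threading.Lock` within one process. The `LockedDB` wrapper class is hypothetical (not part of TinyDB), and a plain list stands in for a TinyDB table so the example is self-contained; with TinyDB you would hold the same lock around `db.insert()` / `db.search()` calls.

```python
import threading

class LockedDB:
    """Hypothetical wrapper: serialize all access to a shared store
    with one threading.Lock. With TinyDB, hold the same lock around
    every db.insert()/db.search() call instead."""

    def __init__(self):
        self._lock = threading.Lock()
        self._records = []  # stand-in for a TinyDB table

    def insert(self, doc):
        with self._lock:  # only one thread mutates at a time
            self._records.append(doc)

    def all(self):
        with self._lock:  # reads see a consistent snapshot
            return list(self._records)

db = LockedDB()
threads = [
    threading.Thread(target=lambda: [db.insert({"n": i}) for i in range(100)])
    for _ in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(db.all()))  # 400: no inserts lost
```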

In general, TinyDB doesn't make any assumptions about the data storage mechanism (due to the ability to drop in your own data storage class), so there is no generic locking in TinyDB (because file-based locking might e.g. not work on network file systems or something like S3). Leaving this to the user is the most flexible solution to me even though it requires some work from users who just use the default JSONStorage.
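For the multi-process case, one possible shape of the file-locking approach (following the Stack Overflow link above) is an exclusive `flock()` held for the duration of a read-modify-write cycle. The `locked_json` helper below is a hypothetical sketch operating on a plain JSON file, not TinyDB's actual API; note that `fcntl` is POSIX-only (on Windows you would use `msvcrt.locking` or a library such as portalocker), and as msiemens points out, `flock()` may not work on network file systems.

```python
import fcntl
import json
import os
import tempfile
from contextlib import contextmanager

@contextmanager
def locked_json(path):
    """Hypothetical helper: hold an exclusive POSIX flock() on the
    database file for a whole read-modify-write cycle, so other
    processes using the same helper block until it is released."""
    with open(path, "r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # blocks until exclusive lock
        try:
            data = json.load(f)
            yield data                 # caller mutates the dict
            f.seek(0)
            json.dump(data, f)
            f.truncate()               # drop any leftover old bytes
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)

# usage sketch with a throwaway file
path = os.path.join(tempfile.mkdtemp(), "data.json")
with open(path, "w") as f:
    json.dump({"_default": {}}, f)

with locked_json(path) as db:
    db["_default"]["1"] = {"key": "value"}

with open(path) as f:
    print(json.load(f)["_default"]["1"])  # {'key': 'value'}
```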

FeralRobot commented 1 year ago

Documentation update needed: the docs still list class tinydb.middlewares.ConcurrencyMiddleware(storage_cls) — "Makes TinyDB working with multithreading. Uses a lock so write/read operations are virtually atomic."

msiemens commented 1 year ago

class tinydb.middlewares.ConcurrencyMiddleware(storage_cls) Makes TinyDB working with multithreading. Uses a lock so write/read operations are virtually atomic.

Actually, the ConcurrencyMiddleware has been removed in TinyDB 2.0.0 due to an incorrect implementation (#18).

andryyy commented 10 months ago

I am using TinyDB in a project right now and came across this problem.

Since I'm already using Redis for synchronizing some application states in my cluster, I found it easiest to reuse it for distributed locking.

My application uses a Storage class that's pretty much a copy of JSONStorage with some personal tweaks like automatic backups and custom caching.

In the custom storage's __init__() definition I added this (slightly modified):

...
        self._access_id = kwargs.pop("access_id")
        try:
            # Skip locking if this instance already holds the lock.
            # (Assumes the Redis client was created with
            # decode_responses=True so get() returns str, not bytes.)
            if r.get("DATABASE_LOCK") != self._access_id:
                # Spin until SET NX succeeds; px caps the lock's
                # lifetime so a crashed holder can't block forever.
                while not r.set(
                    "DATABASE_LOCK",
                    self._access_id,
                    px=4000,  # max lock time in ms
                    nx=True,  # only set if the key does not exist
                ):
                    continue
        except Exception:
            r.delete("DATABASE_LOCK")
            raise
...

The close method looks like this:

...
        self._handle.close()
        r.delete("DATABASE_LOCK")
...

TinyDB is used like this:

TINYDB = {
    "storage": JSONStorageLocked,
    "path": "database/data.json",
    "access_id": str(uuid4()),
}
with TinyDB(**defaults.TINYDB) as db:
    ...

Since the storage instance is reinitialized during operations, I couldn't use id(self); instead I define a fixed ID that is reused for the whole context.

At first I tried to hold the lock only inside the read and write definitions, but in stress tests that of course failed sometimes. The application would lock for a read, unlock, then lock again for a write that appends the data. Between the read and the write there is a tiny window in which another worker can lock and write, so I dropped that idea.

Of course I could switch back to that logic and check for the last writer's ID, but I'm fine with it now. :) I also use some caching in Redis, so it does not really impact the performance anyway for the few writes I'm doing.
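The acquire/release flow described above can be sketched end to end. `FakeRedis` below is a hypothetical in-memory stand-in so the logic runs without a server; a real `redis.Redis` client exposes the same `set(..., nx=True, px=...)` / `get` / `delete` signatures. The `acquire` helper and the `DATABASE_LOCK` key name mirror the snippet above but are illustrative, not part of any library.

```python
import time
import uuid

class FakeRedis:
    """In-memory stand-in for a Redis client, just for illustration."""

    def __init__(self):
        self._store = {}  # key -> (value, expiry deadline)

    def set(self, key, value, nx=False, px=None):
        # nx=True: only set if the key is absent (or expired)
        if nx and self.get(key) is not None:
            return None
        deadline = time.monotonic() + (px or 0) / 1000
        self._store[key] = (value, deadline)
        return True

    def get(self, key):
        item = self._store.get(key)
        if item is None or time.monotonic() > item[1]:
            return None  # missing or past its px expiry
        return item[0]

    def delete(self, key):
        self._store.pop(key, None)

def acquire(r, access_id, timeout_ms=4000):
    """Block until this access_id holds DATABASE_LOCK."""
    if r.get("DATABASE_LOCK") == access_id:
        return  # re-entrant: we already hold it
    while not r.set("DATABASE_LOCK", access_id, nx=True, px=timeout_ms):
        time.sleep(0.01)  # back off instead of busy-looping

r = FakeRedis()
me = str(uuid.uuid4())
acquire(r, me)
print(r.get("DATABASE_LOCK") == me)  # True: we hold the lock
r.delete("DATABASE_LOCK")            # release, as in close()
```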