iterative / datachain

AI-data warehouse to enrich, transform and analyze data from cloud storages
https://docs.datachain.ai
Apache License 2.0

Improve management of iterator returned by `collect` (`database table is locked` error) #377

Open shcheklein opened 2 months ago

shcheklein commented 2 months ago

Action Items

Description

Related to https://github.com/iterative/datachain-examples/issues/15

Simple way to reproduce:

from datachain import DataChain

ds = DataChain.from_values(fib=[1, 1, 2, 3, 5, 8])
itr = ds.collect("fib")
print(next(itr))

ds = DataChain.from_values(fib=[1, 1, 2, 3, 5, 8])
ds.show()

This causes something like the following (at least when run within a test suite, since the suite then triggers DB cleanup; the same failure can be reproduced by making the second part a bit more complicated, e.g. by adding a UDF):

>           result = self.db.execute(*self.compile_to_args(query))
E           sqlite3.OperationalError: database table is locked: sqlite_master

/Users/ivan/Projects/dvcx/src/datachain/data_storage/sqlite.py:199: OperationalError

It's caused (I think) by us using:

with super().select(*db_signals).as_iterable() as rows:
    ...
    yield

The cause is an open connection / cursor staying alive underneath the returned iterator. It probably affects other DataChains as well, since we are reusing the same DB.
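
To illustrate the mechanism with plain sqlite3 (a simplified sketch, not datachain code - the path and query are placeholders): a generator that yields from inside a with block keeps the underlying connection alive until the generator is exhausted or explicitly closed.

import sqlite3
from contextlib import closing

def rows(db_path, query):
    # closing() is needed because sqlite3's "with connection" only manages
    # transactions; it does not close the connection
    with closing(sqlite3.connect(db_path)) as conn:
        yield from conn.execute(query)

itr = rows("example.db", "SELECT 1")
print(next(itr))  # the connection and its cursor are still open here...
itr.close()       # ...and are only released once the generator is exhausted or closed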

Workaround

The obvious workaround: use `limit` (plus `offset` if needed) to make sure the query returns only a single element (or however many are needed), and/or materialize the result with `list(...)` or iterate through it completely.
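
For example, with the repro above (a minimal sketch; `limit` and `offset` are the chain methods mentioned):

from datachain import DataChain

ds = DataChain.from_values(fib=[1, 1, 2, 3, 5, 8])

# option 1: limit the chain to the rows that are actually needed, then materialize them
first = list(ds.limit(1).collect("fib"))

# option 2: materialize the whole iterator so the underlying cursor is exhausted and released
all_values = list(ds.collect("fib"))

# no half-consumed iterator is left behind, so other chains can use the same DB
DataChain.from_values(fib=[1, 1, 2, 3, 5, 8]).show()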

mattseddon commented 2 months ago

> We should check why a read-only iterator locks the data for other data chains (it fails on creating tables for a UDF, for example, or on removing tables during cleanup - i.e. on updating SQLite's system tables). Can we make this read-only access non-blocking in the first place?

It doesn't seem like any SQLite connections are opened in read-only mode. They all take both the read and write locks.

I had a look at introducing a `read` method in the `SQLiteDatabaseEngine` that would look something like this:

    @retry_sqlite_locks
    def read(self, query):
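        # try to attach to the shared in-memory DB read-only and without taking locks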
        with sqlite3.connect(
            "file::memory:?mode=ro&nolock=1&cache=shared",
            uri=True,
            detect_types=DETECT_TYPES,
        ) as connection:
            return connection.execute(*self.compile_to_args(query))

Unfortunately, that doesn't seem to work (at least when running as a test), perhaps because you can't open the in-memory database in read-only mode.

> We need to make it a context manager (so that it closes itself and releases the database underneath).

Doesn't this go against the decision that we made earlier to avoid using a session as a context manager?

shcheklein commented 2 months ago

> Doesn't this go against the decision that we made earlier to avoid using a session as a context manager?

It's not about the session, it's only about the iterator returned by `collect` - it should be closable if we can't make the DB work in a mode where there is no lock.
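
From the caller's side, that could look something like this - a sketch only, assuming `collect` keeps returning a plain generator (which already supports `close()`), so `contextlib.closing` would be enough to release the cursor held open by `as_iterable()`:

from contextlib import closing

from datachain import DataChain

ds = DataChain.from_values(fib=[1, 1, 2, 3, 5, 8])

# close() raises GeneratorExit at the yield point, which exits the
# "with ... as_iterable() as rows" block and releases the cursor/connection
with closing(ds.collect("fib")) as itr:
    print(next(itr))

# no dangling cursor is left behind, so UDF table creation and cleanup are not blocked
DataChain.from_values(fib=[1, 1, 2, 3, 5, 8]).show()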