simonw / sqlite-utils

Python CLI utility and library for manipulating SQLite databases
https://sqlite-utils.datasette.io
Apache License 2.0

Filter table by a large bunch of ids #555

Open redraw opened 1 year ago

redraw commented 1 year ago

Hi! This might be a question about both SQLite and sqlite-utils, and you probably have more experience with them than I do.

I have a large bunch of ids, and I'm wondering what the best way to query them is, in terms of performance and, if possible, simplicity.

The naive approach would be something like select * from table where rowid in (?, ?, ?...), but that wouldn't scale once there are more than ~1k ids.
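For example, a naive sketch with sqlite-utils (using a video_id column instead of rowid, to match my table) would be something like:

import sqlite_utils

db = sqlite_utils.Database("videos.db")  # hypothetical path
video_ids = ["abc123", "def456"]  # short list just for illustration

placeholders = ",".join("?" * len(video_ids))
rows = list(db.query(
    f"SELECT * FROM videos WHERE video_id IN ({placeholders})",
    video_ids,
))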

Another approach might be creating a temp table or an in-memory db table, inserting all the ids into it, and then joining against the target table.

I failed to attach an in-memory db, both with sqlite-utils and with plain SQL via execute(), so my closest approach is something like:

def filter_existing_video_ids(video_ids):
    db = get_db()  # contains a "videos" table
    db.execute("CREATE TEMPORARY TABLE IF NOT EXISTS tmp (video_id TEXT NOT NULL PRIMARY KEY)")
    db["tmp"].insert_all([{"video_id": video_id} for video_id in video_ids])
    for row in db["tmp"].rows_where("video_id not in (select video_id from videos)"):
        yield row["video_id"]
    db["tmp"].drop()

That kinda worked, but I couldn't find an option in sqlite-utils's create_table() to mark the table as temporary. Also, the tmp table is not actually dropped at the end, not even by .drop(), despite being created with the TEMPORARY keyword. From what I read, it should be dropped automatically when the connection/session ends.
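A possible workaround I'm considering (a rough, untested sketch) is to create a real TEMPORARY table through the underlying sqlite3 connection (db.conn) and only use sqlite-utils for the main videos table; the temp table lives in the temp schema and should be dropped automatically when the connection closes:

def filter_existing_video_ids(video_ids):
    db = get_db()  # contains a "videos" table, as above
    conn = db.conn  # the underlying sqlite3 connection
    # A real TEMPORARY table: lives in the temp schema, vanishes when the connection closes
    conn.execute(
        "CREATE TEMPORARY TABLE IF NOT EXISTS tmp_ids (video_id TEXT NOT NULL PRIMARY KEY)"
    )
    conn.executemany(
        "INSERT OR IGNORE INTO tmp_ids (video_id) VALUES (?)",
        ((video_id,) for video_id in video_ids),
    )
    cursor = conn.execute(
        "SELECT video_id FROM tmp_ids WHERE video_id NOT IN (SELECT video_id FROM videos)"
    )
    for (video_id,) in cursor:
        yield video_id
    conn.execute("DROP TABLE IF EXISTS tmp_ids")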

chapmanjacobd commented 1 year ago

You may want to try a performance comparison between this and simply selecting all the ids with few constraints and then doing the filtering within Python.

That might seem like a lazy-programmer, inefficient way, but queries with large result sets have a different profile from what databases like SQLite are designed for. That is not to say that SQLite is slow or that Python is always faster, but once you start reading more than ~20% of an index an equilibrium is reached, especially when you add the cost of writing extra temp tables to memory/disk, and especially given the NOT IN style of query...
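As a rough sketch of that Python-side filtering (assuming the same videos table and video_id column, sqlite-utils's db.query(), and a hypothetical filter_missing_video_ids helper):

def filter_missing_video_ids(video_ids, db):
    # Pull every existing id once, then do the "NOT IN" part in Python with a set
    existing = {row["video_id"] for row in db.query("SELECT video_id FROM videos")}
    return [video_id for video_id in video_ids if video_id not in existing]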

You may also try chunking like this:

from typing import Generator

def chunks(lst, n) -> Generator:
    # Yield successive n-sized chunks from lst
    for i in range(0, len(lst), n):
        yield lst[i : i + n]

# Keep each query safely under SQLite's default variable limit
SQLITE_PARAM_LIMIT = 32765

data = []
chunked = chunks(video_ids, SQLITE_PARAM_LIMIT)
for ids in chunked:
    data.extend(
        list(
            db.query(
                "SELECT * FROM videos WHERE id IN ("
                + ",".join(["?"] * len(ids))
                + ")",
                (*ids,),
            )
        )
    )

But that actually won't work directly with your NOT IN requirement: you need to check against the full result set to know whether any given id is missing.
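If you do want to keep the filtering in SQL, one workaround (a rough, untested sketch reusing chunks() and SQLITE_PARAM_LIMIT from above, with a hypothetical helper name) is to flip the query around: use chunked IN queries to find the ids that do exist, then take the set difference in Python:

def filter_missing_video_ids_chunked(video_ids, db):
    existing = set()
    for ids in chunks(list(video_ids), SQLITE_PARAM_LIMIT):
        placeholders = ",".join("?" * len(ids))
        # Each chunked query returns only the ids that are already in the table
        for row in db.query(
            f"SELECT video_id FROM videos WHERE video_id IN ({placeholders})",
            ids,
        ):
            existing.add(row["video_id"])
    return [video_id for video_id in video_ids if video_id not in existing]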

Since you are doing stuff with files/videos in SQLite, you might be interested in my side project: https://github.com/chapmanjacobd/library