simonw / sqlite-utils

Python CLI utility and library for manipulating SQLite databases
https://sqlite-utils.datasette.io
Apache License 2.0

Filter table by a large bunch of ids #555

Open redraw opened 1 year ago

redraw commented 1 year ago

Hi! This might be a question about both SQLite and sqlite-utils, and you probably have more experience with them than I do.

I have a large bunch of ids, and I'm wondering what the best way to query them is, in terms of performance and, if possible, simplicity.

The naive approach would be something like select * from table where rowid in (?, ?, ?...), but that wouldn't scale once there are more than ~1k ids.
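For example, a naive sketch with sqlite-utils (using a video_id column instead of rowid, to match my table) would be something like:

import sqlite_utils

db = sqlite_utils.Database("videos.db")  # hypothetical path
video_ids = ["abc123", "def456"]  # short list just for illustration

placeholders = ",".join("?" * len(video_ids))
rows = list(db.query(
    f"SELECT * FROM videos WHERE video_id IN ({placeholders})",
    video_ids,
))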

Another approach might be creating a temp table or an in-memory db table, inserting all the ids into it, and then joining against the target table.

I failed to attach an in-memory db, both with sqlite-utils and with plain SQL via execute(), so my closest approach is something like:

def filter_existing_video_ids(video_ids):
    db = get_db()  # contains a "videos" table
    db.execute("CREATE TEMPORARY TABLE IF NOT EXISTS tmp (video_id TEXT NOT NULL PRIMARY KEY)")
    db["tmp"].insert_all([{"video_id": video_id} for video_id in video_ids])
    for row in db["tmp"].rows_where("video_id not in (select video_id from videos)"):
        yield row["video_id"]
    db["tmp"].drop()

That kinda worked, but I couldn't find an option in sqlite-utils's create_table() to mark the table as temporary. Also, the tmp table is not actually dropped at the end, not even by .drop(), despite being created with the TEMPORARY keyword. From what I read, it should be dropped automatically when the connection/session ends.
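A possible workaround I'm considering (a rough, untested sketch) is to create a real TEMPORARY table through the underlying sqlite3 connection (db.conn) and only use sqlite-utils for the main videos table; the temp table lives in the temp schema and should be dropped automatically when the connection closes:

def filter_existing_video_ids(video_ids):
    db = get_db()  # contains a "videos" table, as above
    conn = db.conn  # the underlying sqlite3 connection
    # A real TEMPORARY table: lives in the temp schema, vanishes when the connection closes
    conn.execute(
        "CREATE TEMPORARY TABLE IF NOT EXISTS tmp_ids (video_id TEXT NOT NULL PRIMARY KEY)"
    )
    conn.executemany(
        "INSERT OR IGNORE INTO tmp_ids (video_id) VALUES (?)",
        ((video_id,) for video_id in video_ids),
    )
    cursor = conn.execute(
        "SELECT video_id FROM tmp_ids WHERE video_id NOT IN (SELECT video_id FROM videos)"
    )
    for (video_id,) in cursor:
        yield video_id
    conn.execute("DROP TABLE IF EXISTS tmp_ids")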

chapmanjacobd commented 1 year ago

You may want to try a performance comparison between this and simply selecting all the ids with few constraints and then doing the filtering within Python.

That might seem like a lazy-programmer, inefficient way, but queries with large result sets have a different profile from what databases like SQLite are designed for. That is not to say that SQLite is slow or that Python is always faster, but once you start reading more than ~20% of an index an equilibrium is reached, especially when you add the cost of writing extra temp tables to memory/disk, and especially given the NOT IN style of query...
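As a rough sketch of that Python-side filtering (assuming the same videos table and video_id column, sqlite-utils's db.query(), and a hypothetical filter_missing_video_ids helper):

def filter_missing_video_ids(video_ids, db):
    # Pull every existing id once, then do the "NOT IN" part in Python with a set
    existing = {row["video_id"] for row in db.query("SELECT video_id FROM videos")}
    return [video_id for video_id in video_ids if video_id not in existing]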

You may also try chunking like this:

from typing import Generator

def chunks(lst, n) -> Generator:
    # Yield successive n-sized chunks from lst
    for i in range(0, len(lst), n):
        yield lst[i : i + n]

# Keep each query safely under SQLite's default variable limit
SQLITE_PARAM_LIMIT = 32765

data = []
chunked = chunks(video_ids, SQLITE_PARAM_LIMIT)
for ids in chunked:
    data.extend(
        list(
            db.query(
                "SELECT * FROM videos WHERE id IN ("
                + ",".join(["?"] * len(ids))
                + ")",
                (*ids,),
            )
        )
    )

But that actually won't work directly with your NOT IN requirement: you need to check against the full result set to know whether any given id is missing.
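If you do want to keep the filtering in SQL, one workaround (a rough, untested sketch reusing chunks() and SQLITE_PARAM_LIMIT from above, with a hypothetical helper name) is to flip the query around: use chunked IN queries to find the ids that do exist, then take the set difference in Python:

def filter_missing_video_ids_chunked(video_ids, db):
    existing = set()
    for ids in chunks(list(video_ids), SQLITE_PARAM_LIMIT):
        placeholders = ",".join("?" * len(ids))
        # Each chunked query returns only the ids that are already in the table
        for row in db.query(
            f"SELECT video_id FROM videos WHERE video_id IN ({placeholders})",
            ids,
        ):
            existing.add(row["video_id"])
    return [video_id for video_id in video_ids if video_id not in existing]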

Since you are doing stuff with files/videos in SQLite, you might be interested in my side project: https://github.com/chapmanjacobd/library