.dump and VACUUM command now works incorrectly because SQLite "confused" about vector index properties: usually indices are "storage-full" and has the data by themself, but vector indices are "storage-less" and offload all storage routines to another tables.
This leads to class of errors when mentioned commands trying to execute same inserts twice: first when shadow table is populated and then when index is created.
This PR aims to solve this issue for .dump and VACUUM command by following tricks (or you may call them "hacks"):
We we completely ignore writes/deletes during VACUUM (which can be detected by mDbFlags & DBFLAG_VACUUM != 0 condition)
In case of conflict in the metadata table during index creation we will deliberately skip refill operation for this vector index. This situation matches the dump specific when tables and their data got imported first, and only then indices got created
Options
Although current solution looks a bit (or a lot?) "hacky" it has several advantages:
All hacks are external from main SQLite code. This means that it will be easier for us to maintain our solution + it has predictable impact on other scenarios not related to vector search (and actually impact to other scenarios is zero!).
.dump will have full information about current DiskANN state - which will be useful for debugging
.dump and VACUUM commands will be faster because DiskANN graph will be already there
There are alternative approach when we can try to make all vector search shadow tables "invisible" for SQLite in .dump and VACUUM processes (like with sqlite_master table). But, implementation of this approach has following downsides:
It will require changes in sqlite source code or some non trivial changes in how we integrate with sqlite (because you can't create table sqlite_libsql_meta_shadow with regular sqlite3_exec without patching sqlite code or switching from sqlite3_exec to something else)
It's hard to predict implication of such decision for future evolution of vector search. SQLite can rely and check invariants related with sqlite_/shadow tables and we can find ourselves fighting with this rules. So, we better to not make this decision without thinking it through thoroughly.
If we will hide shadow tables from .dump and ignore them in VACUUM - then these commands will be slower. Although, this exclusion has nice property that we will rely on the binary formats less and maybe imports will be more "resilient" and safe (especially when export/import will target different libSQL versions).
Changes
Ignore writes if IsVacuum(db) in vectorIndex.c
Skip refill if CONFLICT encountered during index creation
Example of dump result:
PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
CREATE TABLE t(e FLOAT32(2));
INSERT INTO t VALUES(X'0000803f00000040');
INSERT INTO t VALUES(X'0000803f00000040');
INSERT INTO t VALUES(X'0000803f00000040');
INSERT INTO t VALUES(X'0000803f00000040');
INSERT INTO t VALUES(X'0000803f00000040');
CREATE TABLE libsql_vector_meta_shadow ( name TEXT PRIMARY KEY, metadata BLOB ) WITHOUT ROWID;
INSERT INTO libsql_vector_meta_shadow VALUES('t_idx',X'01010000000000000003010000000000000004020000000000000002010000000000000005010000000000000006800000000000000007333333333333f33f08460000000000000009c800000000000000');
CREATE TABLE t_idx_shadow (index_key INTEGER , data BLOB, PRIMARY KEY (index_key));
INSERT INTO t_idx_shadow VALUES(1,X'...');
INSERT INTO t_idx_shadow VALUES(3,X'...');
INSERT INTO t_idx_shadow VALUES(5,X'...');
INSERT INTO t_idx_shadow VALUES(7,X'...');
INSERT INTO t_idx_shadow VALUES(9,X'...');
CREATE INDEX t_idx ON t(libsql_vector_idx(e));
COMMIT;
Context
.dump
andVACUUM
command now works incorrectly becauseSQLite
"confused" about vector index properties: usually indices are "storage-full" and has the data by themself, but vector indices are "storage-less" and offload all storage routines to another tables.This leads to class of errors when mentioned commands trying to execute same inserts twice: first when shadow table is populated and then when index is created.
This PR aims to solve this issue for
.dump
andVACUUM
command by following tricks (or you may call them "hacks"):VACUUM
(which can be detected bymDbFlags & DBFLAG_VACUUM != 0
condition)Options
Although current solution looks a bit (or a lot?) "hacky" it has several advantages:
.dump
will have full information about current DiskANN state - which will be useful for debugging.dump
andVACUUM
commands will be faster because DiskANN graph will be already thereThere are alternative approach when we can try to make all vector search shadow tables "invisible" for SQLite in
.dump
andVACUUM
processes (like withsqlite_master
table). But, implementation of this approach has following downsides:sqlite_libsql_meta_shadow
with regularsqlite3_exec
without patching sqlite code or switching fromsqlite3_exec
to something else)sqlite_
/shadow tables and we can find ourselves fighting with this rules. So, we better to not make this decision without thinking it through thoroughly..dump
and ignore them inVACUUM
- then these commands will be slower. Although, this exclusion has nice property that we will rely on the binary formats less and maybe imports will be more "resilient" and safe (especially when export/import will target different libSQL versions).Changes
IsVacuum(db)
invectorIndex.c
Example of dump result: