tursodatabase / libsql

libSQL is a fork of SQLite that is both Open Source, and Open Contributions.
https://turso.tech/libsql
MIT License
9.54k stars 252 forks source link

vector search: speed up random row selection query #1651

Closed sivukhin closed 1 month ago

sivukhin commented 1 month ago

Context

As vector node blocks usually are pretty huge (larger thane page size) - it leads to the situation where main shadow table B-Tree started to be a bit degenerate: all leafs will have only one cell.

This is OK for read workloads - but for random row selection queries this lead to huge overhead as all leaf pages will be read just for the sake of single rowid!

-- for 5k rows B-Tree has depth of 3 and a lot of leaf pages 
-- (>1.5k - just sum up all degrees form second level internal pages)
-- each leaf page has 2-3 cells only most of the space occupied with huge blobs of data
$> SELECT path, ncell FROM dbstat WHERE pagetype = 'internal' AND name = 'x_idx_shadow';
/|3
/000/|452
/001/|446
/002/|382
/003/|383

This PR fixes this behaviour by adding lightweight index over rowid key in the shadow table. Leaf B-Tree pages for this index will contain hundreds of rowids and overhead from reading single leaf page will be relatively small.

-- index b-tree has only 2 levels and more "healthier" and efficient structure
$> SELECT path, ncell FROM dbstat WHERE pagetype = 'internal' AND name = 'x_idx_shadow_idx';
/|13

Note, that sqlite is clever enough to use index instead of base table for our query:

$> EXPLAIN QUERY PLAN SELECT rowid FROM x_idx_shadow LIMIT 1 OFFSET ABS(RANDOM()) % MAX((SELECT COUNT(*) FROM x_idx_shadow), 1);
QUERY PLAN
|--SCALAR SUBQUERY 1
|  `--SCAN x_idx_shadow USING COVERING INDEX x_idx_shadow_idx
`--SCAN x_idx_shadow USING COVERING INDEX x_idx_shadow_idx

Changes