tursodatabase / libsql

libSQL is a fork of SQLite that is both Open Source and Open Contributions.
https://turso.tech/libsql
MIT License

Vector search adaptive block #1596

Closed sivukhin closed 2 months ago

sivukhin commented 2 months ago

Context

Currently we hard-code 65536 as the default block size, which wastes space, especially when we are working with relatively low-dimensional embeddings (like 64d).

This PR introduces a max_edges=[int] index parameter and also estimates a reasonable default in case the parameter is omitted.

The logic behind the current estimate is the following:

  1. We don't want uncontrollable write amplification, so we should have a hard upper bound on disk amplification due to neighbour redundancy in every node block. The current hard limit is 50x redundancy compared to the node storage overhead (note that with compression of neighbours this hard limit will automatically adjust and allow us to store more neighbours per node block).
  2. The number of neighbours should grow with the dimension of the embedding space. I don't have references for why something like $D^{\frac{1}{C}}$ (where $D$ is the space dimensionality) must be used, but I have hand-wavy arguments for myself like this: with $D^{\frac{1}{C}}$ neighbours for every vertex we will have $O(D)$ nodes within a $C$-hop radius of any point in our graph (see the short derivation after this list), and with $O(D)$ neighbours (if they are random enough) we have the ability to move in a direction which strictly improves our distance (generally, in a $D$-dimensional metric space you need at least $D+1$ neighbours (a simplex) in order to have an improvement available in every direction). In my experiments $2 \sqrt{D} \ldots 3 \sqrt{D}$ neighbours show pretty good 10-recall@10 values (90%+), so I chose $3 \sqrt{D}$ as the second part of the max_edges estimate.
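
To spell out the arithmetic behind point 2 (a restatement of the hand-wavy argument above, where $C$ is the hop radius):

$$\left(D^{\frac{1}{C}}\right)^{C} = D, \qquad C = 2 \implies D^{\frac{1}{C}} = \sqrt{D},$$

so the experimentally good $2 \sqrt{D} \ldots 3 \sqrt{D}$ range corresponds to $C = 2$ up to a small constant factor.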

Overall, the formula looks like this:

$max\_edges = \min\left( 3 \sqrt{D},\ 50 \cdot \frac{node\_overhead(type, dims)}{edge\_overhead(type, dims)} \right)$
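
A minimal sketch of this estimate in C (the function name, the overhead arguments, and the constant are illustrative stand-ins, not the actual libSQL internals):

```c
#include <math.h>

/* Hard cap on neighbour redundancy per node block (point 1 above):
 * storing edges may cost at most 50x the node storage overhead. */
#define MAX_NEIGHBOUR_REDUNDANCY 50

/* Estimate a default max_edges from the embedding dimensionality and the
 * per-node / per-edge storage costs (in bytes) for the given vector type. */
static int estimate_max_edges(int dims, double node_overhead, double edge_overhead) {
  /* point 2: ~3*sqrt(D) neighbours gave 90%+ 10-recall@10 in experiments */
  int by_dims = (int)(3.0 * sqrt((double)dims));
  /* point 1: respect the hard disk-amplification bound */
  int by_space = (int)(MAX_NEIGHBOUR_REDUNDANCY * node_overhead / edge_overhead);
  return by_dims < by_space ? by_dims : by_space;
}
```

For example, 64-dimensional embeddings get $3 \sqrt{64} = 24$ edges from the dimensional term (when the disk-amplification bound is not the binding one), which is far fewer neighbour slots than the old fixed 65536 block size was laid out for.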

Changes