tursodatabase / libsql

libSQL is a fork of SQLite that is both Open Source and Open Contributions.
https://turso.tech/libsql
MIT License

Vector search adaptive block #1596

Closed sivukhin closed 2 months ago

sivukhin commented 2 months ago

Context

Currently we hard-code 65536 as the default block size, which wastes space, especially when we are working with relatively low-dimensional embeddings (like 64d).

This PR introduces a max_edges=[int] index parameter and also estimates a reasonable default in case the parameter is omitted.

The logic behind the current estimate is the following:

  1. We don't want uncontrollable write amplification, so we should have a hard upper bound on disk amplification due to neighbour redundancy in every node block. The current hard limit is 50x redundancy compared to the node storage overhead (note that with compression of neighbours this hard limit will automatically adjust and allow us to store more neighbours per node block).
  2. The number of neighbours should grow with the dimension of the embedding space. I don't have references for why something like $D^{\frac{1}{C}}$ (where $D$ is the space dimensionality) must be used, but I have hand-wavy arguments for myself like this: with $D^{\frac{1}{C}}$ neighbours for every vertex we will have $O(D)$ nodes within a $C$-hop radius of any point in our graph (see the short derivation after this list), and with $O(D)$ neighbours (if they are random enough) we have the ability to move in a direction which strictly improves our distance (generally, in a $D$-dimensional metric space you need at least $D+1$ neighbours (a simplex) in order to have an improvement available in every direction). In my experiments $2 \sqrt{D} \ldots 3 \sqrt{D}$ neighbours show pretty good 10-recall@10 values (90%+), so I chose $3 \sqrt{D}$ as the second part of the max_edges estimate.
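
To spell out the arithmetic behind point 2 (a restatement of the hand-wavy argument above, where $C$ is the hop radius):

$$\left(D^{\frac{1}{C}}\right)^{C} = D, \qquad C = 2 \implies D^{\frac{1}{C}} = \sqrt{D},$$

so the experimentally good $2 \sqrt{D} \ldots 3 \sqrt{D}$ range corresponds to $C = 2$ up to a small constant factor.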

Overall, the formula looks like this:

$max\_edges = \min\left( 3 \sqrt{D},\ 50 \cdot \frac{node\_overhead(type, dims)}{edge\_overhead(type, dims)} \right)$
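
A minimal sketch of this estimate in C (the function name, the overhead arguments, and the constant are illustrative stand-ins, not the actual libSQL internals):

```c
#include <math.h>

/* Hard cap on neighbour redundancy per node block (point 1 above):
 * storing edges may cost at most 50x the node storage overhead. */
#define MAX_NEIGHBOUR_REDUNDANCY 50

/* Estimate a default max_edges from the embedding dimensionality and the
 * per-node / per-edge storage costs (in bytes) for the given vector type. */
static int estimate_max_edges(int dims, double node_overhead, double edge_overhead) {
  /* point 2: ~3*sqrt(D) neighbours gave 90%+ 10-recall@10 in experiments */
  int by_dims = (int)(3.0 * sqrt((double)dims));
  /* point 1: respect the hard disk-amplification bound */
  int by_space = (int)(MAX_NEIGHBOUR_REDUNDANCY * node_overhead / edge_overhead);
  return by_dims < by_space ? by_dims : by_space;
}
```

For example, 64-dimensional embeddings get $3 \sqrt{64} = 24$ edges from the dimensional term (when the disk-amplification bound is not the binding one), which is far fewer neighbour slots than the old fixed 65536 block size was laid out for.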

Changes