tursodatabase / libsql

libSQL is a fork of SQLite that is both Open Source, and Open Contributions.
https://turso.tech/libsql
MIT License
9.54k stars 252 forks source link

Vector diskann impl #1560

Closed sivukhin closed 2 months ago

sivukhin commented 2 months ago

Context

Third branch in the series for DiskANN implementation. This PR introduce DiskANN algorithm itself :tada:

The algorithm core based on the paper LM-DiskANN: Low Memory Footprint in Disk-Native Dynamic Graph-Based ANN Indexing

Current implementation can be suboptimal in some places due to usage of very trivial data structures (lists / arrays). That's how it was intended - we will improve performance in the subsequent PRs but make this one simpler.

Nevertheless, this PR tries to address most heavy operation in the algorithm - reading blobs from disk - and aims to make as little reads as possible (utilizing sqlite3_blob_reopen if possible).

From the performance perspective rough hierarchy of operation cost looks like this:

  1. read/write diskann node block
  2. distance calculation between two vectors (for example, openai vectors are 1536 dimensions long - so this is clearly very slow operation)
  3. operations with non-compressed vector payload
  4. all other operations

Changes