unum-cloud / usearch

Fast Open-Source Search & Clustering engine × for Vectors & 🔜 Strings × in C++, C, Python, JavaScript, Rust, Java, Objective-C, Swift, C#, GoLang, and Wolfram 🔍
https://unum-cloud.github.io/usearch/
Apache License 2.0

Narek/external index storage #335

Open Ngalstyan4 opened 10 months ago

Ngalstyan4 commented 10 months ago

Ashot jan Hi!

tldr;

This is an attempt to add external index storage to USearch to help with our upgrades at Lantern! It allows swapping the storage format (usearch-v2, usearch-v3, lantern-postgres) without touching the core index structures.

As far as I can tell it does not have a runtime performance impact.

Would you be open to merging this kind of interface into upstream usearch or should we maintain it outside?


We have been using a fork of USearch with this kind of external storage for about half a year at Lantern. This is an attempt to upstream it. We have chatted about this before, so some of the points below may be repetition, but I am putting them here for completeness.

Motivation

Currently, the core high-performance implementation of vector search is interwoven with the storage, serialization, and file I/O interfaces. This makes it harder to:

  1. Change the underlying storage.
  2. Change the serialization format (e.g. usearch's planned v2->v3 transition)
  3. Add storage-level features such as neighbor-list compression (motivation: when M in HNSW is large, neighbor lists become a significant portion of the index's memory footprint)

One might argue that (1) can be achieved by passing a custom allocator to index_gt or index_dense_gt. This has limitations and did not work for us for two reasons:

  1. (most important) Allocators tie the lifetime of the index memory to the lifetime of index_gt. In Lantern we are dealing with a persistent index: all changes are saved to Postgres data files and replicated if needed, so the index memory needs to outlive any usearch data structures.
  2. Existing allocator interfaces allow defining allocation logic at memory-type granularity (memory for vectors, memory for nodes, etc.). We needed a different kind of partitioning (memory for all components of node i, then node i+1, etc.).

The storage interface proposed here helps us achieve the goals above.

Design

This PR adds a storage_at template parameter to the usearch index types, which implements the responsibilities below; a rough sketch of the resulting interface shape follows the list.

  1. Node and vector allocation and reset
  2. Access management for concurrent node and vector access
  3. Index save/load from a stream
  4. Viewing a memory-mapped index
  5. Compile-time exhaustive API type-checking for storage providers
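
To make that shape concrete, here is a minimal sketch of what a storage provider could look like. The member names and signatures (`node_make`, `node_lock`, `reset`, `view`, etc.) are illustrative assumptions for this write-up, not the exact API in storage.hpp:

```cpp
#include <cstddef>
#include <memory>

// Placeholder aliases standing in for the corresponding usearch types.
using byte_t = char;
struct memory_mapped_file_t {}; // stand-in for a memory-mapped file handle

// Hypothetical storage provider: names and signatures are illustrative only.
template <typename node_t, typename allocator_at = std::allocator<byte_t>>
class example_storage {
  public:
    // 1. Node and vector allocation and reset
    node_t node_make(std::size_t slot, std::size_t level) noexcept;
    byte_t* vector_make(std::size_t slot, byte_t const* data, std::size_t bytes) noexcept;
    void reset() noexcept;

    // 2. Access management for concurrent node and vector access
    struct lock_t { /* RAII guard over a per-slot mutex or spinlock */ };
    lock_t node_lock(std::size_t slot) const noexcept;

    // 3. Index save/load through caller-supplied stream callbacks
    template <typename output_callback_at> bool save_to_stream(output_callback_at&& write) const noexcept;
    template <typename input_callback_at> bool load_from_stream(input_callback_at&& read) noexcept;

    // 4. Viewing a memory-mapped index without copying it into RAM
    bool view(memory_mapped_file_t const& file, std::size_t offset) noexcept;
};
```

index_gt and index_dense_gt are then parameterized over such a provider through the new storage_at argument, so swapping the storage format means swapping a single type rather than editing the core index code.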

The exact storage layout is opaque to the rest of usearch: all serialization/deserialization logic lives in storage_at, so new storage formats can be implemented without touching the rest of the code. As an example, I implemented a new storage provider in std_storage.hpp that uses C++ standard library containers and stores nodes and vectors adjacent to each other when serializing to a file (similar to the usearch v1 format, but with padding between the node tape and the vector tape so that view() does not result in unaligned memory accesses).
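
The alignment point amounts to padding the node tape out to an alignment boundary before the vector tape is written. A toy sketch, assuming a 64-byte boundary (the actual constant in std_storage.hpp may differ):

```cpp
#include <cstddef>

// Assumed alignment for the vector tape; the real value in std_storage.hpp may differ.
constexpr std::size_t vector_alignment_k = 64;

// Round the node-tape length up so the vector tape that follows starts aligned,
// and a view()-ed (memory-mapped) file never yields misaligned vector loads.
inline std::size_t padded_node_tape_bytes(std::size_t node_tape_bytes) noexcept {
    std::size_t remainder = node_tape_bytes % vector_alignment_k;
    return remainder ? node_tape_bytes + (vector_alignment_k - remainder) : node_tape_bytes;
}

// Example: a 1000-byte node tape is padded to 1024 bytes,
// so the vector tape begins at a 64-byte-aligned offset in the file.
```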

The Storage API

I designed the storage API around how the current usearch v2 storage works. I tried to minimize the amount of change in index.hpp and index_dense.hpp to hopefully make reviewing easier. I think the storage interface can be simplified and improved in many ways, especially after a usearch v3 format transition. I am open to changing the full API, so long as there is some kind of storage API.

NOTE: There is no new logic in this PR; most of it is just factoring out storage-related interfaces and functions into a separate header.

The storage API is defined at the beginning of storage.hpp and implemented by several storage backends; index_gt and index_dense_gt were modified to use it. I added a helper type-enforcer macro that runs compile-time checks to make sure the provided type meets the necessary interface requirements to be a usearch storage provider.
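
By way of illustration, one way such compile-time enforcement could be expressed is the C++17 detection idiom below; the checked member names reuse the hypothetical ones from the earlier sketch, and the real macro in the PR may look quite different:

```cpp
#include <type_traits>
#include <utility>

// Standard C++17 detection idiom: is_detected<T, Op> is true_type iff Op<T> is well-formed.
template <typename, template <typename> class, typename = void>
struct is_detected : std::false_type {};
template <typename T, template <typename> class Op>
struct is_detected<T, Op, std::void_t<Op<T>>> : std::true_type {};

// Expressions every storage provider is expected to support (illustrative names):
template <typename storage_at>
using has_node_lock_t = decltype(std::declval<storage_at const&>().node_lock(std::size_t{}));
template <typename storage_at>
using has_reset_t = decltype(std::declval<storage_at&>().reset());

// Fails compilation with a descriptive message if a required member is missing.
#define ASSERT_VALID_STORAGE(storage_at)                                           \
    static_assert(is_detected<storage_at, has_node_lock_t>::value,                 \
                  #storage_at " must expose node_lock(std::size_t)");              \
    static_assert(is_detected<storage_at, has_reset_t>::value,                     \
                  #storage_at " must expose reset()")

// Usage, e.g. right after a provider's definition:
// ASSERT_VALID_STORAGE(example_storage<int>);
```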

Next?

This has some rough edges, most of which are listed below. I will come back and update this if more things come up. Before putting time into those, however, I just wanted to see whether you would be open to merging this into mainline usearch. This would help us at Lantern a lot and would be a big step towards upstream usearch compatibility for us.

We will likely start using a simplified version of this API from Lantern soon, so we can report back on how well it works for our case.

TODOs