unum-cloud / usearch

Fast Open-Source Search & Clustering engine × for Vectors & 🔜 Strings × in C++, C, Python, JavaScript, Rust, Java, Objective-C, Swift, C#, GoLang, and Wolfram 🔍
https://unum-cloud.github.io/usearch/
Apache License 2.0

Narek/external index storage #335

Open Ngalstyan4 opened 8 months ago

Ngalstyan4 commented 8 months ago

Hi Ashot!

tldr;

This is an attempt to add external storage to USearch to help with our upgrades at Lantern! It allows swapping the storage format (usearch-v2, usearch-v3, lantern-postgres) without touching the core index structures.

As far as I can tell, it has no runtime performance impact.

Would you be open to merging this kind of interface into upstream usearch, or should we maintain it outside?


We have been using a fork of USearch with this kind of external storage at Lantern for about half a year. This is an attempt to upstream it. We have chatted about this before, so some of the points below may be repetition, but I am putting them here for completeness.

Motivation

Currently, the core high-performance implementation of vector search is interwoven with the storage, serialization, and file I/O interfaces. This makes it harder to:

  1. Change the underlying storage.
  2. Change the serialization format (e.g., usearch's planned v2->v3 transition).
  3. Add storage-level features such as neighbor-list compression (motivation: when M in HNSW is large, neighbor lists become a significant portion of the index memory footprint).

One might argue that (1) can be achieved by passing a custom allocator to index_gt or index_dense_gt. This has limitations and did not work for us for two reasons:

  1. (most important) Allocators tie the lifetime of the index memory to the lifetime of index_gt. In Lantern, we are dealing with a persistent index: all changes are saved to Postgres data files and replicated if needed, so the index memory needs to outlive any usearch data structures.
  2. Existing allocator interfaces allow defining allocation logic at memory-type granularity (memory for vectors, memory for nodes, etc.). We needed to partition allocations differently (memory for all components of node i, then node i+1, etc.), as illustrated in the sketch after this list.
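
To make the granularity mismatch concrete, here is a minimal sketch. The type and member names are hypothetical and are not actual usearch signatures; it only contrasts per-memory-type allocators with the per-node, externally-owned placement we need:

```cpp
// Illustrative only: hypothetical names, not usearch's real template parameters.
#include <cstddef>

// Allocator-based customization: one allocator per *kind* of memory,
// and everything it hands out shares the lifetime of the index object.
template <typename node_allocator_at, typename vector_allocator_at>
struct index_with_allocators_sketch {
    node_allocator_at node_allocator_;     // all nodes come from here
    vector_allocator_at vector_allocator_; // all vectors come from here
};

// What a persistent, externally-owned store needs instead: placement per *node*,
// into memory (e.g. a Postgres page) that outlives the index object.
struct external_storage_sketch {
    // Returns a region that will hold every component of node `slot`
    // (level, neighbor lists, vector); the host system owns that memory.
    std::byte* reserve_node(std::size_t slot, std::size_t bytes);
};
```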

The storage interface proposed here helps us achieve the goals above.

Design

This PR adds a storage_at template parameter to the usearch index types; the storage provider implements:

  1. Node and vector allocation and reset
  2. Access management for concurrent node and vector access
  3. Index save/load from a stream
  4. Viewing a memory mapped index
  5. Compile-time exhaustive API type-checking for storage providers

The exact storage layout is opaque to the rest of usearch: all serialization/deserialization logic lives in storage_at, so new storage formats can be implemented without touching the rest of the code. As an example, I implemented a new storage provider in std_storage.hpp that uses C++ standard library containers and stores nodes and vectors adjacent to each other when serializing to a file (similar to the usearch v1 format, but with padding between the node tape and the vector tape so that view() does not result in unaligned memory accesses).
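
To give a feel for the shape of the API, below is a rough sketch of a storage provider covering the five responsibilities listed above. The member names and signatures are illustrative only; the authoritative interface is the one defined at the top of storage.hpp in this PR:

```cpp
// Illustrative sketch of a storage provider; hypothetical names and signatures.
#include <cstddef>
#include <cstdio>
#include <deque>
#include <mutex>
#include <vector>

template <typename key_at, typename slot_at> //
class example_storage {
    std::vector<std::vector<std::byte>> nodes_;   // one buffer per node
    std::vector<std::vector<std::byte>> vectors_; // one buffer per vector
    mutable std::deque<std::mutex> locks_;        // one lock per slot

  public:
    // 1. Node and vector allocation and reset
    std::byte* node_malloc(std::size_t slot, std::size_t bytes);
    std::byte* vector_malloc(std::size_t slot, std::size_t bytes);
    void reset() noexcept;

    // 2. Access management for concurrent node and vector access
    std::unique_lock<std::mutex> node_lock(std::size_t slot) const;

    // 3. Index save/load from a stream
    bool save_to_stream(std::FILE* output) const;
    bool load_from_stream(std::FILE* input);

    // 4. Viewing a memory-mapped index: adopt an externally mapped region
    bool view(std::byte const* mapped_region, std::size_t length);
};

// The PR threads such a provider through the index types as an extra
// storage_at template parameter, roughly:
//   index_gt<..., example_storage<key_t, slot_t>> index;
// (see the diff for the exact parameter order).
```

Because the provider owns both the in-memory layout and the serialization logic, switching between usearch-v2, usearch-v3, or a Postgres-backed layout becomes a matter of supplying a different storage_at.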

The Storage API

I designed the storage API around how the current usearch v2 storage works. I tried to minimize the amount of change in index.hpp and index_dense.hpp to hopefully make reviewing easier. I think the storage interface can be simplified and improved in many ways, especially after a usearch v3 format transition, and I am open to changing the full API, so long as some kind of storage API remains.

NOTE: There is no new logic in this PR. Most of it is just factoring out storage-related interfaces and functions into a separate header.

The storage API is defined at the beginning of storage.hpp and implemented by several storage backends; index_gt and index_dense_gt were modified to use it. I added a helper type-enforcer macro that runs compile-time checks to make sure a provided implementation meets the requirements to be a usearch storage provider, as sketched below.
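
For illustration, such a compile-time check can be built with the detection idiom plus static_assert. The trait and macro names below are hypothetical, not the ones used in the PR:

```cpp
// Illustrative only: one way to enforce a storage interface at compile time.
#include <cstddef>
#include <type_traits>

// Detect whether storage_at exposes node_malloc(slot, bytes).
template <typename storage_at, typename = void>
struct has_node_malloc : std::false_type {};

template <typename storage_at>
struct has_node_malloc<storage_at, //
                       std::void_t<decltype(std::declval<storage_at&>().node_malloc(
                           std::size_t{}, std::size_t{}))>> : std::true_type {};

// One trait + static_assert per required member; the real macro in the PR
// expands such checks for the whole storage interface.
#define ASSERT_IS_STORAGE_PROVIDER(storage_at)                                                     \
    static_assert(has_node_malloc<storage_at>::value,                                              \
                  #storage_at " must provide node_malloc(slot, bytes)")

// Usage:
//   using storage_t = example_storage<int, unsigned>;
//   ASSERT_IS_STORAGE_PROVIDER(storage_t);
```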

Next?

This has some rough edges, most of which are listed below. I will come back and update this if more things come up. Before putting time into those, however, I just wanted to see whether you would be open to merging this into mainline usearch. This would help us at Lantern a lot and would be a big step towards upstream-usearch compatibility for us.

We will likely start using a simplified version of this API from Lantern soon, so we can report back on how well it works for our case.

TODOs