[META] Support Hybrid Disk/Memory Algorithms

jmazanec15 commented 10 months ago

Overview

One common problem around dense vector search is the amount of memory required to implement a solution that delivers high performance and recall. The problem is grounded in the fact that many ANN algorithms require fast, random access to the full floating representations of the vectors as well as auxiliary data structures (in-memory algorithms). HNSW is one example of such an algorithm. While there are techniques, like Product Quantization, that allow the memory footprint of the algorithm to be greatly reduced, it can be difficult to achieve high recall (refer to a blog I wrote awhile ago around IVFPQ vs. HNSW for billion-scale search: https://aws.amazon.com/blogs/big-data/choose-the-k-nn-algorithm-for-your-billion-scale-use-case-with-opensearch/). Anecdotally, users typically want a recall above 0.9, where recall is defined as the ratio of neighbors returned that are ground truth nearest neighbors. Similarly, techniques such as scalar quantization and dimensionality reduction can also be employed, but some users may want a more significant reduction in memory.

Another approach that has gained a lot of interest is utilizing fast access disks (i.e. SSDs) in order to extend memory. I will refer to these algorithms as hybrid algorithms. hybrid algorithms will keep some kind of a small representation of the index in memory and the rest on disk. Then, during search, the algorithms will intelligently select a few items to read from disk with the goal of optimizing query latency/working set size/recall tradeoff. A few popular examples are:

DiskANN — an hybrid graph traversal approach — https://suhasjs.github.io/files/diskann_neurips19.pdf
faiss IVF on disk — mapping the IVF algorithm to store vectors on disk — https://github.com/facebookresearch/faiss/blob/main/benchs/distributed_ondisk/README.md
SPANN — a partitioning based approach similar to IVF — https://arxiv.org/pdf/2111.08566.pdf

For OpenSearch, given the potential benefit, we should investigate adding support. There are several approaches that we can take to add support:

Integrate already existing functionality from faiss
Add new engines to support new algorithms
Enhance existing engines with new/modified algorithms
Do nothing

For each approach, there are also a lot of factors to consider - (integration feasibility into OpenSearch, support for aux features such as filtering, etc.). At the moment, we do not have enough data to figure out what route to go. Therefore, we will need to conduct an investigation into existing approaches. This issue is a general META issue that tracks the project as a whole. All feedback/collaboration is welcome. If there are approaches that you have tried that worked well (or not so well), please share your experiences!

Tasks

(Task list is subject to evolve)

[ ] Research approaches and compile comparison report with respect to performance as well as feasibility wrt OpenSearch [ISSUE: TBD]
[ ] (If applicable) Create RFC suggestion integration plan
[ ] (if applicable) Create proof of concept integration with OpenSearch
[ ] (if applicable) Conduct performance evaluation against OpenSearch
[ ] (if applicable) Productionalize code

Related Issues

related issue: #758

navneet1v commented 10 months ago

@jmazanec15 Thanks for creating this issue. I kind of align on the high level task, but would love to see more breakdown of these tasks with time bound investigation and scope.

Is there any specific feedback you are looking here?

vamshin commented 10 months ago

Please +1 if you wish to have this feature prioritized

jmazanec15 commented 10 months ago

Thanks @navneet1v

but would love to see more breakdown of these tasks with time bound investigation and scope.

Sure, will evolve this as we go. Mainly put out the issue for tracking purposes.

Is there any specific feedback you are looking here?

Not specific feedback - would like to see if anyone has used a disk based approach and wants to share their opinion. This is a tracking issue, not an RFC. There will be more issues to come seeking more direct feedback.

luyuncheng commented 10 months ago

+1 LGTM

samuel-oci commented 9 months ago

adding the below resources, looks like a similar issue came out in Lucene: https://foojay.io/today/jvector-1-0/ https://github.com/apache/lucene/issues/12615

credit to @reta for bringing those to my attention

opensearch-project / k-NN