juji-io / datalevin

A simple, fast and versatile Datalog database
https://github.com/juji-io/datalevin
Eclipse Public License 1.0
1.13k stars 63 forks source link

Indexing dense numeric vector #145

Open huahaiy opened 2 years ago

huahaiy commented 2 years ago

Add a data type :vec, for indexing dense numeric vectors and search based on similarity.

huahaiy commented 1 year ago

Current state of art is ScaNN, see paper, extended paper, which is built for tensorflow serving and depends on tensorflow.

Prior literature: https://lear.inrialpes.fr/pubs/2011/JDS11/jegou_searching_with_quantization.pdf http://proceedings.mlr.press/v51/guo16a.pdf https://www.amazon.com/Quantization-Compression-Springer-International-Engineering/dp/0792391810

huahaiy commented 1 year ago

ScaNN is fast for it does lookup table in SIMD register: https://medium.com/@kumon/similarity-search-scann-and-4-bit-pq-ab98766b32bd (paper https://arxiv.org/pdf/1704.07355.pdf)

Faiss has better packaging, also, it has PQ4 implementation as well. https://github.com/facebookresearch/faiss/wiki/Indexing-1M-vectors#4-bit-pq-comparison-with-scann

huahaiy commented 1 year ago

For our initial version, https://github.com/nmslib/hnswlib seems to be ideal, for the following reason:

  1. It performs well across the board. https://github.com/erikbern/ann-benchmarks/
  2. It out performs everything else for huge vectors https://ann-benchmarks.com/kosarak-jaccard_10_jaccard.html. The current batch of state of the art libraries optimize for vectors less than 1000 dimensions. But my hunch is that huge vectors (greater than 10k dimensions) are needed to really take advantage of the vector views of semantics (i..e. the so called vector symbolic approach).
  3. It is small and seems to be much easier to integrate than others.
huahaiy commented 1 year ago

Another option, is https://github.com/unum-cloud/usearch

huahaiy commented 3 months ago

Some recent entries linked here: https://github.com/microsoft/DiskANN?tab=readme-ov-file