(Approximate) Nearest Neighbour / Vector Similarity Search

bloodbare commented 4 years ago

Is your feature request related to a problem? Please describe. Nowadays search can be based on both BM25 and semantic search. If a vector were provided at index time, we could weight BM25 against cosine similarity (vector distance) to compute the score. The vector would be provided both at index time and at search time: one document maps to many index entries (one per sentence), while a query carries a single vector.
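A minimal sketch of the weighting described above, assuming a simple linear blend. The function names and the `alpha` parameter are hypothetical and are not part of tantivy. BM25 is unbounded while cosine similarity lies in [-1, 1], so in practice the BM25 score would probably need rescaling before blending.

```rust
/// Cosine similarity of two vectors.
/// Returns 0.0 when the lengths differ or either vector has zero norm.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    if a.len() != b.len() {
        return 0.0;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// One document carries one vector per sentence; the query carries one vector.
/// Takes the best-matching sentence. Negative similarities are clamped to 0.0
/// (an assumption of this sketch), and an empty document scores 0.0.
pub fn best_sentence_similarity(query: &[f32], sentences: &[Vec<f32>]) -> f32 {
    sentences
        .iter()
        .map(|s| cosine_similarity(query, s))
        .fold(0.0, f32::max)
}

/// Linear blend of a BM25 score and a vector similarity.
/// `alpha` = 1.0 means BM25 only, `alpha` = 0.0 means vector similarity only.
pub fn hybrid_score(bm25: f32, similarity: f32, alpha: f32) -> f32 {
    alpha * bm25 + (1.0 - alpha) * similarity
}
```

For example, `hybrid_score(bm25, best_sentence_similarity(&query, &sentences), 0.5)` gives equal weight to both signals.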

Describe the solution you'd like Integrate Faiss (or any other vector search library) into tantivy so that the merge is done inside tantivy at search time.

[Optional] describe alternatives you've considered Right now we have an external vector index alongside tantivy and we merge the results in middleware, but this may be an interesting feature to discuss.

I'm just evaluating if it makes sense.

fulmicoton commented 4 years ago

Using embedding for semantic search is a hot topic.

It might be a bit premature for tantivy right now (I do not know of many successful implementations in the industry), but this is something we might want to revisit later.

If you do not have a very specific use case, let's keep this ticket open for the moment and revisit it later.

bloodbare commented 4 years ago

My initial idea was (from the API point of view) to provide a new field type that indexes into a vector index (like https://docs.rs/crate/hnsw/0.2.0/source/README.md) in the same "transaction", and at search time being able to search by vector similarity. How the vector is computed from the text should be outside of the indexer (imho).
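A rough sketch of what "the same transaction" could mean from the API point of view. Everything here is hypothetical: `HybridIndex` and its methods are not tantivy types, and a brute-force dot-product scan stands in for an HNSW-style index. The only point is that text and vector become searchable together at commit.

```rust
/// Toy in-memory index: each document has text plus a caller-supplied vector.
#[derive(Default)]
pub struct HybridIndex {
    committed: Vec<(String, Vec<f32>)>,
    pending: Vec<(String, Vec<f32>)>,
}

fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

impl HybridIndex {
    /// Buffers a document. The vector is computed outside the indexer.
    pub fn add_document(&mut self, text: &str, vector: Vec<f32>) {
        self.pending.push((text.to_string(), vector));
    }

    /// Publishes all buffered documents (text and vectors) at once.
    pub fn commit(&mut self) {
        self.committed.append(&mut self.pending);
    }

    /// Returns the text of a committed document, if it exists.
    pub fn text(&self, doc_id: usize) -> Option<&str> {
        self.committed.get(doc_id).map(|(text, _)| text.as_str())
    }

    /// Top-k committed documents by dot product with the query vector.
    /// A real implementation would use an approximate index instead of a scan.
    pub fn nearest(&self, query: &[f32], k: usize) -> Vec<(usize, f32)> {
        let mut scored: Vec<(usize, f32)> = self
            .committed
            .iter()
            .enumerate()
            .map(|(doc_id, (_, vector))| (doc_id, dot(query, vector)))
            .collect();
        scored.sort_by(|a, b| b.1.total_cmp(&a.1));
        scored.truncate(k);
        scored
    }
}
```

Documents added but not yet committed are invisible to `nearest`, which mirrors how uncommitted documents are not visible to searchers.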

acertain commented 4 years ago

Another approach (for text) is https://github.com/AdeDZY/DeepCT, which uses ML to weigh terms in documents & queries, and then uses bm25 with those weights.
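To illustrate the idea, here is a sketch of BM25 computed over learned term weights instead of raw term counts. The weights are assumed to come from an external model and to lie in [0, 1]; the scale factor of 100 and all function names are assumptions of this sketch, and the IDF shown is one common variant.

```rust
/// Hypothetical scale factor mapping a predicted weight in [0, 1]
/// to an integer pseudo term frequency that an inverted index can store.
pub const WEIGHT_SCALE: f32 = 100.0;

/// Converts a model-predicted term weight into a pseudo term frequency.
/// Out-of-range weights are clamped to [0, 1] first.
pub fn pseudo_tf(predicted_weight: f32) -> u32 {
    (predicted_weight.clamp(0.0, 1.0) * WEIGHT_SCALE).round() as u32
}

/// Inverse document frequency: ln(1 + (N - df + 0.5) / (df + 0.5)).
pub fn idf(doc_count: u64, doc_freq: u64) -> f32 {
    let n = doc_count as f32;
    let df = doc_freq as f32;
    (1.0 + (n - df + 0.5) / (df + 0.5)).ln()
}

/// BM25 contribution of one term, with `tf` being the pseudo term frequency.
/// `k1` controls saturation and `b` controls length normalization.
pub fn bm25_term(tf: u32, idf: f32, doc_len: f32, avg_doc_len: f32, k1: f32, b: f32) -> f32 {
    let tf = tf as f32;
    let norm = k1 * (1.0 - b + b * doc_len / avg_doc_len);
    idf * tf * (k1 + 1.0) / (tf + norm)
}
```

Because the learned weight is stored where the term frequency would normally go, the existing inverted index and BM25 scorer can be reused unchanged.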

https://microsoft.github.io/msmarco/ is good to keep an eye on. The current full ranking entries seem to be an inverted index with some ML for query expansion or indexing, plus language model re-ranking.

fulmicoton commented 4 years ago

@acertain this sounds super interesting !

ansjsun commented 4 years ago

I am trying to do this: let them work together rather than have one contain the other. But I found that more of the changes may be needed on the Faiss side.

snakeztc commented 3 years ago

I am very interested in this too and would love to work together. I have research experience in both DeepCT and vector-based search.

fulmicoton commented 3 years ago

@snakeztc Could you lead the development of such a feature? Also, DeepCT and dense vector nearest neighbor are very different problems. Which one do you need?

This is a very valuable feature, but it adds a lot of complexity, so the conditions for it to be merged are:

a) it should be behind a feature flag.
b) it should not be half baked.
c) it should be well tested.
d) it should have an actual user.

I can help with the design/code review provided we are shooting for eventually shipping this.

snakeztc commented 3 years ago

That's on my dev list now. I am going to shoot for DeepCT sparse vectors instead of dense vectors, since that can be done by reusing the current inverted index. Any help on design or code review would be greatly appreciated. @fulmicoton

AlexMikhalev commented 3 years ago

I am interested in testing/using it. Any update?

davit-b commented 1 year ago

Any updates on this? With Pinecone being under deathly strain, I've begun looking into this lib and other semantic index libs for a DIY solution

bloodbare commented 1 year ago

We finally implemented an index with a similar approach to tantivy with a variant of HNSW at NucliaDB Vectors

shikhar commented 1 year ago

We are using ANN on Tantivy in prod with an implementation of what Faiss calls IVFFlat. It is not open-source ready at this point. Sharing some details for the curious!

ANN index built offline (separate from Tantivy), with K-Means clustering using linfa-clustering. The ANN index is bincode-serialized and contains the trained centroids and the entityID to clusterID (centroid) assignments. Vector data is serialized separately, with each cluster's vectors stored contiguously.
Serving pods load the index into memory, vector data is memory-mapped.
Tantivy Warmer is used to maintain the ANN index as Tantivy segment-level state -- ClusterId -> [DocId], DocId -> VectorIdx.
Custom AnnQuery (tantivy Query implementation) leverages the warmed state for matching and scoring. It computes the nearest clusters for the query vector, and uses desired probing % parameter to pick the topK clusters to match for its DocSet. The Scorer does a dot product of the query vector with that document's vector. We are able to easily combine the AnnQuery with other clauses for filtering.
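The query-time part of the steps above can be sketched roughly as follows. This is not the implementation described in the comment: the struct and method names are hypothetical, centroids are taken as already trained, everything lives in plain in-memory `Vec`s, and squared Euclidean distance for choosing clusters is an assumption (only the dot-product scoring is stated in the comment).

```rust
/// Toy IVFFlat-style index: trained centroids, cluster -> doc ids, doc -> vector.
pub struct IvfFlat {
    pub centroids: Vec<Vec<f32>>,
    pub cluster_docs: Vec<Vec<usize>>,
    pub vectors: Vec<Vec<f32>>,
}

fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn sq_dist(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

impl IvfFlat {
    /// Picks the clusters to probe: the nearest centroids to the query,
    /// keeping `probe_fraction` of all clusters (at least one if any exist).
    pub fn probe(&self, query: &[f32], probe_fraction: f32) -> Vec<usize> {
        let mut order: Vec<usize> = (0..self.centroids.len()).collect();
        order.sort_by(|&a, &b| {
            sq_dist(query, &self.centroids[a]).total_cmp(&sq_dist(query, &self.centroids[b]))
        });
        let wanted = (probe_fraction * order.len() as f32).ceil() as usize;
        let n_probe = wanted.max(1).min(order.len());
        order.truncate(n_probe);
        order
    }

    /// Matches only documents in the probed clusters and scores each one
    /// by the dot product of the query vector with the document vector.
    pub fn search(&self, query: &[f32], probe_fraction: f32, k: usize) -> Vec<(usize, f32)> {
        let mut hits: Vec<(usize, f32)> = self
            .probe(query, probe_fraction)
            .into_iter()
            .flat_map(|cluster| self.cluster_docs[cluster].iter())
            .map(|&doc| (doc, dot(query, &self.vectors[doc])))
            .collect();
        hits.sort_by(|a, b| b.1.total_cmp(&a.1));
        hits.truncate(k);
        hits
    }
}
```

Raising `probe_fraction` trades speed for recall: at 1.0 every cluster is scanned and the search becomes exhaustive. Filtering by other clauses would amount to intersecting the matched document ids with another document set before scoring.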

bladehliu commented 6 months ago

any updates? is this feature a WIP or in plan?

fulmicoton commented 6 months ago

Not planned at all.

triandco commented 2 months ago

@shikhar, this sounds super interesting, are you still planning to open source this at some point? Would love to have a look if possible.

shikhar commented 2 months ago

@triandco I am not at the same co anymore, so cannot speak to any such plans, but believe it to be unlikely as there were simplifying assumptions made to tailor it to our requirements.

bloodbare commented 2 months ago

At Nuclia we developed an engine using tantivy and our own vector index (nucliadb_node) that also supports knowledge graph indexing. It's robust, and we have been using it in prod for more than a year with great scalability.

https://github.com/nuclia/nucliadb

quickwit-oss / tantivy

(Approximate) Nearest Neighbour / Vector Similarity Search #815