snexus / llm-search

Querying local documents, powered by LLM
MIT License

question: how do instructor-large and SPLADE differ in semantic search #70

vikramsoni2 closed this issue 7 months ago

vikramsoni2 commented 7 months ago

Great project! I really like the idea of bringing all the good techniques for building RAG together in one project. I'm trying to implement the same thing using the llama-index library. However, I find that the documents returned by dense + sparse embeddings are almost always the same.

How does your implementation differ when querying SPLADE embeddings + instructor-large embeddings? What I see in the code is that you are performing the same similarity search, and the only difference is in storing the SPLADE embeddings as a sparse matrix.

Can you provide some more insight into the hybrid search? Thanks!

snexus commented 7 months ago

Thanks, looking forward to seeing your implementation when you finish it.

Sparse and dense models produce different types of embeddings, and they arguably become more effective when combined than when relying on either one alone.

SPLADE (sparse embeddings) is, in a sense, an improvement on BM25. It encodes information based on term matching rather than semantic meaning. Dense embeddings, on the other hand, encode semantic meaning. Hybrid search therefore requires generating both sparse and dense embeddings.
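To make the difference concrete, here is a minimal sketch of producing both embedding types for one text. It is not the code from this repo; the model names (`naver/splade-cocondenser-ensembledistil` for SPLADE, `hkunlp/instructor-large` for the dense encoder) are just common choices:

```python
# Sketch only: produce one sparse (SPLADE) and one dense (instructor-large)
# embedding for the same text. Model names are illustrative choices.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer
from InstructorEmbedding import INSTRUCTOR

text = "SPLADE expands query terms; dense models encode overall meaning."

# Sparse: SPLADE assigns a weight to every vocabulary token (~30k dims),
# most of them zero -- a learned, expanded form of term matching.
splade_id = "naver/splade-cocondenser-ensembledistil"
tok = AutoTokenizer.from_pretrained(splade_id)
mlm = AutoModelForMaskedLM.from_pretrained(splade_id)
with torch.no_grad():
    logits = mlm(**tok(text, return_tensors="pt")).logits  # (1, seq_len, vocab)
# Standard SPLADE pooling: log-saturated ReLU, then max over the sequence
sparse_vec = torch.log1p(torch.relu(logits)).max(dim=1).values.squeeze()
print(int((sparse_vec > 0).sum()), "non-zero dims out of", sparse_vec.numel())

# Dense: instructor-large yields a 768-dim semantic vector, guided by an
# instruction prompt.
dense_model = INSTRUCTOR("hkunlp/instructor-large")
dense_vec = dense_model.encode([["Represent the document for retrieval:", text]])[0]
print(dense_vec.shape)  # (768,)
```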

> What I see in the code is that you are performing the same similarity search, and the only difference is in storing the SPLADE embeddings as a sparse matrix.

The similarity search, whether it's cosine or something else, is just a technique for comparing two arbitrary vectors; any other comparison technique could be used here. Since SPLADE embeddings are high-dimensional (one weight per vocabulary token, roughly 30,000 dimensions) and sparse (meaning most of the entries are zero), it is efficient to store them in a sparse matrix. Cosine similarity can then be used to query the most "similar" embeddings from that matrix.
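As an illustration (a sketch, not this repo's storage code): SciPy's CSR format plus scikit-learn's `cosine_similarity`, which accepts sparse inputs directly, is enough to rank documents. The random matrices below just stand in for real SPLADE vectors:

```python
# Sketch: store SPLADE-like vectors in a CSR sparse matrix and rank by
# cosine similarity. Random data stands in for real SPLADE embeddings.
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.metrics.pairwise import cosine_similarity

vocab = 30522  # SPLADE dimension = size of the BERT vocabulary

# 1000 "documents", ~0.3% non-zero entries each, stored as CSR
doc_matrix = sparse_random(1000, vocab, density=0.003, format="csr", random_state=0)
query_vec = sparse_random(1, vocab, density=0.003, format="csr", random_state=1)

# cosine_similarity works on sparse inputs; nothing is densified
scores = cosine_similarity(query_vec, doc_matrix).ravel()  # shape (1000,)
top_k = np.argsort(-scores)[:5]
print(top_k, scores[top_k])
```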

In general, here is how my implementation works:

* Dense embeddings are generated with instructor-large and stored in a vector store.
* SPLADE embeddings are generated for the same document chunks and stored as a sparse matrix.
* At query time, the query is embedded both ways, the most similar chunks are retrieved from each index, and the two result lists are combined.
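The merging step can be done in different ways; as one illustration (not necessarily what this repo does), reciprocal rank fusion (RRF) is a standard way to combine two ranked lists. `dense_ids` and `sparse_ids` below are hypothetical document-id rankings from the two searches:

```python
# Illustrative only: merge dense and sparse result lists with reciprocal
# rank fusion. The input rankings are hypothetical document ids.
from collections import defaultdict

def rrf_merge(dense_ids, sparse_ids, k=60, top_n=5):
    """Score each doc by its summed reciprocal ranks across both lists."""
    scores = defaultdict(float)
    for ranking in (dense_ids, sparse_ids):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(rrf_merge(["d3", "d1", "d7"], ["d1", "d9", "d3"]))  # ['d1', 'd3', 'd9', 'd7']
```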

This article from Pinecone explains SPLADE and how to use it in great detail:

https://www.pinecone.io/learn/splade/

snexus commented 7 months ago

Please reopen if you want to have a follow-up discussion. Thanks.