run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: Where do I calculate the similarity between nodes in query and vector db when using faiss? #14400

Open whoo9112 opened 2 months ago

whoo9112 commented 2 months ago


Question

I'm testing Q&A with the faiss vector db in LlamaIndex, and I wonder where the similarity of the top_k chunks is calculated. Can you tell me the function or file where the similarity calculation is performed?

dosubot[bot] commented 2 months ago

When using the Faiss vector database in LlamaIndex, the similarity calculation is performed in the query method of the FaissVectorStore class. Specifically, the calculation is delegated to the call to self._faiss_index.search, so Faiss itself computes the distances (or similarities) according to the index type you built.

Here is the relevant code snippet:

def query(
    self,
    query: VectorStoreQuery,
    **kwargs: Any,
) -> VectorStoreQueryResult:
    """Query index for top k most similar nodes.

    Args:
        query_embedding (List[float]): query embedding
        similarity_top_k (int): top k most similar nodes

    """
    if query.filters is not None:
        raise ValueError("Metadata filters not implemented for Faiss yet.")

    query_embedding = cast(List[float], query.query_embedding)
    query_embedding_np = np.array(query_embedding, dtype="float32")[np.newaxis, :]
    dists, indices = self._faiss_index.search(
        query_embedding_np, query.similarity_top_k
    )
    dists = list(dists[0])
    # if empty, then return an empty response
    if len(indices) == 0:
        return VectorStoreQueryResult(similarities=[], ids=[])

    # returned dimension is 1 x k
    node_idxs = indices[0]

    filtered_dists = []
    filtered_node_idxs = []
    for dist, idx in zip(dists, node_idxs):
        if idx < 0:
            continue
        filtered_dists.append(dist)
        filtered_node_idxs.append(str(idx))
    return VectorStoreQueryResult(
        similarities=filtered_dists, ids=filtered_node_idxs
    )
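To make concrete what self._faiss_index.search returns, here is a minimal numpy-only sketch of what a brute-force Faiss index such as IndexFlatL2 computes: squared-L2 distances between the query and every stored vector, followed by a top-k selection. The toy embeddings below are hypothetical, and this is an illustration of the search semantics rather than the actual Faiss implementation (note that with an inner-product index such as IndexFlatIP, the returned values are similarities where higher is better, not distances).

```python
import numpy as np

# Hypothetical toy data: 4 stored embeddings of dimension 3.
embeddings = np.array(
    [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.9, 0.1, 0.0],
     [0.0, 0.0, 1.0]],
    dtype="float32",
)

# Query shaped (1, d), matching query_embedding_np in query() above.
query = np.array([[1.0, 0.0, 0.0]], dtype="float32")

# Brute-force squared-L2 distance from the query to each stored vector,
# which is what an IndexFlatL2 search computes.
dists = ((embeddings[None, :, :] - query[:, None, :]) ** 2).sum(axis=-1)  # (1, n)

k = 2
order = np.argsort(dists, axis=1)[:, :k]                 # indices of the k nearest
top_dists = np.take_along_axis(dists, order, axis=1)     # their distances

print(order[0])      # -> [0 2]: vectors 0 and 2 are closest to the query
print(top_dists[0])  # their squared-L2 distances, smallest first
```

As in the query method above, the returned arrays have shape 1 x k, which is why the snippet indexes into row 0 before filtering.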

Additionally, the get_top_k_embeddings function in llama_index/legacy/indices/query/embedding_utils.py also performs similarity calculations. This function applies a similarity function (the default, or a custom one you pass in) between the query embedding and each stored embedding, then retrieves the top-k most similar embeddings.

Here is the relevant code snippet:

def get_top_k_embeddings(
    query_embedding: List[float],
    embeddings: List[List[float]],
    similarity_fn: Optional[Callable[..., float]] = None,
    similarity_top_k: Optional[int] = None,
    embedding_ids: Optional[List] = None,
    similarity_cutoff: Optional[float] = None,
) -> Tuple[List[float], List]:
    """Get top nodes by similarity to the query."""
    if embedding_ids is None:
        embedding_ids = list(range(len(embeddings)))

    similarity_fn = similarity_fn or default_similarity_fn

    embeddings_np = np.array(embeddings)
    query_embedding_np = np.array(query_embedding)

    similarity_heap: List[Tuple[float, Any]] = []
    for i, emb in enumerate(embeddings_np):
        similarity = similarity_fn(query_embedding_np, emb)
        if similarity_cutoff is None or similarity > similarity_cutoff:
            heapq.heappush(similarity_heap, (similarity, embedding_ids[i]))
            if similarity_top_k and len(similarity_heap) > similarity_top_k:
                heapq.heappop(similarity_heap)
    result_tups = sorted(similarity_heap, key=lambda x: x[0], reverse=True)

    result_similarities = [s for s, _ in result_tups]
    result_ids = [n for _, n in result_tups]

    return result_similarities, result_ids
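For intuition about what the similarity function computes, here is a small sketch that ranks stored embeddings by cosine similarity, which I believe is the default similarity used by LlamaIndex's default_similarity_fn (the node IDs and vectors below are made-up illustration data, not from the library):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: dot product of the vectors divided by the
    # product of their norms. Assumed here to mirror the library's
    # default similarity function.
    a = np.asarray(a, dtype="float64")
    b = np.asarray(b, dtype="float64")
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = [1.0, 0.0]
stored = {  # hypothetical node-id -> embedding map
    "n1": [1.0, 0.0],
    "n2": [0.0, 1.0],
    "n3": [0.7, 0.7],
}

# Score every stored embedding against the query and sort, best first,
# analogous to the heap-based top-k selection in get_top_k_embeddings.
scored = sorted(
    ((cosine_similarity(query, emb), node_id) for node_id, emb in stored.items()),
    reverse=True,
)
top_2 = scored[:2]
print(top_2)  # n1 is identical to the query, n3 is at 45 degrees
```

The real function uses a bounded heap so it never holds more than similarity_top_k candidates; the full sort here is just the simplest way to show the same ranking.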

These functions are the key parts of the similarity calculation process in LlamaIndex [1][2][3].

To continue talking to Dosu, mention @dosu.