run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: Cannot retrieve relevant documents with faiss #13686

Closed SchweitzerGAO closed 5 months ago

SchweitzerGAO commented 5 months ago

Question

The main problem is that no matter which metric is used (we tried L2 and inner product), which algorithm is leveraged (we tried brute force and HNSW), or which embedding model is utilized (we tried Jina and BCE-embedding) when indexing and retrieving, the most relevant document is either not retrieved at all or is retrieved with a low score (or a far distance). I am just wondering why.

dosubot[bot] commented 5 months ago

The issue you're experiencing with Faiss not retrieving the most relevant documents could be due to an incomplete or incorrect mapping between document IDs and their corresponding texts. Ensure that your `id_to_text_map` dictionary contains every document ID that the Faiss index might return. Here is a relevant code snippet from the `FaissReader` class, which raises an error if a document ID is not found in `id_to_text_map`:

```python
def load_data(
    self,
    query: np.ndarray,
    id_to_text_map: Dict[str, str],
    k: int = 4,
    separate_documents: bool = True,
) -> List[Document]:
    # Search the Faiss index for the k nearest neighbors of each query vector.
    dists, indices = self._index.search(query, k)
    documents = []
    for qidx in range(indices.shape[0]):
        for didx in range(indices.shape[1]):
            doc_id = indices[qidx, didx]
            # Every ID returned by Faiss must have a matching text entry.
            if doc_id not in id_to_text_map:
                raise ValueError(
                    f"Document ID {doc_id} not found in id_to_text_map."
                )
            text = id_to_text_map[doc_id]
            documents.append(Document(text=text))

    if not separate_documents:
        # Collapse all hits into a single Document.
        text_list = [doc.get_content() for doc in documents]
        text = "\n\n".join(text_list)
        documents = [Document(text=text)]

    return documents
```
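
For illustration, here is a minimal sketch of building such a map with positional IDs (the dimension, texts, and random vectors below are placeholders, not taken from this thread):

```python
import faiss
import numpy as np
from llama_index.readers.faiss import FaissReader

d = 768  # set to your embedding model's output dimension
texts = ["first document ...", "second document ...", "third document ..."]
embeddings = np.random.rand(len(texts), d).astype("float32")  # stand-in for real embeddings

index = faiss.IndexFlatL2(d)  # brute-force L2, as in the question
index.add(embeddings)

# Faiss returns positional integer IDs (0, 1, 2, ...), so the map must
# cover every position that was added to the index.
id_to_text_map = {i: text for i, text in enumerate(texts)}

reader = FaissReader(index)
query = np.random.rand(1, d).astype("float32")  # stand-in for an embedded query
documents = reader.load_data(query=query, id_to_text_map=id_to_text_map, k=2)
```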

Additionally, note that the `FaissVectorStore` class does not store text within the Faiss index itself. Only the embeddings are stored, and the corresponding text needs to be managed separately. Here is a relevant snippet from the `FaissVectorStore` class:

```python
class FaissVectorStore(BasePydanticVectorStore):
    # Text is NOT stored in Faiss; only the embeddings are.
    stores_text: bool = False

    _faiss_index = PrivateAttr()

    def __init__(
        self,
        faiss_index: Any,
    ) -> None:
        import faiss

        self._faiss_index = cast(faiss.Index, faiss_index)
        super().__init__()

    def add(
        self,
        nodes: List[BaseNode],
        **add_kwargs: Any,
    ) -> List[str]:
        new_ids = []
        for node in nodes:
            text_embedding = node.get_embedding()
            text_embedding_np = np.array(text_embedding, dtype="float32")[np.newaxis, :]
            # IDs are positional: the next ID is the current index size.
            new_id = str(self._faiss_index.ntotal)
            self._faiss_index.add(text_embedding_np)
            new_ids.append(new_id)
        return new_ids

    @property
    def client(self) -> Any:
        """Return the underlying faiss.Index."""
        return self._faiss_index
```

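Because `stores_text` is `False`, the node text lives in the docstore rather than in Faiss when you use this store through an index. A minimal end-to-end sketch (the dimension, the `./data` directory, and the `IndexFlatL2` choice are assumptions; swap in your own):

```python
import faiss
from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.vector_stores.faiss import FaissVectorStore

d = 768  # must match your embedding model's output dimension
vector_store = FaissVectorStore(faiss_index=faiss.IndexFlatL2(d))
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Text and metadata go to the docstore; only embeddings go to Faiss.
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)

retriever = index.as_retriever(similarity_top_k=2)
nodes = retriever.retrieve("your query here")
```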

Additionally, consider the following best practices:

  1. Embedding Model: Ensure the embedding model is well suited to your domain and language, and that queries and documents are embedded with the same model.
  2. Distance Metric and Algorithm: Choose the appropriate distance metric (L2, inner product, cosine) and algorithm (brute force, HNSW) based on your requirements; note that cosine similarity in Faiss is typically implemented as inner product over L2-normalized vectors.
  3. Index Configuration: Configure HNSW parameters such as ef_construction, ef, and M to balance accuracy, speed, and memory usage (see the sketch after this list).
  4. Preprocessing: Ensure documents are properly cleaned and chunked before adding them to the Faiss index.
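
As a sketch of point 3, raw Faiss exposes these HNSW knobs as `efConstruction` and `efSearch` on the index's `hnsw` object (the values below are illustrative starting points, not tuned recommendations):

```python
import faiss

d = 768  # embedding dimension
M = 32   # graph connectivity: more links -> better recall, more memory
index = faiss.IndexHNSWFlat(d, M)
index.hnsw.efConstruction = 200  # build-time candidate list size (higher = better graph, slower build)
index.hnsw.efSearch = 64         # query-time candidate list size (higher = better recall, slower search)
```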

By following these steps and ensuring your id_to_text_map is correctly populated, you should be able to retrieve relevant documents more accurately using Faiss.

#### Details

**Sources**

- [llama-index-legacy/llama_index/legacy/readers/faiss.py](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/readers/faiss.py)
- [llama-index-integrations/readers/llama-index-readers-faiss/llama_index/readers/faiss/base.py](https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/readers/llama-index-readers-faiss/llama_index/readers/faiss/base.py)
- [llama-index-legacy/llama_index/legacy/vector_stores/faiss.py](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/vector_stores/faiss.py)
- [llama-index-integrations/vector_stores/llama-index-vector-stores-faiss/llama_index/vector_stores/faiss/base.py](https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/vector_stores/llama-index-vector-stores-faiss/llama_index/vector_stores/faiss/base.py)
- [llama-index-core/llama_index/core/indices/base.py](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/indices/base.py)
- [llama-index-core/llama_index/core/embeddings/utils.py](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/embeddings/utils.py)
- [llama-index-legacy/llama_index/legacy/embeddings/bedrock.py](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/embeddings/bedrock.py)
- [docs/docs/module_guides/models/embeddings.md](https://github.com/run-llama/llama_index/blob/main/docs/docs/module_guides/models/embeddings.md)
- [llama-index-integrations/readers/llama-index-readers-preprocess/llama_index/readers/preprocess/base.py](https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/readers/llama-index-readers-preprocess/llama_index/readers/preprocess/base.py)
- [llama-index-integrations/vector_stores/llama-index-vector-stores-docarray/llama_index/vector_stores/docarray/hnsw.py](https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/vector_stores/llama-index-vector-stores-docarray/llama_index/vector_stores/docarray/hnsw.py)
- [llama-index-integrations/embeddings/llama-index-embeddings-jinaai/llama_index/embeddings/jinaai/base.py](https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/embeddings/llama-index-embeddings-jinaai/llama_index/embeddings/jinaai/base.py)
**About Dosu**

This response is meant to be useful and save you time. It is **not meant to be a precise solution**, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

To continue the conversation, mention @dosu.

SchweitzerGAO commented 5 months ago

@dosu We finally figured out that metadata is embedded by default, which influences retrieval. A possible solution is to set `excluded_embed_metadata_keys` on each `TextNode` to a list of the metadata fields you do not want to embed.
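
For anyone who hits the same issue, a minimal sketch of that fix (the metadata keys here are hypothetical examples; use your own):

```python
from llama_index.core.schema import MetadataMode, TextNode

node = TextNode(
    text="The actual document content to embed.",
    metadata={"file_name": "report.pdf", "page_label": "12"},
    # Keep these metadata fields out of the text that gets embedded:
    excluded_embed_metadata_keys=["file_name", "page_label"],
)

# Inspect exactly what the embedding model will see:
print(node.get_content(metadata_mode=MetadataMode.EMBED))
```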