SchweitzerGAO closed 5 months ago
The issue you're experiencing with Faiss not retrieving the most relevant documents could be due to an incomplete or incorrect mapping between document IDs and their corresponding texts. Ensure that your `id_to_text_map` dictionary contains all the document IDs that the Faiss index might return. Here is a relevant code snippet from the `FaissReader` class that raises an error if a document ID is not found in the `id_to_text_map`:

    def load_data(
        self,
        query: np.ndarray,
        id_to_text_map: Dict[str, str],
        k: int = 4,
        separate_documents: bool = True,
    ) -> List[Document]:
        dists, indices = self._index.search(query, k)
        documents = []
        for qidx in range(indices.shape[0]):
            for didx in range(indices.shape[1]):
                doc_id = indices[qidx, didx]
                if doc_id not in id_to_text_map:
                    raise ValueError(
                        f"Document ID {doc_id} not found in id_to_text_map."
                    )
                text = id_to_text_map[doc_id]
                documents.append(Document(text=text))

        if not separate_documents:
            text_list = [doc.get_content() for doc in documents]
            text = "\n\n".join(text_list)
            documents = [Document(text=text)]

        return documents

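As a rough illustration of that check, the sketch below scans a batch of search results for IDs that have no entry in the map before any text lookup happens. It assumes the index was filled with plain `add` calls, so Faiss labels vectors by insertion position (0, 1, 2, ...) and the map is keyed by those integers. Note that an integer label never matches a string key such as `"3"`, even though the signature above annotates the map as `Dict[str, str]`. The helper name `find_unmapped_ids` and the sample texts are made up for this example, and the nested list stands in for the `indices` array returned by `search`.

```python
from typing import Dict, Iterable, Set


def find_unmapped_ids(
    indices: Iterable[Iterable[int]],
    id_to_text_map: Dict[int, str],
) -> Set[int]:
    """Return every ID in `indices` that has no entry in `id_to_text_map`."""
    missing: Set[int] = set()
    for row in indices:
        for raw_id in row:
            doc_id = int(raw_id)  # also turns numpy integers into plain ints
            if doc_id == -1:
                # Faiss pads a row with -1 when it finds fewer than k neighbours.
                continue
            if doc_id not in id_to_text_map:
                missing.add(doc_id)
    return missing


# Hypothetical corpus: keys are insertion positions, matching sequential Faiss labels.
texts = [
    "Faiss stores vectors only.",
    "HNSW is a graph-based index.",
    "BM25 is a lexical ranker.",
]
id_to_text_map = {position: text for position, text in enumerate(texts)}

# Stand-in for the `indices` array of one query with k=4.
search_result = [[2, 0, 5, -1]]
missing = find_unmapped_ids(search_result, id_to_text_map)
print(missing)
```

If the returned set is non-empty, the map and the index are out of sync, which is worth ruling out before tuning anything else.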
Additionally, note that in the `FaissVectorStore` class, text is not stored within the Faiss index itself. Instead, only the embeddings are stored, and the corresponding text needs to be managed separately. Ensure that your `id_to_text_map` is correctly populated with all possible document IDs that your Faiss index might return. Here is a relevant snippet from the `FaissVectorStore` class:

    class FaissVectorStore(BasePydanticVectorStore):
        stores_text: bool = False
        _faiss_index = PrivateAttr()

        def __init__(
            self,
            faiss_index: Any,
        ) -> None:
            import faiss

            self._faiss_index = cast(faiss.Index, faiss_index)
            super().__init__()

        def add(
            self,
            nodes: List[BaseNode],
            **add_kwargs: Any,
        ) -> List[str]:
            new_ids = []
            for node in nodes:
                text_embedding = node.get_embedding()
                text_embedding_np = np.array(text_embedding, dtype="float32")[np.newaxis, :]
                new_id = str(self._faiss_index.ntotal)
                self._faiss_index.add(text_embedding_np)
                new_ids.append(new_id)
            return new_ids

        @property
        def client(self) -> Any:
            return self._faiss_index

Make sure that your `id_to_text_map` is correctly populated with all possible document IDs that your Faiss index might return.
Additionally, consider the following best practices:

- Choose the metric (`L2`, `inner product`, `cosine`) and algorithm (`Brute force`, `HNSW`) based on your requirements.
- Tune `ef_construction`, `ef`, and `M` for HNSW to balance accuracy, speed, and memory usage. (Faiss's own HNSW index names the first two `efConstruction` and `efSearch`.)

By following these steps and ensuring your `id_to_text_map` is correctly populated, you should be able to retrieve relevant documents more accurately using Faiss.
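To make the metric choice concrete, here is a small pure-Python sketch (no Faiss involved) with made-up 2-D vectors. It shows inner product ranking a long but poorly aligned vector above a short, well-aligned one, while cosine and L2 prefer the aligned one. After L2-normalizing all vectors, inner product agrees with cosine. Whether the embedding models in this thread emit normalized vectors is not stated, so treat this as one thing to check rather than a diagnosis.

```python
import math
from typing import Dict, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_distance(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def normalize(v: Vector) -> Vector:
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a: Vector, b: Vector) -> float:
    return dot(normalize(a), normalize(b))


# Illustrative vectors only; real embeddings have hundreds of dimensions.
query: Vector = [1.0, 0.0]
docs: Dict[str, Vector] = {
    "aligned_short": [0.9, 0.1],    # points almost the same way as the query
    "misaligned_long": [3.0, 3.0],  # 45 degrees off, but with a large norm
}

# Higher is better for inner product and cosine; lower is better for L2.
best_by_ip = max(docs, key=lambda name: dot(query, docs[name]))
best_by_cosine = max(docs, key=lambda name: cosine(query, docs[name]))
best_by_l2 = min(docs, key=lambda name: l2_distance(query, docs[name]))

# After normalization, inner product and cosine give the same ranking.
unit_query = normalize(query)
unit_docs = {name: normalize(vec) for name, vec in docs.items()}
best_by_ip_normalized = max(unit_docs, key=lambda name: dot(unit_query, unit_docs[name]))

print(best_by_ip, best_by_cosine, best_by_l2, best_by_ip_normalized)
```

The practical upshot is that an inner-product index used as a stand-in for cosine similarity needs both the stored vectors and the query vectors normalized first.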
@dosu
We finally figured out that the `metadata` will be embedded by default, which will influence the retrieval. A possible solution is setting the `excluded_embed_metadata_keys` of each `TextNode` to a list containing the fields in `metadata` that you do not want to embed.
Question
The main problem is that whatever metric is used (we tried L2 and inner product), whatever algorithm is leveraged (we tried brute force and HNSW), and whatever embedding model is utilized (we tried Jina and BCE-embedding) when indexing and retrieving, the most relevant document either cannot be retrieved or is retrieved with a low score (or a large distance). I am just wondering why.