iam-amiya opened this issue 3 months ago
The issue of duplicate entries in your vector database could be due to several factors in your ingestion and upserting process. Here are some steps and best practices to help you identify and resolve the problem:
1. Ensure unique node identifiers: Each node should have a unique identifier to prevent duplicates. When using the `IngestionPipeline`, make sure that each node has a unique `ref_doc_id` or `node.id_`.
2. Check for existing hashes: Before upserting nodes, check whether a node with the same `ref_doc_id` already exists in the document store and whether its hash matches the current node's hash. If the hash differs, update the node; otherwise, skip it.
3. Handle upserts and deletions: Use the `docstore_strategy` to manage duplicates by comparing document hashes or IDs. The `IngestionPipeline` supports the strategies `UPSERTS`, `DUPLICATES_ONLY`, and `UPSERTS_AND_DELETE`.
4. Configure the `PineconeVectorStore` appropriately: Make sure your vector store configuration handles duplicates. Note that Pinecone replaces an existing entry only when a vector is upserted with the same ID; upserting the same content under a new ID creates a new vector.
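The hash-comparison step above (check, then insert, update, or skip) can be sketched with a toy in-memory store. The `upsert` function, the plain-dict stores, and the SHA-256 hashing here are illustrative assumptions, not the LlamaIndex API:

```python
import hashlib

def upsert(docstore: dict, vector_store: dict, doc_id: str, text: str) -> str:
    """Insert or update a document keyed by a stable ID, skipping unchanged content.

    Returns "insert", "update", or "skip".
    """
    new_hash = hashlib.sha256(text.encode()).hexdigest()
    existing_hash = docstore.get(doc_id)
    if existing_hash is None:
        # never seen this ID before: insert
        docstore[doc_id] = new_hash
        vector_store[doc_id] = text
        return "insert"
    if existing_hash != new_hash:
        # same ID, changed content: replace in place (no second copy)
        docstore[doc_id] = new_hash
        vector_store[doc_id] = text
        return "update"
    # same ID, same content: do nothing
    return "skip"

docstore, vectors = {}, {}
print(upsert(docstore, vectors, "doc-1", "hello"))   # insert
print(upsert(docstore, vectors, "doc-1", "hello"))   # skip
print(upsert(docstore, vectors, "doc-1", "hello!"))  # update
print(len(vectors))                                  # 1
```

Because every path keys on the same stable `doc_id`, re-running the function on unchanged input never grows the store.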
Here is an example of how the `IngestionPipeline` handles upserts internally:
```python
def _handle_upserts(
    self, nodes: List[BaseNode], store_doc_text: bool = True
) -> List[BaseNode]:
    assert self.docstore is not None

    existing_doc_ids_before = set(self.docstore.get_all_document_hashes().values())
    doc_ids_from_nodes = set()
    deduped_nodes_to_run = {}
    for node in nodes:
        ref_doc_id = node.ref_doc_id if node.ref_doc_id else node.id_
        doc_ids_from_nodes.add(ref_doc_id)
        existing_hash = self.docstore.get_document_hash(ref_doc_id)
        if not existing_hash:
            # document doesn't exist yet, so run it through the pipeline
            self.docstore.set_document_hash(ref_doc_id, node.hash)
            deduped_nodes_to_run[ref_doc_id] = node
        elif existing_hash and existing_hash != node.hash:
            # document exists but has changed: delete the old version and re-run
            self.docstore.delete_ref_doc(ref_doc_id, raise_error=False)
            if self.vector_store is not None:
                self.vector_store.delete(ref_doc_id)
            self.docstore.set_document_hash(ref_doc_id, node.hash)
            deduped_nodes_to_run[ref_doc_id] = node
        else:
            # document exists and is unchanged, so skip it
            continue

    if self.docstore_strategy == DocstoreStrategy.UPSERTS_AND_DELETE:
        # remove docs that were in the store before but are absent from this run
        doc_ids_to_delete = existing_doc_ids_before - doc_ids_from_nodes
        for ref_doc_id in doc_ids_to_delete:
            self.docstore.delete_document(ref_doc_id)
            if self.vector_store is not None:
                self.vector_store.delete(ref_doc_id)

    nodes_to_run = list(deduped_nodes_to_run.values())
    self.docstore.add_documents(nodes_to_run, store_text=store_doc_text)
    return nodes_to_run
```
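The final `UPSERTS_AND_DELETE` branch above can be illustrated in isolation with a toy in-memory store. `prune_stale` and the dict-based stores are hypothetical helpers for illustration, not the real docstore API:

```python
def prune_stale(docstore: dict, vector_store: dict, seen_ids: set) -> list:
    """Delete documents that existed before the run but were not seen this run."""
    stale = [doc_id for doc_id in list(docstore) if doc_id not in seen_ids]
    for doc_id in stale:
        # remove from both stores so they stay in sync
        del docstore[doc_id]
        vector_store.pop(doc_id, None)
    return stale

docstore = {"a": "hash1", "b": "hash2", "c": "hash3"}
vectors = {"a": [0.1], "b": [0.2], "c": [0.3]}
removed = prune_stale(docstore, vectors, seen_ids={"a", "c"})
print(removed)           # ['b']
print(sorted(docstore))  # ['a', 'c']
```

This is the deletion half of the strategy: anything absent from the current batch is treated as removed at the source and purged from both stores.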
Additionally, ensure that your `PineconeVectorStore` is configured correctly:
```python
vector_store = PineconeVectorStore(
    pinecone_index=pinecone_index,
    namespace="my_namespace",
    insert_kwargs={"some_key": "some_value"},
    batch_size=50,
    remove_text_from_metadata=True,
    add_sparse_vector=True,
    tokenizer=my_tokenizer,
    default_empty_query_vector=[0.0] * 128,
)
```
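A likely cause of 113 nodes producing a much larger vector count is that each pipeline run generates fresh random node IDs, so the vector store treats every upsert as a brand-new vector instead of overwriting the old one. A toy illustration of the difference, assuming a dict stands in for the index (this is not the Pinecone client API):

```python
import hashlib
import uuid

def make_id(text: str, deterministic: bool) -> str:
    if deterministic:
        # same content always maps to the same ID, so re-runs overwrite
        return hashlib.md5(text.encode()).hexdigest()
    # a fresh ID every run, so re-runs accumulate duplicates
    return str(uuid.uuid4())

def ingest(index: dict, chunks: list, deterministic: bool) -> None:
    for chunk in chunks:
        index[make_id(chunk, deterministic)] = chunk  # upsert keyed by ID

chunks = ["chunk-%d" % i for i in range(113)]
random_ids, stable_ids = {}, {}
for _ in range(5):  # simulate re-running the pipeline five times
    ingest(random_ids, chunks, deterministic=False)
    ingest(stable_ids, chunks, deterministic=True)
print(len(random_ids))  # 565 -> count keeps growing with each run
print(len(stable_ids))  # 113 -> count stays stable
```

If your vector count grows every time you re-run ingestion, check that node IDs (or `ref_doc_id` values) are derived from the content or source document rather than generated randomly per run.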
By following these practices, you can help prevent and manage duplicate entries in your vector database [1][2][3].
Question
I have created nodes using an ingestion pipeline:
I have a total of 113 nodes after concatenating all the chunks. However, when I upsert those nodes into my Pinecone vector DB, it shows a vector count of 572.
This is the code I am using to upsert the nodes to my vector DB.
Can anyone please help me out with it?