run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai

[Question]: Showing duplicate entries in my vector db #14611

Open iam-amiya opened 3 months ago

iam-amiya commented 3 months ago

Question

I have created nodes using an ingestion pipeline:

import time
import nest_asyncio
nest_asyncio.apply()
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.extractors import SummaryExtractor

# TextCleaner, DEFAULT_SUMMARY_PROMPT_TMPL, and VoyageEmbeddModel are
# defined elsewhere in my code.
pipeline = IngestionPipeline(
    transformations=[
        TextCleaner(),
        SummaryExtractor(prompt_template=DEFAULT_SUMMARY_PROMPT_TMPL),
        VoyageEmbeddModel,
    ],
)

CvFirstChunk = pipeline.run(documents=first_node)
time.sleep(90)
CvSecondChunk = pipeline.run(documents=second_node)
time.sleep(90)
CvThirdChunk = pipeline.run(documents=third_node)
time.sleep(90)
CvFourthChunk = pipeline.run(documents=fourth_node)
time.sleep(90)
CvFifthChunk = pipeline.run(documents=fifth_node)
time.sleep(90)
CvSixthChunk = pipeline.run(documents=sixth_node)

After concatenating all the chunks I have 113 nodes in total. However, when I upsert those nodes into my Pinecone vector DB, it shows a vector count of 572.

This is the code I am using to upsert the nodes into my vector DB:

from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.vector_stores.pinecone import PineconeVectorStore

vector_store = PineconeVectorStore(pinecone_index=pinecone_index)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    nodes,
    storage_context=storage_context,
    embed_model=EmbeddModel,  # EmbeddModel is my embedding model instance
)

Can anyone please help me out with this?

dosubot[bot] commented 3 months ago

The issue of duplicate entries in your vector database could be due to several factors in your ingestion and upserting process. Here are some steps and best practices to help you identify and resolve the problem:

  1. Ensure Unique Node Identifiers: Each node should have a unique identifier to prevent duplicates. When using the IngestionPipeline, make sure that each node has a unique ref_doc_id or node.id_.

  2. Check for Existing Hashes: Before upserting nodes, check if a node with the same ref_doc_id already exists in the document store and if its hash matches the current node's hash. If the hash is different, update the node; otherwise, skip it.

  3. Handle Upserts and Deletions: Use the docstore_strategy to manage duplicates by comparing document hashes and IDs. The IngestionPipeline supports the strategies UPSERTS, DUPLICATES_ONLY, and UPSERTS_AND_DELETE (a minimal configuration sketch follows this list).

  4. PineconeVectorStore Configurations: Ensure that you are using appropriate PineconeVectorStore settings to manage duplicates. Note that Pinecone upserts by vector ID: inserting the same ID twice replaces the record, while nodes that receive fresh IDs on every run always create new vectors.
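
For example, here is a minimal sketch of a pipeline with deduplication enabled. It assumes your transformations and vector_store are defined as in your question, and the stable doc_id scheme and the raw_texts variable are illustrative placeholders:

from llama_index.core import Document
from llama_index.core.ingestion import DocstoreStrategy, IngestionPipeline
from llama_index.core.storage.docstore import SimpleDocumentStore

# Give each source document a stable, deterministic ID so that re-running
# the pipeline yields the same ref_doc_id and the hash check can catch
# duplicates ("cv-{i}" is just an example scheme; raw_texts stands for
# your source texts).
documents = [
    Document(text=text, doc_id=f"cv-{i}") for i, text in enumerate(raw_texts)
]

pipeline = IngestionPipeline(
    transformations=[
        TextCleaner(),
        SummaryExtractor(prompt_template=DEFAULT_SUMMARY_PROMPT_TMPL),
        VoyageEmbeddModel,
    ],
    docstore=SimpleDocumentStore(),  # tracks document hashes for dedup
    vector_store=vector_store,  # nodes are embedded and written here
    docstore_strategy=DocstoreStrategy.UPSERTS,
)

nodes = pipeline.run(documents=documents)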

For reference, here is how the IngestionPipeline handles upserts internally:

def _handle_upserts(self, nodes: List[BaseNode], store_doc_text: bool = True) -> List[BaseNode]:
    assert self.docstore is not None

    existing_doc_ids_before = set(self.docstore.get_all_document_hashes().values())
    doc_ids_from_nodes = set()
    deduped_nodes_to_run = {}
    for node in nodes:
        # Fall back to the node's own ID when it has no source document.
        ref_doc_id = node.ref_doc_id if node.ref_doc_id else node.id_
        doc_ids_from_nodes.add(ref_doc_id)
        existing_hash = self.docstore.get_document_hash(ref_doc_id)
        if not existing_hash:
            # New document: record its hash and process it.
            self.docstore.set_document_hash(ref_doc_id, node.hash)
            deduped_nodes_to_run[ref_doc_id] = node
        elif existing_hash and existing_hash != node.hash:
            # Changed document: delete the stale copy everywhere, then re-run it.
            self.docstore.delete_ref_doc(ref_doc_id, raise_error=False)
            if self.vector_store is not None:
                self.vector_store.delete(ref_doc_id)
            self.docstore.set_document_hash(ref_doc_id, node.hash)
            deduped_nodes_to_run[ref_doc_id] = node
        else:
            continue  # unchanged document: skip it

    if self.docstore_strategy == DocstoreStrategy.UPSERTS_AND_DELETE:
        # Remove documents that existed before but are absent from this run.
        doc_ids_to_delete = existing_doc_ids_before - doc_ids_from_nodes
        for ref_doc_id in doc_ids_to_delete:
            self.docstore.delete_document(ref_doc_id)
            if self.vector_store is not None:
                self.vector_store.delete(ref_doc_id)

    nodes_to_run = list(deduped_nodes_to_run.values())
    self.docstore.add_documents(nodes_to_run, store_text=store_doc_text)
    return nodes_to_run
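
This dedup logic only runs when the pipeline is constructed with a docstore, as in the sketch above. As a quick sanity check (illustrative, reusing that pipeline), a second run over the same unchanged documents should produce no new nodes:

nodes_first = pipeline.run(documents=documents)  # first run ingests everything
nodes_second = pipeline.run(documents=documents)  # hashes match, nothing new
print(len(nodes_first), len(nodes_second))  # e.g. "113 0"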

Additionally, ensure that your PineconeVectorStore is configured correctly:

vector_store = PineconeVectorStore(
    pinecone_index=pinecone_index,
    namespace="my_namespace",  # keep separate runs isolated per namespace
    insert_kwargs={"some_key": "some_value"},
    batch_size=50,
    remove_text_from_metadata=True,
    add_sparse_vector=True,
    tokenizer=my_tokenizer,  # used to build the sparse vectors
    default_empty_query_vector=[0.0] * 128,  # should match your embedding dimension
)
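
To confirm whether duplicates are what is inflating the count, you can also inspect the index directly with the Pinecone client (a sketch; note that delete_all wipes the namespace, so only use it if you intend to re-ingest):

stats = pinecone_index.describe_index_stats()
print(stats)  # compare total_vector_count with your expected 113 nodes

# If stale duplicates are already stored, one option is to clear the
# namespace and re-ingest with deduplication enabled (destructive!).
pinecone_index.delete(delete_all=True, namespace="my_namespace")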

By following these practices, you can help prevent and manage duplicate entries in your vector database [1][2][3].

To continue talking to Dosu, mention @dosu.