run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: ValueError: doc_id ae078fc5-a18a-4d9c-a36b-56aa2793f381 not found. #12603

Closed GildeshAbhay closed 1 week ago

GildeshAbhay commented 5 months ago

Question Validation

Question

So I am running a basic RAG application and storing embeddings in MongoDB, but when I run the query engine, I get the error `doc_id not found`.

Here is my basic code:

```python
content = document["content"]
doc = Document(text=content)
node_parser = HierarchicalNodeParser.from_defaults(chunk_sizes=chunk_size)
nodes = node_parser.get_nodes_from_documents([doc])

for node in nodes:
    node_embedding = embed_model.get_text_embedding(
        node.get_content(metadata_mode="all")
    )
    node.embedding = node_embedding

db = 'staging'
collection = 'abhay_test'

vector_store = MongoDBAtlasVectorSearch(
    client, db_name=db, collection_name=collection,
    index_name="nanopore_index", embeddings=OpenAIEmbeddings(),
)
vector_store.add(nodes)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex(nodes=nodes, storage_context=storage_context)
index1 = VectorStoreIndex.from_vector_store(vector_store)

postproc = None
reranker = SentenceTransformerRerank(top_n=rr_metadata)
retriever = index.as_retriever(similarity_top_k=retrieval_metadata_similarity)
retriever = AutoMergingRetriever(retriever, index.storage_context, verbose=True)
response_synthesizer = get_response_synthesizer(response_mode=response_mode)
node_postprocessors = [p for p in [postproc, reranker] if p is not None]
query_engine = RetrieverQueryEngine(retriever, node_postprocessors=node_postprocessors)
summary_whole = query_engine.query(rag_prompt_original)
summary_text = str(summary_whole)  # convert the summary object to a string
```

I guess the problem is that one of the nodes is not found? Would changing the embedding dimensions help?

dosubot[bot] commented 5 months ago

Hey there @GildeshAbhay! Good to see you diving into more adventurous territories with RAG and MongoDB. Let's see if we can tackle this error together.

Based on the error message you provided, it seems like the issue is related to the retrieval of a document with a specific doc_id from your MongoDB database. The error message ValueError: doc_id ae078fc5-a18a-4d9c-a36b-56aa2793f381 not found. indicates that the document with the doc_id ae078fc5-a18a-4d9c-a36b-56aa2793f381 could not be found in the database when the query engine tried to retrieve it.

Changing the embedding dimensions would not resolve this issue. The dimensions of the embeddings are related to how the data is represented in the vector space, but the doc_id not found error suggests that there is a mismatch or an issue in how documents are indexed or queried in the database.

Here are a few things you could check to resolve this issue:

  1. Verify that the document exists in the database: Check your MongoDB database to ensure that the document with the doc_id ae078fc5-a18a-4d9c-a36b-56aa2793f381 actually exists. You can do this by running a query directly on your MongoDB database.

  2. Check the doc_id used in the query: Ensure that the doc_id used in your query matches the actual doc_id stored in the database. If there is a mismatch, this could be the cause of the error.

  3. Check the database and collection names: Make sure that the database name (db) and collection name (collection) you're using in your code match the actual database and collection names in MongoDB.

  4. Ensure the vector search index is correctly created: The VectorStoreIndex class in LlamaIndex handles the indexing of nodes and their embeddings. Make sure that the vector search index is correctly created and matches the configuration expected by your application.
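For check 1, a quick way to look for the failing id directly in Atlas might look like this (a sketch; the connection string, database/collection names, and the `id` field name are assumptions that depend on your `MongoDBAtlasVectorSearch` version):

```python
# Hypothetical existence check for the failing id, run directly against
# the collection that backs the vector store.
from pymongo import MongoClient

client = MongoClient("mongodb+srv://...")  # your Atlas connection string
coll = client["staging"]["abhay_test"]

doc_id = "ae078fc5-a18a-4d9c-a36b-56aa2793f381"
# MongoDBAtlasVectorSearch typically stores the node id in an "id" field;
# check "_id" too in case your schema differs.
hit = coll.find_one({"id": doc_id}) or coll.find_one({"_id": doc_id})
print("found" if hit else "missing")
```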

For more information on how LlamaIndex handles document indexing and retrieval in MongoDB, you can refer to the MongoDocumentStore class and the VectorStoreIndex class in the LlamaIndex repository.

I hope this helps! If you have any more questions or if the issue persists, please provide more details about your setup and the steps you've taken so far.

GildeshAbhay commented 5 months ago

I checked; it's not a document id at all. It's most likely a node id.

GildeshAbhay commented 5 months ago

@logan-markewich can you help please

logan-markewich commented 5 months ago

@GildeshAbhay when using a vector db integration, only the vector store is used. However, you need access to more than just the nodes in the vector store for this to work.

You should either manually populate (and persist) the docstore on the storage context, or set store_nodes_override=True when creating your index.

Furthermore, for the auto-merging retriever to even work, I think you are missing a step? Normally you'd add all nodes to your docstore, and only index the leaf nodes. https://docs.llamaindex.ai/en/latest/examples/retrievers/auto_merging_retriever/?h=auto+mer
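A minimal sketch of that setup, assuming recent (>= 0.10) llama-index import paths and reusing names from the snippets above (`content` and `vector_store` are placeholders from the thread, not a complete program):

```python
# Sketch of the suggested setup: ALL nodes (parents included) go into the
# docstore, but only the leaf nodes are embedded and indexed.
from llama_index.core import Document, StorageContext, VectorStoreIndex
from llama_index.core.node_parser import HierarchicalNodeParser, get_leaf_nodes
from llama_index.core.retrievers import AutoMergingRetriever

doc = Document(text=content)  # `content` as in the original snippet
nodes = HierarchicalNodeParser.from_defaults().get_nodes_from_documents([doc])
leaf_nodes = get_leaf_nodes(nodes)

storage_context = StorageContext.from_defaults(vector_store=vector_store)
storage_context.docstore.add_documents(nodes)  # parents + leaves

# Only the leaves are embedded; parents stay in the docstore for merging.
index = VectorStoreIndex(leaf_nodes, storage_context=storage_context)

base_retriever = index.as_retriever(similarity_top_k=6)
retriever = AutoMergingRetriever(base_retriever, storage_context, verbose=True)
```

Note that the default docstore is in-memory, so to reuse it across sessions you would also persist it (for example with `storage_context.persist(...)` or a MongoDB-backed docstore).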

GildeshAbhay commented 5 months ago

Thanks a lot for responding!!

Here, I edited the code a bit.

```python
doc = Document(text=content)
node_parser = HierarchicalNodeParser.from_defaults(chunk_sizes=chunk_size)
nodes = node_parser.get_nodes_from_documents([doc])

leaf_nodes = get_leaf_nodes(nodes)
root_nodes = get_root_nodes(nodes)

for node in nodes:
    node_embedding = embed_model.get_text_embedding(
        node.get_content(metadata_mode="all")
    )
    node.embedding = node_embedding

db = 'staging'
collection = 'abhay_test'

vector_store = MongoDBAtlasVectorSearch(
    client, db_name=db, collection_name=collection,
    index_name="nanopore_index_1", embeddings=OpenAIEmbeddings(),
)
vector_store.add(nodes)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex(nodes=leaf_nodes, storage_context=storage_context,
                         store_nodes_override=True)
#index = VectorStoreIndex.from_vector_store(vector_store)

postproc = None
reranker = SentenceTransformerRerank(top_n=rr_metadata)
retriever = index.as_retriever(similarity_top_k=retrieval_metadata_similarity)
retriever = AutoMergingRetriever(retriever, index.storage_context, verbose=True)
response_synthesizer = get_response_synthesizer(response_mode=response_mode)
node_postprocessors = [p for p in [postproc, reranker] if p is not None]
query_engine = RetrieverQueryEngine(retriever, node_postprocessors=node_postprocessors)
summary_whole = query_engine.query(rag_prompt_original)
```

It's working now. However, tell me a few things:

  1. If I put leaf nodes instead of all nodes in the index, the final output is only marginally different (95% the same).
  2. The docstore part that you mentioned is already taken care of by `vector_store.add(nodes)` — can you confirm?
  3. For actually storing the embeddings in MongoDB, is this step necessary?

     ```python
     for node in nodes:
         node_embedding = embed_model.get_text_embedding(
             node.get_content(metadata_mode="all")
         )
         node.embedding = node_embedding
     ```

     Without this step, is it not possible to store embeddings in MongoDB and then read them back from MongoDB?
  4. Lastly, each node is currently stored separately (see attached image), but I want all the node embeddings to be stored in the same row of data that holds the rest of the node's information (see attached image), under a key-value pair. How can I achieve this?
GildeshAbhay commented 5 months ago

One more thing: if I want to use those embeddings (mentioned above), what should I run?

logan-markewich commented 5 months ago

Yea, it might not make a huge difference, especially if the top k is low

The vector store and docstore are different. If enough nodes retrieved from the vector store have the same parent, they are replaced with their parent node (which is only in the docstore).

You can pre-calculate and attach the nodes like you are, but if you didn't, the same would be done under the hood if they were missing.

You can't change how it's stored in mongodb.

I don't know what you mean by using those embeddings?

GildeshAbhay commented 5 months ago

Thanks again for taking the time to reply! Appreciate it!

> Yea, it might not make a huge difference, especially if the top k is low

Cool. So that's sorted.

> The vector store and docstore are different. If enough nodes are retrieved from the vector store that have the same parent, they are replaced with their parent node (which is only in the docstore)

Can you please give me the code for both and explicitly show the difference? Basically, I want to know how to add node/doc information to both and then retrieve from both, so that I don't waste keys on embeddings.

> You can pre-calculate and attach the nodes like you are, but if you didn't, the same would be done under the hood if they were missing.

If I don't write the `node.embedding` lines, the index code gives an error: it says "embed" is not defined.

> You can't change how it's stored in mongodb.

Okay. So if my overall goal is to store the embeddings in MongoDB just so that I can retrieve them from there (whenever I want to use them, so that re-indexing doesn't have to happen), is it better to store the embeddings locally as JSON files and then upsert them into MongoDB?

> I don't know what you mean by using those embeddings?

By "using" I mean using them so that I don't have to re-calculate the embeddings and waste tokens.

GildeshAbhay commented 5 months ago

@logan-markewich can you please help here

logan-markewich commented 4 months ago

@GildeshAbhay pretty lost tbh

The vector store stores all your embeddings. There's nothing wasted here.

The docstore stores nodes (i.e., in this case, the parent nodes, which are never embedded).

Let me walk through the auto-merging algorithm step by step; it seems there is some confusion.
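The merging step described above can be sketched with a toy, library-free example (the threshold and data structures here are assumptions for illustration, not LlamaIndex internals):

```python
# Toy sketch of the auto-merging decision: if enough retrieved leaf nodes
# share a parent, swap them for the parent node fetched from the docstore.
from collections import defaultdict

def auto_merge(retrieved, parent_of, docstore, ratio=0.5):
    """retrieved: leaf node ids; parent_of: leaf id -> parent id;
    docstore: parent id -> (child_count, text)."""
    by_parent = defaultdict(list)
    for leaf in retrieved:
        by_parent[parent_of[leaf]].append(leaf)
    merged = []
    for parent, leaves in by_parent.items():
        child_count, _ = docstore[parent]
        if len(leaves) / child_count >= ratio:
            merged.append(parent)      # replace the leaves with their parent
        else:
            merged.extend(leaves)      # keep the individual leaves
    return merged

parent_of = {"l1": "p1", "l2": "p1", "l3": "p2"}
docstore = {"p1": (2, "parent 1 text"), "p2": (4, "parent 2 text")}
print(auto_merge(["l1", "l2", "l3"], parent_of, docstore))
# → ['p1', 'l3']  (l1 and l2 cover all of p1's children, so they merge)
```

This is why the parents must live in the docstore: the retriever never embeds them, but it needs to be able to fetch them by id when a merge fires — which is exactly the lookup that raised the `doc_id ... not found` error.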

YanSte commented 3 months ago

Did you solve your issue? I have the same one.

YanSte commented 3 months ago

@GildeshAbhay

I have resolved the issue as follows:

Using MongoDB:

Note: when using MongoDB with an Atlas Search index, you need to manually create the vector search index for the document vectors. With an M10 cluster, you can simply script this process.

After that:

```python
index = VectorStoreIndex.from_vector_store(vector_store=vector_store)

base_retriever = index.as_retriever(similarity_top_k=...)

storage_context = StorageContext.from_defaults(index_store=index, docstore=docstore)

retriever = AutoMergingRetriever(
    vector_retriever=base_retriever,
    storage_context=storage_context
)

query_engine = RetrieverQueryEngine.from_args(
    retriever=retriever,
    node_postprocessors=[rerank_model],
    llm=...
)
```