[Question]: ValueError: doc_id ae078fc5-a18a-4d9c-a36b-56aa2793f381 not found.

GildeshAbhay commented 5 months ago

Question Validation

[X] I have searched both the documentation and discord for an answer.

Question

So i am running a basic RAG application and storing embeddings in mongodb, but when I run the query engine, I get the error doc_id not found.

Here is my basic code ` content = document["content"] doc = Document(text=content) node_parser = HierarchicalNodeParser.from_defaults(chunk_sizes=chunk_size) nodes = node_parser.get_nodes_from_documents([doc])

    for node in nodes:
        node_embedding = embed_model.get_text_embedding(
            node.get_content(metadata_mode="all")
        )
        node.embedding = node_embedding
    len(nodes)
    db = 'staging'
    collection = 'abhay_test'

    vector_store = MongoDBAtlasVectorSearch(client, db_name=db, collection_name=collection, index_name="nanopore_index", embeddings = OpenAIEmbeddings())
    vector_store.add(nodes)
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    index = VectorStoreIndex(nodes=nodes, storage_context=storage_context,)        
    index1 = VectorStoreIndex.from_vector_store(vector_store)

    postproc = None
    reranker = SentenceTransformerRerank(top_n = rr_metadata)
    retriever = index.as_retriever(similarity_top_k=retrieval_metadata_similarity)
    retriever = AutoMergingRetriever(retriever,index.storage_context,verbose=True)
    response_synthesizer = get_response_synthesizer(response_mode=response_mode)
    node_postprocessors = [postproc, reranker]
    node_postprocessors = [processor for processor in node_postprocessors if processor is not None]
    query_engine = RetrieverQueryEngine(retriever, node_postprocessors=node_postprocessors)
    query_engine = RetrieverQueryEngine(retriever)
    summary_whole = query_engine.query(rag_prompt_original)
    summary_text = str(summary_whole)  # Convert the summary object to a string

` I guess the problem is that one of the nodes is not found ? Would changing the embedding dimensions help?

dosubot[bot] commented 5 months ago

Hey there @GildeshAbhay! Good to see you diving into more adventurous territories with RAG and MongoDB. Let's see if we can tackle this error together.

Based on the error message you provided, it seems like the issue is related to the retrieval of a document with a specific doc_id from your MongoDB database. The error message ValueError: doc_id ae078fc5-a18a-4d9c-a36b-56aa2793f381 not found. indicates that the document with the doc_id ae078fc5-a18a-4d9c-a36b-56aa2793f381 could not be found in the database when the query engine tried to retrieve it.

Changing the embedding dimensions would not resolve this issue. The dimensions of the embeddings are related to how the data is represented in the vector space, but the doc_id not found error suggests that there is a mismatch or an issue in how documents are indexed or queried in the database.

Here are a few things you could check to resolve this issue:

Verify that the document exists in the database: Check your MongoDB database to ensure that the document with the doc_id ae078fc5-a18a-4d9c-a36b-56aa2793f381 actually exists. You can do this by running a query directly on your MongoDB database.
Check the doc_id used in the query: Ensure that the doc_id used in your query matches the actual doc_id stored in the database. If there is a mismatch, this could be the cause of the error.
Check the database and collection names: Make sure that the database name (db) and collection name (collection) you're using in your code match the actual database and collection names in MongoDB.
Ensure the vector search index is correctly created: The VectorStoreIndex class in LlamaIndex handles the indexing of nodes and their embeddings. Make sure that the vector search index is correctly created and matches the configuration expected by your application.

For more information on how LlamaIndex handles document indexing and retrieval in MongoDB, you can refer to the MongoDocumentStore class and the VectorStoreIndex class in the LlamaIndex repository.

I hope this helps! If you have any more questions or if the issue persists, please provide more details about your setup and the steps you've taken so far.

Sources

#### About Dosu

This response is meant to be useful and save you time. It is **not meant to be a precise solution**, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant. If you want to continue the conversation, start your reply with @dosu-bot.

GildeshAbhay commented 5 months ago

I checked, its not a document id at all. It's most likely a node id

GildeshAbhay commented 5 months ago

@logan-markewich can you help please

logan-markewich commented 5 months ago

@GildeshAbhay when using a vector db integration, only the vector store is used. However, you need access to more than just the nodes in the vector store for this to work.

You should either manually populate (and persist) the docstore on the storage context, or set store_nodes_override=True when creating your index.

Furthermore, for auto-merging retriever to even work, I think you are missing a step? Normalizing you'd add all nodes to your docstore, and only index the leaf nodes. https://docs.llamaindex.ai/en/latest/examples/retrievers/auto_merging_retriever/?h=auto+mer

GildeshAbhay commented 5 months ago

thanks a lot for responding!!

Here, I edited the code a bit.

` doc = Document(text=content) node_parser = HierarchicalNodeParser.from_defaults(chunk_sizes=chunk_size) nodes = node_parser.get_nodes_from_documents([doc])

    leaf_nodes = get_leaf_nodes(nodes)
    root_nodes = get_root_nodes(nodes)

    for node in nodes:
        node_embedding = embed_model.get_text_embedding(
            node.get_content(metadata_mode="all")
        )
        node.embedding = node_embedding
    len(nodes)
    db = 'staging'
    collection = 'abhay_test'

    vector_store = MongoDBAtlasVectorSearch(client, db_name=db, collection_name=collection, index_name="nanopore_index_1", embeddings = OpenAIEmbeddings())
    vector_store.add(nodes)
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    index = VectorStoreIndex(nodes=nodes, storage_context=storage_context,store_nodes_override=True)        
    index = VectorStoreIndex(nodes = leaf_nodes, storage_context = storage_context, nodes_override= True)
    #index = VectorStoreIndex.from_vector_store(vector_store)

    postproc = None
    reranker = SentenceTransformerRerank(top_n = rr_metadata)
    retriever = index.as_retriever(similarity_top_k=retrieval_metadata_similarity)
    retriever = AutoMergingRetriever(retriever,index.storage_context,verbose=True)
    response_synthesizer = get_response_synthesizer(response_mode=response_mode)
    node_postprocessors = [postproc, reranker]
    node_postprocessors = [processor for processor in node_postprocessors if processor is not None]
    query_engine = RetrieverQueryEngine(retriever, node_postprocessors=node_postprocessors)
    query_engine = RetrieverQueryEngine(retriever)
    summary_whole = query_engine.query(rag_prompt_original)`

It's working now. Tell me however, a few things.

If I put leaf nodes instead of nodes in index, the fina, output is only marginally different (95% same)
The docstore part that you mentioned is already taken care by the below code vector_store.add(nodes) Can you confirm?
For actually storing the embeddings in mongodb, is this step necessary? for node in nodes: node_embedding = embed_model.get_text_embedding( node.get_content(metadata_mode="all") ) node.embedding = node_embedding Without this, step, its not possible to store embeddings in mongo db ? And then read it from mongodb?
LAstly, currently each node is stored separately like this but i want all the nodes embeddings to be stored in the same row of data which harbors the information about the rest of the node. under a key value pair. How to achieve this?

GildeshAbhay commented 5 months ago

one more thing, if I want to use those embeddings (mentioned above), what should i run?

logan-markewich commented 5 months ago

Yea, it might not make a huge difference, especially if the top k is low

The vector store and docstore are different. If enough nodes are retrieved from the vector store that have the same parent, the are replaced with their parent node (which is only in the docstore)

You can pre-calculate and attach the nodes like you are, but if you didn't, the same would be done under the hood if they were missing.

You can't change how it's stored in mongodb.

I don't know what you mean by using those embeddings?

GildeshAbhay commented 5 months ago

THanks again for taking the time to reply ! Appreciate it!

Yea, it might not make a huge difference, especially if the top k is low

Cool. So that's sorted.

The vector store and docstore are different. If enough nodes are retrieved from the vector store that have the same parent, the are replaced with their parent node (which is only in the docstore)

Can you please give me the codes for both and explicitly show the difference? Basically, I want to know how to add node/doc information to both and then retrieve both so that I don't waste keys on embeddings.

You can pre-calculate and attach the nodes like you are, but if you didn't, the same would be done under the hood if they were missing.

If I didn't write the node.embeddings lines, the index code gives an error. Says, "embed" not defined.

You can't change how it's stored in mongodb.

Okay. So If my overall goal is to store the embeddings in mongodb just so that I can retrieve it from there (whenever I want to use it, so that re-indexing doesn't have to happen), is it better to then store the embeddings locally as json files and then "upsert" it to mongodb?

I don't know what you mean by using those embeddings?

Using means, using them so that I don't have re-calculate the embeddings and waste tokens.

GildeshAbhay commented 5 months ago

@logan-markewich can you please help here

logan-markewich commented 4 months ago

@GildeshAbhay pretty lost tbh

The vector store stores all your embeddings. There's nothing wasted here.

The docstore stores nodes (I.e. in this case, the parent nodes, which are never embedded)

Let me walk through the auto merging algorithm step by step, seems there is some confusion

hierarchical node parser parses your nodes into hierarchies. By default, there is chunks of 2048, then that is chunked into 4 512 token chunks, and that is chunked again into 4 128 token chunks
you only embed and store the bottom level into your vector store (the 128 token chunks)
when using the auto merging retriever, if more than 2 retrieved nodes have the same parent, their parent is fetched from the docstore (no embeddings needed). So if at least 2 128 token chunks are retrieved that have the same parent, they are replaced with the parent 512 token chunk
after replacing, if there are still 2 or more nodes with the same parent, they are replaced with their 2048 token chunk parent, which is from the docstore
once nothing is left that is needed to be merged into a parent chunk, the final list of chunks is returned and given to the response synthesizer

YanSte commented 3 months ago

Did you solve your issue i have the same ..

YanSte commented 3 months ago

@GildeshAbhay

I have resolved the issue as follows:

Using MongoDB:

Created a MongoDBAtlasVectorSearch as the vector store.
Created a MongoDocumentStore as the document store.

Note: When using MongoDB with the Search Index Atlas, you need to manually create the index vector with doc vector. With a Cluster M10, you can simply code this process.

After that:

index = VectorStoreIndex.from_vector_store(vector_store=vector_store)

base_retriever = index.as_retriever(similarity_top_k=...)

storage_context = StorageContext.from_defaults(index_store=index, docstore=docstore)

retriever = AutoMergingRetriever(
    vector_retriever=base_retriever,
    storage_context=storage_context
)

query_engine = RetrieverQueryEngine.from_args(
    retriever=retriever,
    node_postprocessors=[rerank_model],
    llm=...
)

run-llama / llama_index