Closed GildeshAbhay closed 1 week ago
Hey there @GildeshAbhay! Good to see you diving into more adventurous territories with RAG and MongoDB. Let's see if we can tackle this error together.
Based on the error message you provided, the issue is related to the retrieval of a document with a specific `doc_id` from your MongoDB database. The error message `ValueError: doc_id ae078fc5-a18a-4d9c-a36b-56aa2793f381 not found.` indicates that the document with `doc_id` `ae078fc5-a18a-4d9c-a36b-56aa2793f381` could not be found in the database when the query engine tried to retrieve it.
Changing the embedding dimensions would not resolve this issue. The dimensions of the embeddings determine how the data is represented in the vector space, but the `doc_id` not found error suggests a mismatch or an issue in how documents are indexed or queried in the database.
Here are a few things you could check to resolve this issue:

1. **Verify that the document exists in the database:** Check your MongoDB database to ensure that the document with the `doc_id` `ae078fc5-a18a-4d9c-a36b-56aa2793f381` actually exists. You can do this by running a query directly on your MongoDB database.
2. **Check the `doc_id` used in the query:** Ensure that the `doc_id` used in your query matches the actual `doc_id` stored in the database. If there is a mismatch, this could be the cause of the error.
3. **Check the database and collection names:** Make sure that the database name (`db`) and collection name (`collection`) you're using in your code match the actual database and collection names in MongoDB.
4. **Ensure the vector search index is correctly created:** The `VectorStoreIndex` class in LlamaIndex handles the indexing of nodes and their embeddings. Make sure that the vector search index is correctly created and matches the configuration expected by your application.
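For reference on the last point, an Atlas Vector Search index definition for this kind of setup generally looks like the following. The `path` value and `numDimensions` here are assumptions: they must match the field your vector store writes embeddings to and your embedding model's output dimension.

```json
{
  "fields": [
    {
      "type": "vector",
      "path": "embedding",
      "numDimensions": 1536,
      "similarity": "cosine"
    }
  ]
}
```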
For more information on how LlamaIndex handles document indexing and retrieval in MongoDB, you can refer to the `MongoDocumentStore` class and the `VectorStoreIndex` class in the LlamaIndex repository.
I hope this helps! If you have any more questions or if the issue persists, please provide more details about your setup and the steps you've taken so far.
I checked, it's not a document id at all. It's most likely a node id.
@logan-markewich can you help please
@GildeshAbhay when using a vector db integration, only the vector store is used. However, you need access to more than just the nodes in the vector store for this to work.
You should either manually populate (and persist) the docstore on the storage context, or set `store_nodes_override=True` when creating your index.
Furthermore, for the auto-merging retriever to even work, I think you are missing a step? Normally you'd add all nodes to your docstore, and only index the leaf nodes. https://docs.llamaindex.ai/en/latest/examples/retrievers/auto_merging_retriever/?h=auto+mer
thanks a lot for responding!!
Here, I edited the code a bit.
```python
doc = Document(text=content)
node_parser = HierarchicalNodeParser.from_defaults(chunk_sizes=chunk_size)
nodes = node_parser.get_nodes_from_documents([doc])
leaf_nodes = get_leaf_nodes(nodes)
root_nodes = get_root_nodes(nodes)

for node in nodes:
    node_embedding = embed_model.get_text_embedding(
        node.get_content(metadata_mode="all")
    )
    node.embedding = node_embedding

len(nodes)

db = 'staging'
collection = 'abhay_test'
vector_store = MongoDBAtlasVectorSearch(
    client,
    db_name=db,
    collection_name=collection,
    index_name="nanopore_index_1",
    embeddings=OpenAIEmbeddings(),
)
vector_store.add(nodes)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex(nodes=nodes, storage_context=storage_context, store_nodes_override=True)
# note: this second assignment replaces the index above, and "nodes_override"
# should probably be "store_nodes_override"
index = VectorStoreIndex(nodes=leaf_nodes, storage_context=storage_context, nodes_override=True)
#index = VectorStoreIndex.from_vector_store(vector_store)
postproc = None
reranker = SentenceTransformerRerank(top_n=rr_metadata)
retriever = index.as_retriever(similarity_top_k=retrieval_metadata_similarity)
retriever = AutoMergingRetriever(retriever, index.storage_context, verbose=True)
response_synthesizer = get_response_synthesizer(response_mode=response_mode)
node_postprocessors = [postproc, reranker]
node_postprocessors = [processor for processor in node_postprocessors if processor is not None]
# note: this second assignment drops the node_postprocessors passed above
query_engine = RetrieverQueryEngine(retriever, node_postprocessors=node_postprocessors)
query_engine = RetrieverQueryEngine(retriever)
summary_whole = query_engine.query(rag_prompt_original)
```
It's working now. Tell me, however, a few things. Can you confirm that without this step, it's not possible to store embeddings in MongoDB and then read them back from MongoDB?

```python
vector_store.add(nodes)

for node in nodes:
    node_embedding = embed_model.get_text_embedding(
        node.get_content(metadata_mode="all")
    )
    node.embedding = node_embedding
```

One more thing: if I want to use those embeddings (mentioned above), what should I run?
Yea, it might not make a huge difference, especially if the top k is low
The vector store and docstore are different. If enough nodes retrieved from the vector store have the same parent, they are replaced with their parent node (which is only in the docstore).
You can pre-calculate and attach the nodes like you are, but if you didn't, the same would be done under the hood if they were missing.
You can't change how it's stored in mongodb.
I don't know what you mean by using those embeddings?
Thanks again for taking the time to reply! Appreciate it!
> Yea, it might not make a huge difference, especially if the top k is low
Cool. So that's sorted.
> The vector store and docstore are different. If enough nodes retrieved from the vector store have the same parent, they are replaced with their parent node (which is only in the docstore)
Can you please give me the code for both and explicitly show the difference? Basically, I want to know how to add node/doc information to both and then retrieve from both, so that I don't waste tokens on embeddings.
> You can pre-calculate and attach the nodes like you are, but if you didn't, the same would be done under the hood if they were missing.
If I didn't write the `node.embedding` lines, the indexing code gives an error saying "embed" is not defined.
> You can't change how it's stored in mongodb.
Okay. So if my overall goal is to store the embeddings in MongoDB just so that I can retrieve them from there whenever I want (so that re-indexing doesn't have to happen), is it better to store the embeddings locally as JSON files and then "upsert" them to MongoDB?
> I don't know what you mean by using those embeddings?
Using means using them so that I don't have to re-calculate the embeddings and waste tokens.
@logan-markewich can you please help here
@GildeshAbhay pretty lost tbh
The vector store stores all your embeddings. There's nothing wasted here.
The docstore stores nodes (i.e. in this case, the parent nodes, which are never embedded).
Let me walk through the auto merging algorithm step by step, seems there is some confusion
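The merging step can be illustrated in plain Python. This is a simplified sketch of the idea, not LlamaIndex's actual implementation, and the 0.5 merge threshold is an assumption:

```python
# Simplified sketch of auto-merging retrieval:
# - the vector store holds only leaf nodes (with embeddings)
# - the docstore holds every node, including parents (never embedded)
# - if enough retrieved leaves share a parent, they are swapped
#   for that parent node, fetched from the docstore.

# docstore: node_id -> (text, parent_id); root/parent nodes have parent_id=None
docstore = {
    "p1": ("full section about storage", None),
    "a": ("chunk A of section", "p1"),
    "b": ("chunk B of section", "p1"),
    "c": ("chunk C of section", "p1"),
    "p2": ("another section", None),
    "d": ("chunk D", "p2"),
    "e": ("chunk E", "p2"),
}

MERGE_THRESHOLD = 0.5  # assumed: merge if >50% of a parent's children were retrieved

def children_of(parent_id):
    return [nid for nid, (_, pid) in docstore.items() if pid == parent_id]

def auto_merge(retrieved_leaf_ids):
    merged, by_parent = [], {}
    for nid in retrieved_leaf_ids:
        pid = docstore[nid][1]
        by_parent.setdefault(pid, []).append(nid)
    for pid, hits in by_parent.items():
        if pid is not None and len(hits) / len(children_of(pid)) > MERGE_THRESHOLD:
            merged.append(pid)    # replace the children with their parent
        else:
            merged.extend(hits)   # keep the leaves as-is
    return merged

# "a" and "b" are 2 of p1's 3 children -> merged into p1; "d" is only
# 1 of p2's 2 children -> kept as a leaf.
print(auto_merge(["a", "b", "d"]))  # -> ['p1', 'd']
```

This is why the parent nodes must live in the docstore even though they are never embedded: the retriever can only swap leaves for a parent it can actually fetch.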
Did you solve your issue? I have the same one.
@GildeshAbhay
I have resolved the issue as follows, using MongoDB:

- `MongoDBAtlasVectorSearch` as the vector store.
- `MongoDocumentStore` as the document store.

Note: when using MongoDB with the Atlas Search index, you need to manually create the vector index over the document vectors. With an M10 cluster, you can simply code this process.
After that:
```python
index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
base_retriever = index.as_retriever(similarity_top_k=...)
storage_context = StorageContext.from_defaults(index_store=index, docstore=docstore)
retriever = AutoMergingRetriever(
    vector_retriever=base_retriever,
    storage_context=storage_context,
)
query_engine = RetrieverQueryEngine.from_args(
    retriever=retriever,
    node_postprocessors=[rerank_model],
    llm=...,
)
```
Question
So I am running a basic RAG application and storing embeddings in MongoDB, but when I run the query engine, I get the error `doc_id not found`. Here is my basic code:

```python
content = document["content"]
doc = Document(text=content)
node_parser = HierarchicalNodeParser.from_defaults(chunk_sizes=chunk_size)
nodes = node_parser.get_nodes_from_documents([doc])
```

I guess the problem is that one of the nodes is not found? Would changing the embedding dimensions help?