run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: Getting error doc_id XXXXXXXXXXXXXXXXXX not found after running "query_engine.query(rag_prompt)" #12495

Closed GildeshAbhay closed 2 months ago

GildeshAbhay commented 6 months ago

Question Validation

Question

  Cell In[255], line 1
    summary_whole = query_engine.query(rag_prompt) #len(summary.source_nodes)

  File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\llama_index\core\instrumentation\dispatcher.py:102 in wrapper
    self.span_drop(*args, id=id, err=e, **kwargs)

  File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\llama_index\core\instrumentation\dispatcher.py:77 in span_drop
    h.span_drop(*args, id=id, err=err, **kwargs)

  File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\llama_index\core\instrumentation\span_handlers\base.py:47 in span_drop
    self.prepare_to_drop_span(*args, id=id, err=err, **kwargs)

  File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\llama_index\core\instrumentation\span_handlers\null.py:35 in prepare_to_drop_span
    raise err

  File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\llama_index\core\instrumentation\dispatcher.py:100 in wrapper
    result = func(*args, **kwargs)

  File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\llama_index\core\base\base_query_engine.py:51 in query
    query_result = self._query(str_or_query_bundle)

  File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\llama_index\core\instrumentation\dispatcher.py:102 in wrapper
    self.span_drop(*args, id=id, err=e, **kwargs)

  File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\llama_index\core\instrumentation\dispatcher.py:77 in span_drop
    h.span_drop(*args, id=id, err=err, **kwargs)

  File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\llama_index\core\instrumentation\span_handlers\base.py:47 in span_drop
    self.prepare_to_drop_span(*args, id=id, err=err, **kwargs)

  File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\llama_index\core\instrumentation\span_handlers\null.py:35 in prepare_to_drop_span
    raise err

  File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\llama_index\core\instrumentation\dispatcher.py:100 in wrapper
    result = func(*args, **kwargs)

  File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\llama_index\core\query_engine\retriever_query_engine.py:189 in _query
    nodes = self.retrieve(query_bundle)

  File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\llama_index\core\query_engine\retriever_query_engine.py:144 in retrieve
    nodes = self._retriever.retrieve(query_bundle)

  File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\llama_index\core\instrumentation\dispatcher.py:102 in wrapper
    self.span_drop(*args, id=id, err=e, **kwargs)

  File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\llama_index\core\instrumentation\dispatcher.py:77 in span_drop
    h.span_drop(*args, id=id, err=err, **kwargs)

  File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\llama_index\core\instrumentation\span_handlers\base.py:47 in span_drop
    self.prepare_to_drop_span(*args, id=id, err=err, **kwargs)

  File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\llama_index\core\instrumentation\span_handlers\null.py:35 in prepare_to_drop_span
    raise err

  File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\llama_index\core\instrumentation\dispatcher.py:100 in wrapper
    result = func(*args, **kwargs)

  File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\llama_index\core\base\base_retriever.py:238 in retrieve
    nodes = self._retrieve(query_bundle)

  File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\llama_index\core\retrievers\auto_merging_retriever.py:173 in _retrieve
    cur_nodes, is_changed = self._try_merging(initial_nodes)

  File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\llama_index\core\retrievers\auto_merging_retriever.py:162 in _try_merging
    nodes, is_changed_1 = self._get_parents_and_merge(nodes)

  File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\llama_index\core\retrievers\auto_merging_retriever.py:64 in _get_parents_and_merge
    parent_node = self._storage_context.docstore.get_document(

  File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\llama_index\core\storage\docstore\keyval_docstore.py:279 in get_document
    raise ValueError(f"doc_id {doc_id} not found.")

ValueError: doc_id 1f7164c7-624b-45c4-b040-13a8ff8a806b not found.

Above is the error I got from the code below:

doc = Document(text=content)
node_parser = HierarchicalNodeParser.from_defaults(chunk_sizes=chunk_size)
nodes = node_parser.get_nodes_from_documents([doc])
nodes = get_leaf_nodes(nodes)
service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model,node_parser=node_parser)
storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)
index = VectorStoreIndex(nodes, storage_context=storage_context, service_context=service_context)
pathname = os.path.join(pathname, document['unique_id'])
index.storage_context.persist(persist_dir=pathname)
postproc = None
reranker = SentenceTransformerRerank(top_n = rr_metadata)
retriever_type, retrieval_metadata = retrieval_args
retriever = index.as_retriever(similarity_top_k=retrieval_metadata)
retriever = AutoMergingRetriever(retriever,index.storage_context,verbose=True)
response_mode = 'tree_summarize'
response_synthesizer = get_response_synthesizer(response_mode=response_mode)
node_postprocessors = [postproc, reranker]
node_postprocessors = [processor for processor in node_postprocessors if processor is not None]
query_engine = RetrieverQueryEngine(retriever, node_postprocessors=node_postprocessors)
summary_whole = query_engine.query(rag_prompt)
dosubot[bot] commented 6 months ago

Hey @GildeshAbhay, nice to see you running through the digital fields of LlamaIndex again! 🌐

Based on the error message you're encountering, it seems like the document with the specified doc_id is not present in the index when you attempt to query it using the query_engine.query(rag_prompt) command. This could be due to several reasons. Here are a few steps you can take to resolve the issue:

  1. Ensure Complete Indexing: Make sure that the indexing process has successfully completed and that all documents have been indexed. The indexing process involves loading documents from the ./data directory, processing them through the pipeline, and then adding them to the index. Any failure or interruption in this process could result in missing documents.

  2. Check Document IDs: Ensure that the document IDs generated during the indexing process match the IDs being queried. It's possible that the document ID 1f7164c7-624b-45c4-b040-13a8ff8a806b does not exist in the index due to an inconsistency in how IDs are assigned or managed.

  3. Review the Data Loader: The SimpleDirectoryReader is responsible for loading documents from the ./data directory. Ensure that this directory contains the expected documents and that the SimpleDirectoryReader is correctly reading and loading these documents into the pipeline.

  4. Debug the Ingestion Pipeline: The IngestionPipeline processes documents before they are indexed. Ensure that the pipeline's transformations, specifically the SentenceSplitter, are not inadvertently removing or altering document IDs in a way that could cause them to be missing or mismatched in the index.

  5. Inspect Index Creation: When creating the VectorStoreIndex, ensure that all nodes (documents after processing) are correctly passed to the index along with the embedding model. Any issues here could result in an incomplete or incorrect index.

  6. Verify Query Execution: Finally, ensure that the doc_id being queried actually exists in the index. You can add debugging statements or logs to verify the contents of the index before querying.
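The check in step 6 boils down to comparing the node IDs the retriever will ask for against the keys actually stored. In LlamaIndex the docstore exposes an ID-to-node mapping (e.g. `docstore.docs`); the helper below is a hypothetical, stdlib-only sketch of that comparison, with made-up IDs:

```python
def find_missing_ids(required_ids, docstore_docs):
    """Return the IDs a retriever would request but the docstore
    does not contain (docstore_docs is an id -> node mapping)."""
    return sorted(set(required_ids) - set(docstore_docs))

# Simulated docstore: only leaf nodes were added, parents are absent.
docs = {"leaf-1": "chunk A", "leaf-2": "chunk B"}
requested = ["leaf-1", "parent-1"]  # AutoMergingRetriever also walks up to parents

print(find_missing_ids(requested, docs))  # -> ['parent-1']
```

Any ID printed here is one that would trigger the `doc_id ... not found` ValueError at query time.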

If after checking these steps the issue persists, it might be helpful to manually inspect the index or add logging to the indexing and querying processes to gather more detailed information about where the discrepancy arises.

You can find more details about these steps in the LlamaIndex repository.

Regarding the doc_id, it is referred to as id_ within the BaseNode class and is a string that is generated by default using the uuid.uuid4() method, which means it is expected to be a universally unique identifier (UUID) in string format. You can find more details about this in the LlamaIndex legacy schema.
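The default ID format can be reproduced with the standard library alone; the snippet below only demonstrates the `uuid.uuid4()` string shape, not LlamaIndex itself:

```python
import uuid

node_id = str(uuid.uuid4())  # same default generator as BaseNode.id_
print(node_id)               # same shape as '1f7164c7-624b-45c4-b040-13a8ff8a806b'

# A v4 UUID string is 36 characters in five hex groups: 8-4-4-4-12.
parts = node_id.split("-")
print([len(p) for p in parts])  # -> [8, 4, 4, 4, 12]
```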

The query_engine.query(rag_prompt) command in the LlamaIndex framework is designed to process a given query (in this context, referred to as rag_prompt) through a defined QueryPipeline. This pipeline is a sequence of processing steps that handle the query to generate a response. The retriever component of the QueryPipeline interacts with the document store. Specifically, it uses a VectorStoreIndex to retrieve relevant documents from the document store based on the query. You can find more details about this in the LlamaIndex legacy command line rag.
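Conceptually, the flow described above reduces to retrieve-then-synthesize. A hypothetical stdlib-only sketch (naive term overlap standing in for vector similarity, string joining standing in for the LLM synthesizer):

```python
def retrieve(query, index):
    """Stand-in for the retriever: score docs by term overlap with the query."""
    q_terms = set(query.lower().split())
    scored = [(len(q_terms & set(text.lower().split())), doc_id)
              for doc_id, text in index.items()]
    scored.sort(reverse=True)
    return [doc_id for score, doc_id in scored if score > 0]

def synthesize(query, doc_ids, index):
    """Stand-in for the response synthesizer: concatenate retrieved text."""
    return " ".join(index[d] for d in doc_ids)

index = {"d1": "llama index stores nodes", "d2": "unrelated text"}
hits = retrieve("how does llama index store nodes", index)
print(hits)  # -> ['d1']
print(synthesize("how does llama index store nodes", hits, index))
```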

I hope this helps! If you have any other questions or need further clarification, feel free to ask.

GildeshAbhay commented 6 months ago

I found the cause of the error: it was due to the get_leaf_nodes step.
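For anyone else hitting this: the docstore needs all hierarchy nodes (parents included), while the index is built from just the leaf nodes; storing only the leaves leaves the parent IDs unresolvable. A stdlib-only sketch of why that fails, with hypothetical IDs mimicking the retriever's parent lookup:

```python
# Hierarchy: parent-1 has children leaf-1 and leaf-2.
all_nodes = {
    "parent-1": {"text": "full section", "parent": None},
    "leaf-1": {"text": "chunk A", "parent": "parent-1"},
    "leaf-2": {"text": "chunk B", "parent": "parent-1"},
}
leaf_only_docstore = {k: v for k, v in all_nodes.items() if v["parent"]}

def merge_parent(node_id, docstore):
    """Mimic the merge step: fails like 'doc_id ... not found'
    when the parent node was never added to the docstore."""
    parent_id = docstore[node_id]["parent"]
    if parent_id not in docstore:
        raise ValueError(f"doc_id {parent_id} not found.")
    return docstore[parent_id]

try:
    merge_parent("leaf-1", leaf_only_docstore)
except ValueError as e:
    print(e)  # -> doc_id parent-1 not found.

# The fix: store every node, index only the leaves.
full_docstore = dict(all_nodes)
print(merge_parent("leaf-1", full_docstore)["text"])  # -> full section
```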

However, now I am getting another problem: I am getting the same RAG output (summary) from two separate documents. Is there a way to check whether the indexes I am creating are the same or different?

logan-markewich commented 6 months ago

You can check response.source_nodes to see the nodes used to create the response.

response.source_nodes[0].text, for example.

I'd need more details to debug further
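To tell whether two queries actually retrieved from different content, one option is to collect the node IDs from each response's source_nodes and diff the sets. In LlamaIndex each source node exposes an ID (e.g. `.node.node_id`); the sketch below uses plain dicts as hypothetical stand-ins:

```python
def source_node_ids(source_nodes):
    """Collect the node IDs from a response's source nodes
    (dicts here stand in for LlamaIndex NodeWithScore objects)."""
    return {n["node_id"] for n in source_nodes}

# Two responses that produced suspiciously similar summaries:
resp_a = [{"node_id": "n1", "text": "..."}, {"node_id": "n2", "text": "..."}]
resp_b = [{"node_id": "n1", "text": "..."}]

ids_a, ids_b = source_node_ids(resp_a), source_node_ids(resp_b)
print(ids_a == ids_b)  # -> False: the two queries retrieved different node sets
print(ids_a & ids_b)   # -> {'n1'}: the overlap explains the similar output
```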