run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: ValueError: doc_id not found. #12115

Closed · vecorro closed 8 months ago

vecorro commented 8 months ago

Bug Description

I'm trying to implement an AutoMergingRetriever, but when I submit a query I get a ValueError: doc_id 03ea05ed-3d9b-4edb-b4b8-43326224cf69 not found.

Version

0.10.20.post2

Steps to Reproduce

import ipywidgets as widgets
widgets.IntSlider()

import sys
import psycopg2

from sqlalchemy import make_url
from pprint import pprint

from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.postgres import PGVectorStore
from llama_index.core import (
    SimpleDirectoryReader,
    StorageContext,
)
from llama_index.readers.file import PyMuPDFReader
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings
from llama_index.llms.openai_like import OpenAILike
from llama_index.core.node_parser import (
    HierarchicalNodeParser,
    get_leaf_nodes,
)
from llama_index.core.retrievers import AutoMergingRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.indices.postprocessor import SentenceTransformerRerank

# LLM service settings
DEFAULT_LLM_MODEL = "WizardLM/WizardLM-70B-V1.0" 
DEFAULT_LLM_API_BASE = "http://localhost:8010/v1"
DEFAULT_LLM_API_KEY = "NO_KEY"
GEN_TEMP=0.1
MAX_TOKENS=512
REP_PENALTY=1.03

# Ingestion pipeline settings
NUM_WORKERS = 4
CHUNK_SIZE = 1024
MIN_DOC_LENGTH = 40 # Min number of words per doc
PDF_FILES_PATH = "../doc_collections"

# >> LLamaIndex settings

# LLamaIndex embedding model
EMB_MODEL="BAAI/bge-base-en-v1.5"
DEVICE="cuda:0"
Settings.embed_model = HuggingFaceEmbedding(
    model_name=EMB_MODEL,
    device=DEVICE
)
EMBEDDING_SIZE = len(Settings.embed_model.get_text_embedding("hi"))

# LLamaIndex LLM provider
Settings.llm = OpenAILike(
    model=DEFAULT_LLM_MODEL,
    api_key=DEFAULT_LLM_API_KEY,
    api_base=DEFAULT_LLM_API_BASE,
    temperature=GEN_TEMP,
    max_tokens=MAX_TOKENS,
    repetition_penalty=REP_PENALTY,
)

# Loading documents

filename_fn = lambda filename: {"file_name": filename.split("/")[-1]}

reader = SimpleDirectoryReader(
    input_dir=PDF_FILES_PATH,
    required_exts=[".pdf"],
    file_extractor={".pdf":PyMuPDFReader()},
    file_metadata=filename_fn,
    num_files_limit=10,
)
documents = reader.load_data()

node_parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[2048, 512, 128]
)
nodes = node_parser.get_nodes_from_documents(
    documents=documents,
    show_progress=True,
)
leaf_nodes = get_leaf_nodes(nodes)

re_ranker = SentenceTransformerRerank(
    top_n=6,
    model="BAAI/bge-reranker-base",
)

# PGVector store setup 
DB_PORT = 5432
DB_USER = "demouser"
DB_PASSWD = "demopasswd"
DEFAULT_DB = "postgres"
DB_NAME = "vectordb"
DB_HOST = "localhost"
TABLE_NAME = "HISTORY_BOOKS_AUTO_MERGING_INDEX"

connection_string = f"postgresql://{DB_USER}:{DB_PASSWD}@{DB_HOST}:{DB_PORT}/{DEFAULT_DB}"
url = make_url(connection_string)

conn = psycopg2.connect(connection_string)
cursor = conn.cursor()
sql = f"DROP TABLE IF EXISTS {TABLE_NAME}"
cursor.execute(sql)
print(f"Table {TABLE_NAME} dropped")
conn.commit()
conn.close()

vector_store = PGVectorStore.from_params(
    database=DB_NAME,
    host=url.host,
    password=url.password,
    port=url.port,
    user=url.username,
    table_name=TABLE_NAME,
    embed_dim=EMBEDDING_SIZE, # embedding model dimension
    cache_ok=True,
    hybrid_search=True,
)

storage_context = StorageContext.from_defaults(
    vector_store=vector_store,
)

auto_merging_index = VectorStoreIndex(
    nodes=leaf_nodes,
    storage_context=storage_context,
    show_progress=True,
    transformations=None,
)

auto_merging_idx_as_retriever = auto_merging_index.as_retriever(
    similarity_top_k=12
)

retriever = AutoMergingRetriever(
    vector_retriever=auto_merging_idx_as_retriever,
    storage_context=auto_merging_index.storage_context,
    verbose=True
)

auto_merging_engine = RetrieverQueryEngine.from_args(
    retriever=retriever, 
    node_postprocessors=[re_ranker],
)

question = "What are the main Hubble telescope discoveries about exoplanets?"
print(f" > Question: {question}")
response = auto_merging_engine.query(question)
print(f" > Response:\n", response.response)

ValueError: doc_id 03ea05ed-3d9b-4edb-b4b8-43326224cf69 not found.

Relevant Logs/Tracebacks

--------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File <timed exec>:4

File ~/miniconda3/envs/llm-env4/lib/python3.10/site-packages/llama_index/core/base/base_query_engine.py:40, in BaseQueryEngine.query(self, str_or_query_bundle)
     38 if isinstance(str_or_query_bundle, str):
     39     str_or_query_bundle = QueryBundle(str_or_query_bundle)
---> 40 return self._query(str_or_query_bundle)

File ~/miniconda3/envs/llm-env4/lib/python3.10/site-packages/llama_index/core/query_engine/retriever_query_engine.py:186, in RetrieverQueryEngine._query(self, query_bundle)
    182 """Answer a query."""
    183 with self.callback_manager.event(
    184     CBEventType.QUERY, payload={EventPayload.QUERY_STR: query_bundle.query_str}
    185 ) as query_event:
--> 186     nodes = self.retrieve(query_bundle)
    187     response = self._response_synthesizer.synthesize(
    188         query=query_bundle,
    189         nodes=nodes,
    190     )
    192     query_event.on_end(payload={EventPayload.RESPONSE: response})

File ~/miniconda3/envs/llm-env4/lib/python3.10/site-packages/llama_index/core/query_engine/retriever_query_engine.py:142, in RetrieverQueryEngine.retrieve(self, query_bundle)
    141 def retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
--> 142     nodes = self._retriever.retrieve(query_bundle)
    143     return self._apply_node_postprocessors(nodes, query_bundle=query_bundle)

File ~/miniconda3/envs/llm-env4/lib/python3.10/site-packages/llama_index/core/base/base_retriever.py:229, in BaseRetriever.retrieve(self, str_or_query_bundle)
    224 with self.callback_manager.as_trace("query"):
    225     with self.callback_manager.event(
    226         CBEventType.RETRIEVE,
    227         payload={EventPayload.QUERY_STR: query_bundle.query_str},
    228     ) as retrieve_event:
--> 229         nodes = self._retrieve(query_bundle)
    230         nodes = self._handle_recursive_retrieval(query_bundle, nodes)
    231         retrieve_event.on_end(
    232             payload={EventPayload.NODES: nodes},
    233         )

File ~/miniconda3/envs/llm-env4/lib/python3.10/site-packages/llama_index/core/retrievers/auto_merging_retriever.py:173, in AutoMergingRetriever._retrieve(self, query_bundle)
    166 """Retrieve nodes given query.
    167 
    168 Implemented by the user.
    169 
    170 """
    171 initial_nodes = self._vector_retriever.retrieve(query_bundle)
--> 173 cur_nodes, is_changed = self._try_merging(initial_nodes)
    174 # cur_nodes, is_changed = self._get_parents_and_merge(initial_nodes)
    175 while is_changed:

File ~/miniconda3/envs/llm-env4/lib/python3.10/site-packages/llama_index/core/retrievers/auto_merging_retriever.py:162, in AutoMergingRetriever._try_merging(self, nodes)
    160 nodes, is_changed_0 = self._fill_in_nodes(nodes)
    161 # then try merging nodes
--> 162 nodes, is_changed_1 = self._get_parents_and_merge(nodes)
    163 return nodes, is_changed_0 or is_changed_1

File ~/miniconda3/envs/llm-env4/lib/python3.10/site-packages/llama_index/core/retrievers/auto_merging_retriever.py:64, in AutoMergingRetriever._get_parents_and_merge(self, nodes)
     62 parent_node_id = parent_node_info.node_id
     63 if parent_node_id not in parent_nodes:
---> 64     parent_node = self._storage_context.docstore.get_document(
     65         parent_node_id
     66     )
     67     parent_nodes[parent_node_id] = cast(BaseNode, parent_node)
     69 # add reference to child from parent

File ~/miniconda3/envs/llm-env4/lib/python3.10/site-packages/llama_index/core/storage/docstore/keyval_docstore.py:262, in KVDocumentStore.get_document(self, doc_id, raise_error)
    260 if json is None:
    261     if raise_error:
--> 262         raise ValueError(f"doc_id {doc_id} not found.")
    263     else:
    264         return None

ValueError: doc_id 03ea05ed-3d9b-4edb-b4b8-43326224cf69 not found.
dosubot[bot] commented 8 months ago

Hey @vecorro, nice to see you diving into the depths of LlamaIndex again! What's brewing this time?

The ValueError: doc_id not found error you're encountering when using the AutoMergingRetriever in LlamaIndex version 0.10.20.post2 is likely not directly related to the code snippet you've provided for the VectorIndexAutoRetriever class. This error typically indicates an issue with accessing or retrieving a document by its ID from the underlying data store or index.

Given the context and the code snippet, here are a few potential reasons and areas to investigate:

  1. Document ID Misalignment: Ensure that the document IDs referenced in your queries exist in the vector store index. It's possible that the document ID being queried does not match any document in the index, leading to this error. (A quick probe is sketched after this list.)

  2. Index Synchronization Issues: If documents were recently added or removed from the index, ensure that the index is properly synchronized and updated. There might be a lag or issue in the synchronization process, causing discrepancies between the expected and actual document IDs in the index.

  3. Incorrect Query Specification: Review the logic in the _parse_generated_spec and _build_retriever_from_spec methods to ensure that the document ID is being correctly parsed and used in queries. Any issues in how the query spec is generated or interpreted could lead to attempts to access non-existent document IDs.

  4. Error in VectorIndexRetriever or Underlying Components: While the provided code snippet focuses on the VectorIndexAutoRetriever, the actual retrieval and document ID resolution happen deeper in the call stack, possibly within the VectorIndexRetriever or even further down. Review the implementation of these components to ensure they correctly handle document IDs and queries.
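
One way to confirm point 1 directly: KVDocumentStore.get_document (visible at the bottom of the traceback above) accepts raise_error=False, so you can probe the docstore for the failing ID without raising. A minimal sketch, reusing the storage_context and the doc_id from the traceback:

# Probe the docstore attached to the retriever for the missing node id
missing_id = "03ea05ed-3d9b-4edb-b4b8-43326224cf69"  # id taken from the traceback
doc = storage_context.docstore.get_document(missing_id, raise_error=False)
print("found in docstore" if doc is not None else "not in docstore")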

Without more specific details on how the AutoMergingRetriever is implemented or how it interacts with the VectorIndexAutoRetriever and the underlying index, these are the most likely areas to investigate based on the provided context.

I hope this helps! If you have any more questions or need further clarification, feel free to ask.

#### About Dosu

This response is meant to be useful and save you time. It is **not meant to be a precise solution**, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant. If you want to continue the conversation, start your reply with @dosu-bot.
logan-markewich commented 8 months ago

The docstore is not used when using a vector db integration. I suggest populating one manually:

docstore.add_documents(nodes)
docstore.add_documents(leaf_nodes)

and passing it in AutoMergingRetriever(..., storage_context=StorageContext.from_defaults(docstore=docstore)).
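
Put together, a minimal sketch of that fix, assuming the default in-memory SimpleDocumentStore and the variable names from the repro script above:

from llama_index.core import StorageContext
from llama_index.core.storage.docstore import SimpleDocumentStore

# Persist parent and leaf nodes so AutoMergingRetriever can resolve parent ids
docstore = SimpleDocumentStore()
docstore.add_documents(nodes)       # all hierarchy levels, parents included
docstore.add_documents(leaf_nodes)  # re-adding leaves is harmless (allow_update defaults to True)

retriever = AutoMergingRetriever(
    vector_retriever=auto_merging_idx_as_retriever,
    storage_context=StorageContext.from_defaults(
        docstore=docstore,
        vector_store=vector_store,  # keep the same PGVectorStore for embeddings
    ),
    verbose=True,
)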

There is a Postgres docstore as well:

pip install llama-index-storage-docstore-postgres
from llama_index.storage.docstore.postgres import PostgresDocumentStore

https://github.com/run-llama/llama_index/blob/2ba13544cd2583418cbeade5bea45ff1da7bb7b8/llama-index-integrations/storage/docstore/llama-index-storage-docstore-postgres/llama_index/storage/docstore/postgres/base.py#L8
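
If the node hierarchy should persist alongside the vectors instead of living in memory, the same pattern works with the Postgres-backed docstore. A sketch under the assumption that PostgresDocumentStore.from_uri accepts a standard connection string and a table name (check the linked source for the exact constructor arguments):

from llama_index.storage.docstore.postgres import PostgresDocumentStore

# Assumed constructor: a SQLAlchemy-style URI plus a table name for the docstore rows
docstore = PostgresDocumentStore.from_uri(
    uri=f"postgresql://{DB_USER}:{DB_PASSWD}@{DB_HOST}:{DB_PORT}/{DB_NAME}",
    table_name="auto_merging_docstore",  # hypothetical table name
)
docstore.add_documents(nodes)  # then build the StorageContext as in the sketch above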