run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: How to use the same nodes for dense_x a second time. Please provide a standard solution to resolve this bug #14336

Open LikhithRishi opened 3 weeks ago

LikhithRishi commented 3 weeks ago

Bug Description

Getting an error when using the ChromaDB embeddings for the second time, while entering a query (user input question) to generate a response:

Query id 40b9d01f-b211-413d-b3d4-a799eff700d6 not found in either `retriever_dict` or `query_engine_dict`.

How to resolve this error?

Detailed explanation: The first time, with new ChromaDB embeddings generated, it executes successfully end to end, taking the user's input question and generating a response from the LLM. But when we stop the execution and re-execute it, we face this query ID error.

Version

llama-index==0.10.12

Steps to Reproduce

```python
# (Imports omitted in the original report.)
class DenseXRetrievalPack(BaseLlamaPack):
    def __init__(
        self,
        documents: List[Document],
        proposition_llm: Optional[LLM] = None,
        query_llm: Optional[LLM] = None,
        embed_model: Optional[BaseEmbedding] = None,
        text_splitter: TextSplitter = SentenceSplitter(),
        vector_store: Optional[ElasticsearchStore] = None,
        similarity_top_k: int = 4,
    ) -> None:
        """Init params."""
        self._proposition_llm = proposition_llm

        embed_model = embed_model

        nodes = text_splitter.get_nodes_from_documents(documents)
        print(nodes)
        sub_nodes = self._gen_propositions(nodes)
        print(sub_nodes, "greg")
        all_nodes = nodes + sub_nodes
        all_nodes_dict = {n.node_id: n for n in all_nodes}

        service_context = ServiceContext.from_defaults(
            llm=query_llm,
            embed_model=embed_model,
            num_output=self._proposition_llm.metadata.num_output,
        )

        # Earlier Elasticsearch variant, left commented out:
        # if os.path.exists('./elastic_db'):
        #     print("From elasticsearch")
        #     self.vector_index = VectorStoreIndex.from_vector_store(
        #         vector_store, service_context=service_context
        #     )
        # else:
        #     storage_context = StorageContext.from_defaults(vector_store=vector_store)
        #     self.vector_index = VectorStoreIndex(
        #         all_nodes,
        #         service_context=service_context,
        #         show_progress=True,
        #         storage_context=storage_context,
        #     )
        #     os.mkdir("elastic_db")

        if os.path.exists("./chroma_db"):
            chroma_client = chromadb.PersistentClient(path="./chroma_db")
            chroma_collection = chroma_client.get_or_create_collection("quickstart")
            vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
            storage_context = StorageContext.from_defaults(vector_store=vector_store)
            self.vector_index = VectorStoreIndex.from_vector_store(
                vector_store, service_context=service_context
            )
        else:
            chroma_client = chromadb.PersistentClient(path="./chroma_db")
            chroma_collection = chroma_client.get_or_create_collection("quickstart")
            vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
            storage_context = StorageContext.from_defaults(vector_store=vector_store)
            self.vector_index = VectorStoreIndex(
                all_nodes,
                service_context=service_context,
                show_progress=True,
                storage_context=storage_context,
                store_nodes_override=True,
            )

        self.retriever = RecursiveRetriever(
            "vector",
            retriever_dict={
                "vector": self.vector_index.as_retriever(
                    similarity_top_k=similarity_top_k
                )
            },
            node_dict=all_nodes_dict,
        )

        self.query_engine = RetrieverQueryEngine.from_args(
            self.retriever, service_context=service_context
        )

    async def _aget_proposition(self, node: TextNode) -> List[TextNode]:
        """Get proposition."""
        inital_output = await self._proposition_llm.apredict(
            PROPOSITIONS_PROMPT, node_text=node.text
        )
        outputs = inital_output.split("\n")

        all_propositions = []

        for output in outputs:
            if not output.strip():
                continue
            if not output.strip().endswith("]"):
                if not output.strip().endswith('"') and not output.strip().endswith(
                    ","
                ):
                    output = output + '"'
                output = output + " ]"
            if not output.strip().startswith("["):
                if not output.strip().startswith('"'):
                    output = '"' + output
                output = "[ " + output

            try:
                propositions = json.loads(output)
            except Exception:
                # fallback to yaml
                try:
                    propositions = yaml.safe_load(output)
                except Exception:
                    # fallback to next output
                    continue

            if not isinstance(propositions, list):
                continue

            all_propositions.extend(propositions)

        assert isinstance(all_propositions, list)
        nodes = [TextNode(text=prop) for prop in all_propositions if prop]

        return [IndexNode.from_text_node(n, node.node_id) for n in nodes]

    def _gen_propositions(self, nodes: List[TextNode]) -> List[TextNode]:
        """Get propositions."""
        sub_nodes = asyncio.run(
            run_jobs(
                [self._aget_proposition(node) for node in nodes],
                show_progress=True,
                workers=8,
            )
        )
        print(sub_nodes)

        # Flatten list
        return [node for sub_node in sub_nodes for node in sub_node]

    def get_modules(self) -> Dict[str, Any]:
        """Get modules."""
        return {
            "query_engine": self.query_engine,
            "retriever": self.retriever,
        }

    def run(self, query_str: str, **kwargs: Any) -> RESPONSE_TYPE:
        """Run the pipeline."""
        return self.query_engine.query(query_str)
```
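
For reference, a minimal sketch of how this pack might be instantiated and run (the data path and model choices below are hypothetical and not part of the original report):

```python
from llama_index.core import SimpleDirectoryReader
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

# Load documents and build the pack; the first run generates propositions
# and persists embeddings under ./chroma_db.
documents = SimpleDirectoryReader("./data").load_data()
pack = DenseXRetrievalPack(
    documents,
    proposition_llm=OpenAI(model="gpt-3.5-turbo"),
    query_llm=OpenAI(model="gpt-3.5-turbo"),
    embed_model=OpenAIEmbedding(),
)

# Ask a question through the recursive retriever + query engine.
response = pack.run("What is this document about?")
print(response)
```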

Relevant Logs/Tracebacks

No response

dosubot[bot] commented 3 weeks ago

To resolve the error "Query id 40b9d01f-b211-413d-b3d4-a799eff700d6 not found in either retriever_dict or query_engine_dict" and use the same nodes for dense_x the second time, follow these steps:

  1. Ensure query_id is in retriever_dict or query_engine_dict: Make sure that the query_id is included in either the retriever_dict or query_engine_dict when initializing the RecursiveRetriever.

  2. Check Initialization: Verify that the retriever_dict and query_engine_dict are correctly populated with the necessary retrievers and query engines.

Here is an example of how you might initialize the RecursiveRetriever:

```python
retriever_dict = {
    "vector": self.vector_index.as_retriever(similarity_top_k=similarity_top_k),
    # Add other retrievers as needed
}

query_engine_dict = {
    # Add query engines if needed
}

self.retriever = RecursiveRetriever(
    "vector",
    retriever_dict=retriever_dict,
    node_dict=all_nodes_dict,
)
```
  3. Using the Same Nodes for dense_x the Second Time: If you want to use the same nodes for dense_x the second time, you need to store the nodes retrieved during the first query and reuse them. Here is an example of how you might do this:
```python
# First retrieval
query_bundle = QueryBundle(query_str="your query")
retrieved_nodes, additional_nodes = self.retriever.retrieve_all(query_bundle)

# Store the nodes for later use
stored_nodes = retrieved_nodes

# Use the same nodes for dense_x the second time
dense_x_nodes = stored_nodes
```

By ensuring the query_id is present in the dictionaries and storing the nodes for reuse, you can resolve the error and use the same nodes for dense_x the second time.
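
For the re-execution case specifically, one option is to persist the proposition nodes from the first run alongside the Chroma index, so that all_nodes_dict can be rebuilt on the second run without regenerating propositions. A minimal sketch, assuming a SimpleDocumentStore and a hypothetical persist path under ./chroma_db:

```python
from llama_index.core.storage.docstore import SimpleDocumentStore

# First run: save the generated nodes next to the Chroma index
# (hypothetical path; adjust as needed).
docstore = SimpleDocumentStore()
docstore.add_documents(all_nodes)
docstore.persist("./chroma_db/docstore.json")

# Second run: rebuild the node dict from the persisted docstore
# instead of regenerating propositions.
docstore = SimpleDocumentStore.from_persist_path("./chroma_db/docstore.json")
all_nodes_dict = dict(docstore.docs)
```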

Additionally, the RecursiveRetriever class in the LlamaIndex framework has a method _get_object that fetches the retriever or query engine based on the query_id. If the query_id is not found in either retriever_dict or query_engine_dict, it raises a ValueError. Ensure that your query_id is correctly mapped in these dictionaries to avoid this error [1][2].
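
Conceptually, that lookup behaves like the following simplified sketch (for illustration only; not the actual library source):

```python
def _get_object(query_id, retriever_dict, query_engine_dict):
    """Return the retriever or query engine registered for query_id."""
    if query_id in retriever_dict:
        return retriever_dict[query_id]
    if query_id in query_engine_dict:
        return query_engine_dict[query_id]
    raise ValueError(
        f"Query id {query_id} not found in either retriever_dict or query_engine_dict."
    )
```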

To continue talking to Dosu, mention @dosu.