run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

Multi Document Retriever not retrieving all information #14426

Open · VedantR3907 opened 6 days ago

VedantR3907 commented 6 days ago


Question

Is there a way in LlamaIndex to retrieve and show all of the accurate information related to a query when there are many documents? I uploaded at least 50 documents, and when a query has relevant context available in several different documents, it picks the most relevant one and only shows information from that document. Is there a way to get all of the information available for the user's query, so that I get the context from all of the documents?

dosubot[bot] commented 6 days ago

Hey there, @VedantR3907! 👋 I'm here to help you with any bugs, questions, or contributions. I'm a bot designed to assist you while waiting for a human maintainer.

Yes, you can configure LlamaIndex to retrieve and show more context related to a query from all available documents by adjusting the similarity_top_k parameter. This parameter controls the number of top similar nodes to retrieve.

Here is an example of how to configure it:

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

# Load documents from the "data" directory and build a vector index
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Retrieve the top 5 most similar nodes instead of the default of 2
query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query("What did the author do growing up?")
print(response)

In this example, the similarity_top_k=5 argument configures the retriever to return the top 5 most similar nodes instead of the default of 2.
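To verify how many sources actually feed an answer, you can inspect response.source_nodes or call the retriever directly. A small sketch reusing the index built above (the file_name metadata key is what SimpleDirectoryReader attaches by default):

# Inspect the nodes that backed the response
for node_with_score in response.source_nodes:
    print(node_with_score.score, node_with_score.node.metadata.get("file_name"))

# Or bypass response synthesis and query the retriever directly
retriever = index.as_retriever(similarity_top_k=10)
nodes = retriever.retrieve("What did the author do growing up?")
print(len(nodes))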

Additionally, you can use the CitationQueryEngine when retrieving from multiple documents, so that each piece of information in the response is attributed to its source:

from llama_index.core.query_engine import CitationQueryEngine

# Build a citation query engine from the index; from_args wires up the
# retriever and the default citation prompts, and extra kwargs such as
# similarity_top_k are forwarded to the underlying retriever
citation_query_engine = CitationQueryEngine.from_args(
    index,
    similarity_top_k=5,
    citation_chunk_size=512,
    citation_chunk_overlap=20,
)

# Perform a query
response = citation_query_engine.query("Tell me about some open issues related to agents")
print(response)

This setup allows you to retrieve and combine information from multiple documents, providing a more comprehensive response to your queries [1][2][3][4][5].

To continue talking to Dosu, mention @dosu.

VedantR3907 commented 6 days ago

Thanks for the reply. I already tried adjusting the similarity_top_k value, but the results are still generated from the document with the most relevance, and the information from the other docs is skipped.

dosubot[bot] commented 6 days ago

If you use a DocumentSummaryIndex, you can adjust the choice_top_k parameter of the DocumentSummaryIndexLLMRetriever class so that the query retrieves and combines information from more of the relevant documents, not just the most relevant one. This parameter controls the number of document summary nodes to retrieve; increasing it pulls in more documents.

Here is the relevant code snippet:

class DocumentSummaryIndexLLMRetriever(BaseRetriever):
    """Document Summary Index LLM Retriever.

    By default, select relevant summaries from index using LLM calls.

    Args:
        index (DocumentSummaryIndex): The index to retrieve from.
        choice_select_prompt (Optional[BasePromptTemplate]): The prompt to use for selecting relevant summaries.
        choice_batch_size (int): The number of summary nodes to send to LLM at a time.
        choice_top_k (int): The number of summary nodes to retrieve.
        format_node_batch_fn (Callable): Function to format a batch of nodes for LLM.
        parse_choice_select_answer_fn (Callable): Function to parse LLM response.
        service_context (ServiceContext): The service context to use.
    """

    def __init__(
        self,
        index: DocumentSummaryIndex,
        choice_select_prompt: Optional[BasePromptTemplate] = None,
        choice_batch_size: int = 10,
        choice_top_k: int = 1,
        format_node_batch_fn: Optional[Callable] = None,
        parse_choice_select_answer_fn: Optional[Callable] = None,
        service_context: Optional[ServiceContext] = None,
        callback_manager: Optional[CallbackManager] = None,
        object_map: Optional[dict] = None,
        verbose: bool = False,
        **kwargs: Any,
    ) -> None:
        self._index = index
        self._choice_select_prompt = (
            choice_select_prompt or DEFAULT_CHOICE_SELECT_PROMPT
        )
        self._choice_batch_size = choice_batch_size
        self._choice_top_k = choice_top_k
        self._format_node_batch_fn = (
            format_node_batch_fn or default_format_node_batch_fn
        )
        self._parse_choice_select_answer_fn = (
            parse_choice_select_answer_fn or default_parse_choice_select_answer_fn
        )
        self._service_context = service_context or index.service_context
        super().__init__(
            callback_manager=callback_manager, object_map=object_map, verbose=verbose
        )

To retrieve and combine information from more documents, you can set choice_top_k to a higher value when initializing the DocumentSummaryIndexLLMRetriever:

retriever = DocumentSummaryIndexLLMRetriever(
    index=my_index,
    choice_top_k=5  # Increase this value to retrieve more documents
)
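For completeness, a minimal sketch of constructing the DocumentSummaryIndex that my_index refers to (building it triggers one LLM summarization call per document):

from llama_index.core import DocumentSummaryIndex, SimpleDirectoryReader
from llama_index.core.indices.document_summary import DocumentSummaryIndexLLMRetriever

# Each document gets an LLM-generated summary at build time
documents = SimpleDirectoryReader("data").load_data()
my_index = DocumentSummaryIndex.from_documents(documents)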

You can then call retrieve on the retriever directly to confirm which nodes come back before any response synthesis:

nodes = retriever.retrieve("List 2 types of vegetables and how can I grow them")
for node_with_score in nodes:
    print(node_with_score.node.get_content()[:200])

This will ensure that the retriever considers more documents, not just the most relevant one, and combines information from multiple sources [1][2][3][4].

To continue talking to Dosu, mention @dosu.

VedantR3907 commented 5 days ago

I tried it and the results are still the same. Is there any other way to get each and every piece of information from the documents? The nodes extracted are really great, but in the final response the information is getting removed.

dosubot[bot] commented 5 days ago

To include more of the information from multiple documents in the final response, you can use multi-document query techniques and retrieval packs such as the DenseXRetrievalPack. These tools help retrieve and combine information from multiple documents more comprehensively.

Here is an example using the DenseXRetrievalPack, which uses an LLM to extract fine-grained propositions from every document and indexes those:

from llama_index.core import SimpleDirectoryReader
from llama_index.packs.dense_x_retrieval import DenseXRetrievalPack

# Load documents
documents = SimpleDirectoryReader("./data").load_data()

# Use the LLM to extract propositions from every document/node
dense_pack = DenseXRetrievalPack(documents)

# Run the pack
response = dense_pack.run("Physical Standards for Letters?")
print(response)
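Note that retrieval packs ship as separate packages, so this example assumes the pack has been installed first, e.g. with pip install llama-index-packs-dense-x-retrieval.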

Additionally, if you are orchestrating queries through LangChain, you can wrap a query engine in the LlamaIndexTool from the LangChain bridge:

from typing import Dict

from llama_index.core.base.base_query_engine import BaseQueryEngine
from llama_index.core.bridge.langchain import BaseTool
from llama_index.core.bridge.pydantic import BaseModel, Field

class IndexToolConfig(BaseModel):
    query_engine: BaseQueryEngine
    name: str
    description: str
    tool_kwargs: Dict = Field(default_factory=dict)

class LlamaIndexTool(BaseTool):
    query_engine: BaseQueryEngine
    return_sources: bool = False

    @classmethod
    def from_tool_config(cls, tool_config: IndexToolConfig) -> "LlamaIndexTool":
        return_sources = tool_config.tool_kwargs.pop("return_sources", False)
        return cls(
            query_engine=tool_config.query_engine,
            name=tool_config.name,
            description=tool_config.description,
            return_sources=return_sources,
            **tool_config.tool_kwargs,
        )

    def _run(self, input: str) -> str:
        response = self.query_engine.query(input)
        if self.return_sources:
            # _get_response_with_sources is a helper defined alongside this
            # class in the bridge module; it appends source info to the text
            return _get_response_with_sources(response)
        return str(response)

    async def _arun(self, input: str) -> str:
        response = await self.query_engine.aquery(input)
        if self.return_sources:
            return _get_response_with_sources(response)
        return str(response)
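A hypothetical usage sketch (the tool name and description below are illustrative, and query_engine is any engine built as above):

tool_config = IndexToolConfig(
    query_engine=query_engine,
    name="document_search",
    description="Searches across all indexed documents",
)
tool = LlamaIndexTool.from_tool_config(tool_config)
print(tool.run("Physical Standards for Letters?"))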

By using these tools, you can ensure that all relevant information from multiple documents is included in the final response [1][2][3][4].

To continue talking to Dosu, mention @dosu.

VedantR3907 commented 5 days ago

I am using Pinecone as a vector store. Is there a way to get it from there?

dosubot[bot] commented 5 days ago

When using Pinecone as the vector store with LlamaIndex, make sure the include_metadata and include_values parameters are set to True during the query, so that both the metadata and the embeddings of the matched documents are included in the response (the built-in PineconeVectorStore already does this).

Here is the relevant code snippet from the PineconeVectorStore class:

response = self._pinecone_index.query(
    vector=query_embedding,
    sparse_vector=sparse_vector,
    top_k=query.similarity_top_k,
    include_values=True,
    include_metadata=True,
    namespace=self.namespace,
    filter=filter,
    **kwargs,
)

top_k_nodes = []
top_k_ids = []
top_k_scores = []
for match in response.matches:
    try:
        node = metadata_dict_to_node(match.metadata)
        node.embedding = match.values
    except Exception:
        _logger.debug("Failed to parse Node metadata, fallback to legacy logic.")
        metadata, node_info, relationships = legacy_metadata_dict_to_node(
            match.metadata, text_key=self.text_key
        )
        text = match.metadata[self.text_key]
        id = match.id
        node = TextNode(
            text=text,
            id_=id,
            metadata=metadata,
            start_char_idx=node_info.get("start", None),
            end_char_idx=node_info.get("end", None),
            relationships=relationships,
        )
    top_k_ids.append(match.id)
    top_k_nodes.append(node)
    top_k_scores.append(match.score)

return VectorStoreQueryResult(
    nodes=top_k_nodes, similarities=top_k_scores, ids=top_k_ids
)
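On the user side, a minimal sketch of wiring an existing Pinecone index into LlamaIndex looks like this (the API key and index name are placeholders):

from pinecone import Pinecone
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.pinecone import PineconeVectorStore

# Connect to the existing Pinecone index (placeholder credentials)
pc = Pinecone(api_key="YOUR_API_KEY")
pinecone_index = pc.Index("your-index-name")

# Wrap it as a LlamaIndex vector store and query with a larger top-k
vector_store = PineconeVectorStore(pinecone_index=pinecone_index)
index = VectorStoreIndex.from_vector_store(vector_store)
query_engine = index.as_query_engine(similarity_top_k=10)
response = query_engine.query("your question here")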

Additionally, you can enhance the retrieval process by setting up auto-retrieval with Pinecone and LlamaIndex. This involves defining a VectorIndexAutoRetriever and running sample queries to ensure that all relevant information is captured. You can also customize the auto-retrieval prompt and implement dynamic metadata retrieval to improve the accuracy and completeness of the information retrieved.

Here is an example of setting up auto-retrieval:

from llama_index.core.retrievers import VectorIndexAutoRetriever
from llama_index.core.vector_stores import MetadataInfo, VectorStoreInfo

vector_store_info = VectorStoreInfo(
    content_info="famous books and movies",
    metadata_info=[
        MetadataInfo(
            name="director",
            type="str",
            description=("Name of the director"),
        ),
        MetadataInfo(
            name="theme",
            type="str",
            description=("Theme of the book/movie"),
        ),
        MetadataInfo(
            name="year",
            type="int",
            description=("Year of the book/movie"),
        ),
    ],
)
retriever = VectorIndexAutoRetriever(
    index,
    vector_store_info=vector_store_info,
    empty_query_top_k=10,
    default_empty_query_vector=[0] * 1536,
    verbose=True,
)
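A quick usage sketch against this retriever (the query is illustrative, matching the toy books/movies schema above):

nodes = retriever.retrieve("movies about friendship")
for node_with_score in nodes:
    print(node_with_score.node.metadata, node_with_score.score)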

By setting include_values and include_metadata to True and using auto-retrieval techniques, you ensure that the query response includes all the necessary information from the relevant documents [1][2][3].

To continue talking to Dosu, mention @dosu.