run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai

[Question]: How to ask questions on particular document in VectorDB? #15844

Open Neethan54 opened 2 months ago

Neethan54 commented 2 months ago

Question

Hello all,

How can we filter by a particular document name and do Q&A only on that document in LlamaIndex? I'm currently creating my retriever with the code below:

retriever = index.as_retriever()

Please let me know, is there any option in LlamaIndex for this kind of filtering?

dosubot[bot] commented 2 months ago

To filter by a particular document name and perform Q&A only on that document in LlamaIndex using the retriever, you can modify the retrieval process to include a filter based on the document name. Here is how you can do it:

  1. Load the specific document: Ensure you load the document you want to filter by.
  2. Filter nodes by document name: When querying, filter the nodes to include only those from the specified document.

Here is the modified code to achieve this:

# Assuming you have already loaded your documents, parsed them into `nodes`,
# and set up `embed_model`, `vector_store`, `llm`, and the custom
# VectorDBRetriever class as shown in the provided content
from llama_index.core.query_engine import RetrieverQueryEngine

# Filter nodes by document name (stored in each node's metadata)
document_name = "llama2.pdf"  # Replace with your document name
filtered_nodes = [
    node for node in nodes if node.metadata.get("file_name") == document_name
]

# Generate embeddings for the filtered nodes only
for node in filtered_nodes:
    node.embedding = embed_model.get_text_embedding(
        node.get_content(metadata_mode="all")
    )

# Load only the filtered nodes into the vector store
vector_store.add(filtered_nodes)

# Define the retriever over the filtered vector store
retriever = VectorDBRetriever(
    vector_store, embed_model, query_mode="default", similarity_top_k=2
)

# Plug the retriever into a RetrieverQueryEngine to synthesize a response
query_engine = RetrieverQueryEngine.from_args(retriever, llm=llm)

# Perform the query
query_str = "How does Llama 2 perform compared to other open-source models?"
response = query_engine.query(query_str)

print(str(response))
print(response.source_nodes[0].get_content())

In this code, nodes are filtered on their file_name metadata field before being embedded and added to the vector store, so the retriever only ever searches chunks from the specified document.
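
If your index already contains every document, you don't need to rebuild it: LlamaIndex also supports metadata filters on the default retriever. Below is a minimal sketch (not part of the original answer) using MetadataFilters with as_retriever, assuming your nodes carry a file_name metadata key; whether the filter is applied natively depends on your vector store integration:

from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters

# Restrict retrieval to nodes whose file_name metadata matches the target document
filters = MetadataFilters(
    filters=[ExactMatchFilter(key="file_name", value="llama2.pdf")]
)

# as_retriever forwards the filters to the underlying vector store query
retriever = index.as_retriever(filters=filters, similarity_top_k=2)
query_engine = RetrieverQueryEngine.from_args(retriever, llm=llm)
response = query_engine.query(query_str)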

Additionally, you can use "document agents" for more advanced decision-making over heterogeneous documents: a separate agent is created for each document, and each agent can dynamically choose between semantic search and summarization within its document. Here is an example of how to set up document agents:

from llama_index.core import SummaryIndex, VectorStoreIndex
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.llms.openai import OpenAI
from llama_index.agent.openai import OpenAIAgent

# Build agents dictionary
agents = {}

# Assuming city_docs is a dictionary mapping each document name to a list
# of Document objects for that document
for doc_name, doc_content in city_docs.items():
    # build vector index
    vector_index = VectorStoreIndex.from_documents(doc_content)
    # build summary index
    summary_index = SummaryIndex.from_documents(doc_content)
    # define query engines
    vector_query_engine = vector_index.as_query_engine()
    list_query_engine = summary_index.as_query_engine()

    # define tools
    query_engine_tools = [
        QueryEngineTool(
            query_engine=vector_query_engine,
            metadata=ToolMetadata(
                name="vector_tool",
                description=f"Useful for retrieving specific context from {doc_name}",
            ),
        ),
        QueryEngineTool(
            query_engine=list_query_engine,
            metadata=ToolMetadata(
                name="summary_tool",
                description=f"Useful for summarization questions related to {doc_name}",
            ),
        ),
    ]

    # build agent
    function_llm = OpenAI(model="gpt-3.5-turbo-0613")
    agent = OpenAIAgent.from_tools(
        query_engine_tools,
        llm=function_llm,
        verbose=True,
    )

    agents[doc_name] = agent

This setup lets each agent dynamically choose the appropriate tool (semantic search or summarization) for its document at query time.
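
For example, once the agents dictionary is built, you can route a question to the agent for a single document (a hypothetical usage sketch; "Toronto" stands in for any key present in city_docs):

# Ask a question scoped to one document's agent
response = agents["Toronto"].chat(
    "Summarize the main topics covered in this document."
)
print(str(response))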

For a more structured approach, you can use the MultiDocAutoRetrieverPack from LlamaIndex, which allows for hierarchical retrieval across multiple documents. Here is an example of how to set it up:

from llama_index.core.llama_pack import download_llama_pack

# Download the MultiDocAutoRetrieverPack
MultiDocAutoRetrieverPack = download_llama_pack(
    "MultiDocAutoRetrieverPack", "./multidoc_autoretriever_pack"
)

# Set up the pack with your documents and metadata.
# client, doc_metadata_index_name, doc_chunks_index_name, new_docs, docs,
# and vector_store_info are assumed to be defined earlier for your
# vector store deployment.
pack = MultiDocAutoRetrieverPack(
    client,
    doc_metadata_index_name,
    doc_chunks_index_name,
    new_docs,
    docs,
    vector_store_info,
    auto_retriever_kwargs={
        "verbose": True,
        "similarity_top_k": 2,
        "empty_query_top_k": 10,
    },
    verbose=True,
)

# Run the pack on a query
response = pack.run("Tell me about some issues on 12/11")
print(str(response))

# Retrieve nodes using the retriever module
retriever = pack.get_modules()["recursive_retriever"]
nodes = retriever.retrieve("Tell me about some open issues related to agents")
print(f"Number of source nodes: {len(nodes)}")
print(nodes[0].node.metadata)

This approach leverages the MultiDocAutoRetrieverPack for more efficient and structured retrieval across multiple documents [1][2][3][4].

To continue talking to Dosu, mention @dosu.