run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License
36.9k stars 5.28k forks

[Question]: How can i return page number as source in Rag. #15321

Open arunnuve opened 3 months ago

arunnuve commented 3 months ago

Question

This is my code:

from llama_index.core.node_parser import MarkdownElementNodeParser
from llama_index.postprocessor.flag_embedding_reranker import FlagEmbeddingReranker

node_parser = MarkdownElementNodeParser(
    llm=OpenAI(model="gpt-3.5-turbo-0125"), num_workers=8
)
nodes = node_parser.get_nodes_from_documents(documents)
base_nodes, objects = node_parser.get_nodes_and_objects(nodes)
recursive_index = VectorStoreIndex(nodes=base_nodes + objects)

reranker = FlagEmbeddingReranker(
    top_n=5,
    model="BAAI/bge-reranker-large",
)

recursive_query_engine = recursive_index.as_query_engine(
    similarity_top_k=5, node_postprocessors=[reranker], verbose=True
)
query = "Explain about uber?"
response_2 = recursive_query_engine.query(query)

Now it gives only the response. I need to get the page number it was retrieved from.

dosubot[bot] commented 3 months ago

To return the page number as a source in your retrieval-augmented generation (RAG) setup, you need to ensure that the metadata for each Node includes the page number. This metadata can then be used during the querying stage to provide the page number along with the retrieved context.

Here's how you can modify your code to include page numbers:

  1. Loading Stage: Ensure that each Node includes metadata for the page number when loading your documents.

  2. Indexing Stage: Index the documents with the metadata.

  3. Querying Stage: Retrieve the nodes and include the page number in the response.
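Independent of any particular library, the three stages above can be sketched in plain Python. This is only an illustration of how a page number travels with a chunk from loading to retrieval: the `Chunk` class, `load_chunks`, and `retrieve` are hypothetical stand-ins, and the keyword scoring is a toy replacement for real embedding search.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical stand-in for a node: text plus a metadata dict."""
    text: str
    metadata: dict = field(default_factory=dict)


def load_chunks(pages, chunk_size=80):
    """Loading stage: split each page into chunks tagged with its 1-based page number."""
    chunks = []
    for page_number, page_text in enumerate(pages, start=1):
        for start in range(0, len(page_text), chunk_size):
            chunks.append(
                Chunk(page_text[start:start + chunk_size], {"page_number": page_number})
            )
    return chunks


def retrieve(chunks, query, top_k=2):
    """Querying stage: rank chunks by how many query words they contain (toy scoring)."""
    words = set(query.lower().split())
    scored = [(sum(w in c.text.lower() for w in words), c) for c in chunks]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_k]]


pages = [
    "Lyft operates a ridesharing marketplace.",
    "Uber competes with Lyft in ridesharing and delivery.",
]
chunks = load_chunks(pages)  # indexing stage: here the "index" is just a list
hits = retrieve(chunks, "uber delivery")
cited_pages = sorted({c.metadata["page_number"] for c in hits})
print(cited_pages)
```

The point of the sketch is that the page number is attached once, at load time, and then simply read back from whatever the retriever returns.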

Here is an example of how you can achieve this:

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.llms.openai import OpenAI
from llama_index.core.node_parser import MarkdownElementNodeParser
from llama_index.postprocessor.flag_embedding_reranker import FlagEmbeddingReranker

# Step 1: Load the PDF. SimpleDirectoryReader's default PDF reader typically
# returns one Document per page and stores the page in metadata["page_label"].
reader = SimpleDirectoryReader(input_files=["./data/10k/lyft_2021.pdf"])
data = reader.load_data()

# Parse each page separately so every node can be tagged with its page
node_parser = MarkdownElementNodeParser(
    llm=OpenAI(model="gpt-3.5-turbo-0125"), num_workers=8
)
nodes = []
for page_num, doc in enumerate(data, start=1):
    for n in node_parser.get_nodes_from_documents([doc]):
        # Prefer the reader's page label; fall back to the document's position
        n.metadata["page_number"] = doc.metadata.get("page_label", str(page_num))
        nodes.append(n)

# Step 2: Index the tagged nodes (not the raw documents) so the metadata is kept
index = VectorStoreIndex(nodes=nodes)

# Step 3: Set up the query engine with reranker
reranker = FlagEmbeddingReranker(
    top_n=5,
    model="BAAI/bge-reranker-large",
)
query_engine = index.as_query_engine(
    similarity_top_k=5, node_postprocessors=[reranker], verbose=True
)

# Step 4: Query the index
query = "Explain about uber?"
response = query_engine.query(query)

# Step 5: Print the response (the query engine above is not in streaming mode)
print(response)

# Inspect source nodes to include page numbers in the response
for node in response.source_nodes:
    print("-----")
    text_fmt = node.node.get_content().strip().replace("\n", " ")[:1000]
    print(f"Text:\t {text_fmt} ...")
    print(f"Metadata:\t {node.node.metadata}")
    print(f"Score:\t {node.score:.3f}")

In this example, the page number is stored in each node's metadata, so it can be printed or returned alongside the response in your RAG setup [1][2].
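To turn the printed source nodes into a citation on the answer itself, a small helper along these lines may be enough. It is a sketch, not part of LlamaIndex: `pages_from_sources` and `with_citation` are hypothetical names, and the only assumption about the inputs is that each source item exposes `.node.metadata`, as in the loop above. Stand-in objects are used so the snippet runs on its own.

```python
from types import SimpleNamespace


def pages_from_sources(source_nodes, key="page_number"):
    """Return the distinct page values found in source-node metadata, in retrieval order."""
    pages = []
    for item in source_nodes:
        page = item.node.metadata.get(key)
        if page is not None and page not in pages:
            pages.append(page)
    return pages


def with_citation(answer, source_nodes, key="page_number"):
    """Append a 'Sources' suffix to the answer text if any pages were found."""
    pages = pages_from_sources(source_nodes, key)
    if not pages:
        return answer
    return f"{answer} (Sources: pages {', '.join(str(p) for p in pages)})"


# Stand-in objects shaped like the source nodes printed above (.node.metadata, .score)
fake_sources = [
    SimpleNamespace(node=SimpleNamespace(metadata={"page_number": 12}), score=0.91),
    SimpleNamespace(node=SimpleNamespace(metadata={"page_number": 3}), score=0.74),
    SimpleNamespace(node=SimpleNamespace(metadata={"page_number": 12}), score=0.55),
]
cited = with_citation("Uber is a mobility company.", fake_sources)
print(cited)
```

With a real query result, the call would presumably look like `with_citation(str(response), response.source_nodes)`. If your reader stores the page under a different metadata key, such as `page_label`, pass that as `key`.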

To continue talking to Dosu, mention @dosu.

arunnuve commented 3 months ago

@dosu This is how I am loading my documents. I am not using SimpleDirectoryReader; I am using LlamaParse instead. Here is the snippet of code:

documents = LlamaParse(result_type="markdown").load_data("uber2k.pdf")

dosubot[bot] commented 3 months ago

To include page numbers in the metadata when using LlamaParse, you can modify your code to ensure that each Node includes the page number. Here's how you can do it:

  1. Attach the page number to the metadata of each document or node before indexing.
  2. Ensure that the metadata is indexed.
  3. Retrieve the page numbers during the querying stage.

Here's an example of how you can achieve this:

from llama_index.core import VectorStoreIndex
from llama_index.postprocessor.flag_embedding_reranker import FlagEmbeddingReranker
from llama_parse import LlamaParse

# Step 1: Load the document and attach a page number to each Document.
# Assumption: LlamaParse returns one Document per page, in page order.
# Check len(documents) against the PDF's page count before relying on this.
documents = LlamaParse(result_type="markdown").load_data("uber2k.pdf")
for page_num, doc in enumerate(documents, start=1):
    doc.metadata["page_number"] = page_num

# Step 2: Index the documents; the default node parser copies each
# document's metadata onto the nodes it creates
index = VectorStoreIndex.from_documents(documents)

# Step 3: Set up the query engine with reranker
reranker = FlagEmbeddingReranker(
    top_n=5,
    model="BAAI/bge-reranker-large",
)
query_engine = index.as_query_engine(
    similarity_top_k=5, node_postprocessors=[reranker], verbose=True
)

# Step 4: Query the index
query = "Explain about uber?"
response = query_engine.query(query)

# Step 5: Print the response (the query engine above is not in streaming mode)
print(response)

# Inspect source nodes to include page numbers in the response
for node in response.source_nodes:
    print("-----")
    text_fmt = node.node.get_content().strip().replace("\n", " ")[:1000]
    print(f"Text:\t {text_fmt} ...")
    print(f"Metadata:\t {node.node.metadata}")
    print(f"Score:\t {node.score:.3f}")

In this example, the page number is again carried in the metadata, so it appears in the source nodes returned with the response in your RAG setup [1][2][3].


tush05tgsingh commented 1 week ago

@dosubot I am also working on the same issue, but I am using documents = SimpleDirectoryReader(path, file_extractor=file_extractor).load_data(), and the documents don't get a page number in their metadata.
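When a custom extractor leaves the page number out of the metadata, one possible fallback is to number the documents by their position within each file. The sketch below assumes the extractor returns one document per page, in page order, and that each document's metadata carries a `file_name` entry (SimpleDirectoryReader adds one). The helper `add_missing_page_numbers` is a hypothetical name, and stand-in objects are used so the snippet runs without the library.

```python
from collections import defaultdict
from types import SimpleNamespace


def add_missing_page_numbers(documents, file_key="file_name", page_key="page_number"):
    """Give each document a 1-based page number within its file, unless one is already set.

    Assumption: documents arrive one per page and in page order for each file.
    """
    counters = defaultdict(int)
    for doc in documents:
        source = doc.metadata.get(file_key, "")
        counters[source] += 1
        # setdefault keeps any page number the extractor already provided
        doc.metadata.setdefault(page_key, counters[source])
    return documents


# Stand-ins for loaded documents: two pages from a.pdf, one page from b.pdf
docs = [
    SimpleNamespace(metadata={"file_name": "a.pdf"}),
    SimpleNamespace(metadata={"file_name": "a.pdf"}),
    SimpleNamespace(metadata={"file_name": "b.pdf"}),
]
add_missing_page_numbers(docs)
page_numbers = [d.metadata["page_number"] for d in docs]
print(page_numbers)
```

If the extractor returns a single document for the whole file, this numbering would be meaningless, so it is worth comparing the number of documents per file with the PDF's page count first.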