Open arunnuve opened 3 months ago
To return the page number as a source in your retrieval-augmented generation (RAG) setup, you need to ensure that the metadata for each Node includes the page number.
This metadata can then be used during the querying stage to provide the page number along with the retrieved context.
Here's how you can modify your code to include page numbers:
Loading Stage: Ensure that each Node includes metadata for the page number when loading your documents.
Indexing Stage: Index the documents with the metadata.
Querying Stage: Retrieve the nodes and include the page number in the response.
Here is an example of how you can achieve this:
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.llms.openai import OpenAI
from llama_index.core.node_parser import MarkdownElementNodeParser
from llama_index.postprocessor.flag_embedding_reranker import FlagEmbeddingReranker
# Step 1: Load your document and create nodes with page number metadata
reader = SimpleDirectoryReader(input_files=["./data/10k/lyft_2021.pdf"])
data = reader.load_data()
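# Assuming the default PDF reader, which returns one Document per page
# and stores the page in metadata["page_label"]
node_parser = MarkdownElementNodeParser(
    llm=OpenAI(model="gpt-3.5-turbo-0125"), num_workers=8
)
nodes = []
for page_num, doc in enumerate(data):
    page_nodes = node_parser.get_nodes_from_documents([doc])
    for n in page_nodes:
        # Fall back to the document position if the reader set no page label
        n.metadata["page_number"] = doc.metadata.get("page_label", page_num + 1)
    nodes.extend(page_nodes)
# Step 2: Index the nodes (not the raw documents) so the page metadata is kept
base_nodes, objects = node_parser.get_nodes_and_objects(nodes)
index = VectorStoreIndex(nodes=base_nodes + objects)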
# Step 3: Set up the query engine with reranker
reranker = FlagEmbeddingReranker(
    top_n=5,
    model="BAAI/bge-reranker-large",
)
query_engine = index.as_query_engine(
    similarity_top_k=5, node_postprocessors=[reranker], verbose=True
)
# Step 4: Query the index
query = "Explain about uber?"
response = query_engine.query(query)
# Step 5: Print the response (this query engine is not created with streaming=True)
print(response)
# Inspect source nodes to include page numbers in the response
for node in response.source_nodes:
    print("-----")
    text_fmt = node.node.get_content().strip().replace("\n", " ")[:1000]
    print(f"Text:\t {text_fmt} ...")
    print(f"Metadata:\t {node.node.metadata}")
    print(f"Score:\t {node.score:.3f}")
In this example:
Each Node is created with metadata that includes the page number.
VectorStoreIndex indexes these nodes.
QueryEngine retrieves the relevant nodes based on the query.
This approach ensures that the page number is part of the metadata and can be returned as part of the response in your RAG setup [1][2].
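Once the page number is in the metadata, turning the source nodes into a short citation line is a small post-processing step. The following is a minimal sketch, not part of LlamaIndex: collect_page_citations is a hypothetical helper, and the stand-in objects only mimic the .node.metadata shape used in the loop above. With a real response you would pass response.source_nodes instead.

```python
from types import SimpleNamespace


def collect_page_citations(source_nodes, key="page_number"):
    """Return the sorted, de-duplicated page values found in source node metadata."""
    pages = set()
    for item in source_nodes:
        page = item.node.metadata.get(key)
        if page is not None:
            pages.add(str(page))
    # Numeric pages first in numeric order, then non-numeric labels (e.g. "iv")
    return sorted(pages, key=lambda p: (not p.isdigit(), int(p) if p.isdigit() else 0, p))


# Stand-in objects (hypothetical data) that mimic the .node.metadata shape
fake_sources = [
    SimpleNamespace(node=SimpleNamespace(metadata={"page_number": 12}), score=0.91),
    SimpleNamespace(node=SimpleNamespace(metadata={"page_number": 3}), score=0.88),
    SimpleNamespace(node=SimpleNamespace(metadata={"page_number": 12}), score=0.80),
    SimpleNamespace(node=SimpleNamespace(metadata={}), score=0.42),
]
citations = collect_page_citations(fake_sources)
print("Pages:", ", ".join(citations))
```

Nodes without a page number are skipped rather than reported as page 0, so a missing value never looks like a real citation.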
To continue talking to Dosu, mention @dosu.
@dosu this is how I am loading my documents. I am not using SimpleDirectoryReader; instead I am using LlamaParse. Here is the snippet of code: documents = LlamaParse(result_type="markdown").load_data("uber2k.pdf")
To include page numbers in the metadata when using LlamaParse, you can modify your code to ensure that each Node includes the page number. Here's how you can do it:
Modify the TextNode creation to include page number metadata.
Here's an example of how you can achieve this:
from llama_index.core import VectorStoreIndex
from llama_index.core.schema import TextNode
from llama_index.postprocessor.flag_embedding_reranker import FlagEmbeddingReranker
from llama_parse import LlamaParse
# Step 1: Load your document and create nodes with page number metadata
# Assumption: get_json_result returns one entry per file with a "pages" list,
# and each page dict carries a "page" number and the page markdown under "md"
parser = LlamaParse(result_type="markdown")
json_result = parser.get_json_result("uber2k.pdf")
nodes = []
for page in json_result[0]["pages"]:
    nodes.append(
        TextNode(text=page["md"], metadata={"page_number": page["page"]})
    )
# Step 2: Index the nodes
index = VectorStoreIndex(nodes=nodes)
# Step 3: Set up the query engine with reranker
reranker = FlagEmbeddingReranker(
    top_n=5,
    model="BAAI/bge-reranker-large",
)
query_engine = index.as_query_engine(
    similarity_top_k=5, node_postprocessors=[reranker], verbose=True
)
# Step 4: Query the index
query = "Explain about uber?"
response = query_engine.query(query)
# Step 5: Print the response (this query engine is not created with streaming=True)
print(response)
# Inspect source nodes to include page numbers in the response
for node in response.source_nodes:
    print("-----")
    text_fmt = node.node.get_content().strip().replace("\n", " ")[:1000]
    print(f"Text:\t {text_fmt} ...")
    print(f"Metadata:\t {node.node.metadata}")
    print(f"Score:\t {node.score:.3f}")
In this example:
Each TextNode is created with metadata that includes the page number.
VectorStoreIndex indexes these nodes.
QueryEngine retrieves the relevant nodes based on the query.
This approach ensures that the page number is part of the metadata and can be returned as part of the response in your RAG setup [1][2][3].
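The core of the idea is independent of the parser: keep each page's text together with its page number before building nodes. The sketch below uses plain dictionaries and assumes a per-page result shaped like {"pages": [{"page": ..., "md": ...}]}. That shape, the helper name pages_to_records, and the sample data are all assumptions for illustration and are not confirmed by this thread.

```python
def pages_to_records(parse_result, text_key="md"):
    """Turn a per-page parse result into one {text, metadata} record per non-blank page."""
    records = []
    for page in parse_result["pages"]:
        text = page.get(text_key, "")
        if not text.strip():
            continue  # skip blank pages so they never become empty nodes
        records.append({"text": text, "metadata": {"page_number": page["page"]}})
    return records


# Made-up sample in the assumed shape
sample = {
    "pages": [
        {"page": 1, "md": "# Overview\nFirst page text."},
        {"page": 2, "md": "   "},
        {"page": 3, "md": "## Details\nThird page text."},
    ]
}
records = pages_to_records(sample)
for record in records:
    print(record["metadata"]["page_number"], record["text"].splitlines()[0])
```

Each record maps directly onto a node constructor that accepts text and metadata, so the page number travels with the text through indexing and retrieval.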
@dosubot I am also working on the same issue, but I am using documents = SimpleDirectoryReader(path, file_extractor=file_extractor).load_data(), and the documents don't get a page number in their metadata.
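One possible workaround when a loader leaves the page number out of the metadata is to fill it in after loading. The sketch below is only valid under a strong assumption: the loader returns exactly one document per page, in page order. The helper ensure_page_numbers and the stand-in documents are hypothetical; with real documents you would pass the list returned by load_data().

```python
from types import SimpleNamespace


def ensure_page_numbers(documents, key="page_number", source_key="page_label"):
    """Copy an existing page label into `key`, or fall back to the 1-based position."""
    for position, doc in enumerate(documents, start=1):
        if key in doc.metadata:
            continue  # keep a page number that is already set
        doc.metadata[key] = doc.metadata.get(source_key, position)
    return documents


# Stand-in documents (hypothetical data) exposing only a .metadata dict
docs = [
    SimpleNamespace(metadata={"page_label": "1"}),
    SimpleNamespace(metadata={}),
    SimpleNamespace(metadata={"page_number": 7}),
]
ensure_page_numbers(docs)
print([d.metadata["page_number"] for d in docs])
```

If the loader merges several pages into one document, the positional fallback gives wrong numbers, so it is worth checking len(documents) against the PDF's page count first.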
Question
This is my code:
from llama_index.core import VectorStoreIndex
from llama_index.llms.openai import OpenAI
from llama_index.core.node_parser import MarkdownElementNodeParser
from llama_index.postprocessor.flag_embedding_reranker import FlagEmbeddingReranker
node_parser = MarkdownElementNodeParser(
    llm=OpenAI(model="gpt-3.5-turbo-0125"), num_workers=8
)
nodes = node_parser.get_nodes_from_documents(documents)
base_nodes, objects = node_parser.get_nodes_and_objects(nodes)
recursive_index = VectorStoreIndex(nodes=base_nodes + objects)
reranker = FlagEmbeddingReranker(
    top_n=5,
    model="BAAI/bge-reranker-large",
)
recursive_query_engine = recursive_index.as_query_engine(
    similarity_top_k=5, node_postprocessors=[reranker], verbose=True
)
query = "Explain about uber?"
response_2 = recursive_query_engine.query(query)
Now it only gives the response. I need to get the page number that each result was retrieved from.