cxycxm opened 1 month ago
To handle complex PDFs that contain images, tables, and other intricate elements, you can use the Smart PDF Loader provided by LlamaIndex. This loader processes PDFs by understanding their layout structure, such as nested sections, lists, paragraphs, and tables, and smartly chunks them into optimal short contexts for LLMs. Here is how you can use it:
Install the Smart PDF Loader:

```bash
pip install llama-index-readers-smart-pdf-loader
```

Ensure you also have the `llmsherpa` library installed:

```bash
pip install llmsherpa
```
Use the following code to load your PDF documents:

```python
from llama_index.readers.smart_pdf_loader import SmartPDFLoader

llmsherpa_api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
pdf_url = "https://arxiv.org/pdf/1910.13461.pdf"  # You can also use a local file path, e.g., /home/downloads/xyz.pdf

pdf_loader = SmartPDFLoader(llmsherpa_api_url=llmsherpa_api_url)
documents = pdf_loader.load_data(pdf_url)
```
You can then use the loaded documents with other LlamaIndex components. For example, for retrieval augmented generation:
```python
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

response = query_engine.query("list all the tasks that work with bart")
print(response)

response = query_engine.query("what is the bart performance score on squad")
print(response)
```
Additionally, for documents with hierarchical relationships, you can use the `RecursiveRetriever` module. This module not only retrieves the most directly relevant nodes but also follows node relationships to additional retrievers/query engines and executes them. For instance, a node may represent a concise summary of a structured table and link to a SQL/Pandas query engine over that structured table. Here is an example of how to use it:
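To make the routing idea concrete before the full example: a retrieved node is either plain text or carries an id that points to another query engine, which is executed in its place. This is a toy, plain-Python sketch of that mechanism (the dicts and lambda stand in for real `IndexNode` and `PandasQueryEngine` objects — it is not the LlamaIndex API):

```python
# Toy sketch of recursive retrieval: a node is either plain text or a
# pointer (index_id) to another query engine that is executed instead.
nodes = [
    {"text": "summary of a billionaires table", "index_id": "pandas0"},
    {"text": "an ordinary text chunk", "index_id": None},
]
# Stand-in for the id -> query engine mapping used later in the example
engines = {"pandas0": lambda q: f"answer from table engine for: {q}"}

def recursive_retrieve(node, query):
    if node["index_id"] is not None:
        # Follow the link and run the sub-engine instead of returning the summary
        return engines[node["index_id"]](query)
    return node["text"]

print(recursive_retrieve(nodes[0], "net worth?"))  # routed to the table engine
print(recursive_retrieve(nodes[1], "net worth?"))  # plain chunk text returned
```

The real `RecursiveRetriever` does the same resolution step: any retrieved `IndexNode` whose `index_id` matches a key in `query_engine_dict` is replaced by a call to that engine.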
Install the necessary libraries:
```bash
pip install llama-index-embeddings-openai
pip install llama-index-readers-file pymupdf
pip install llama-index-llms-openai
pip install llama-index-experimental
```
Use the following code to load your PDF and extract tables using `camelot`:

```python
import camelot
from llama_index.core import VectorStoreIndex
from llama_index.experimental.query_engine import PandasQueryEngine
from llama_index.core.schema import IndexNode
from llama_index.llms.openai import OpenAI
from llama_index.readers.file import PyMuPDFReader
from typing import List

file_path = "billionaires_page.pdf"
reader = PyMuPDFReader()
docs = reader.load(file_path)

def get_tables(path: str, pages: List[int]):
    table_dfs = []
    for page in pages:
        table_list = camelot.read_pdf(path, pages=str(page))
        table_df = table_list[0].df
        # Promote the first row to the column header, then drop it from the data
        table_df = (
            table_df.rename(columns=table_df.iloc[0])
            .drop(table_df.index[0])
            .reset_index(drop=True)
        )
        table_dfs.append(table_df)
    return table_dfs

table_dfs = get_tables(file_path, pages=[3, 25])
```
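`camelot` returns each table with the real header row as ordinary data, which is why `get_tables` promotes row 0 to column names. Here is a minimal, self-contained sketch of that cleanup chain on a toy DataFrame (the rows are illustrative, not taken from the billionaires PDF):

```python
import pandas as pd

# camelot-style raw table: the real header sits in row 0 as data
raw = pd.DataFrame([
    ["Rank", "Name", "Net worth"],
    ["1", "A", "$211 B"],
    ["2", "B", "$180 B"],
])

# Same chain as in get_tables: promote row 0 to the header, drop it, reindex
clean = (
    raw.rename(columns=raw.iloc[0])
    .drop(raw.index[0])
    .reset_index(drop=True)
)

print(list(clean.columns))  # ['Rank', 'Name', 'Net worth']
print(len(clean))           # 2 data rows remain
```

Without this step, the `PandasQueryEngine` would see numeric column labels and the header strings mixed into the data, making natural-language queries over the table unreliable.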
Create Pandas query engines for the extracted tables:

```python
llm = OpenAI(model="gpt-4")
df_query_engines = [
    PandasQueryEngine(table_df, llm=llm) for table_df in table_dfs
]
```
Build a vector index over the chunked document and additional `IndexNode` objects linked to the tables:

```python
from llama_index.core import Settings

doc_nodes = Settings.node_parser.get_nodes_from_documents(docs)

summaries = [
    "This node provides information about the world's richest billionaires in 2023",
    "This node provides information on the number of billionaires and their combined net worth from 2000 to 2023.",
]
df_nodes = [
    IndexNode(text=summary, index_id=f"pandas{idx}")
    for idx, summary in enumerate(summaries)
]
df_id_query_engine_mapping = {
    f"pandas{idx}": df_query_engine
    for idx, df_query_engine in enumerate(df_query_engines)
}

vector_index = VectorStoreIndex(doc_nodes + df_nodes)
vector_retriever = vector_index.as_retriever(similarity_top_k=1)
```
Use `RecursiveRetriever` in your `RetrieverQueryEngine`:

```python
from llama_index.core.retrievers import RecursiveRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core import get_response_synthesizer

recursive_retriever = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": vector_retriever},
    query_engine_dict=df_id_query_engine_mapping,
    verbose=True,
)
response_synthesizer = get_response_synthesizer(response_mode="compact")
query_engine = RetrieverQueryEngine.from_args(
    recursive_retriever, response_synthesizer=response_synthesizer
)

response = query_engine.query(
    "What's the net worth of the second richest billionaire in 2023?"
)
print(str(response))
```
Alternatively, you can use the Upstage Layout Analysis Reader for more advanced layout analysis. This reader can handle complex PDFs by extracting text, images, tables, and other elements using OCR if needed. Here is how you can use it:
Install the necessary libraries:

```bash
pip install llama-index-readers-upstage
```

Use the following code to load your PDF documents:

```python
from llama_index.readers.upstage.base import UpstageLayoutAnalysisReader

reader = UpstageLayoutAnalysisReader(api_key="your_upstage_api_key", use_ocr=True)
docs = reader.load_data("path/to/file.pdf")
```
This approach ensures that the documents are broken down into a hierarchical semantic structure, providing a meaningful representation of the document's content, including headings, paragraphs, lists, tables, and other elements [1][2][3][4].
https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all — what's this? Can I process complex PDFs locally?
Question Validation
Question
How to handle complex PDFs, such as PDFs with images, tables, etc.?