[Question]: How to handle complex PDFs

cxycxm commented 1 month ago

Question Validation

[X] I have searched both the documentation and discord for an answer.

Question

How can I handle complex PDFs, such as PDFs with images, tables, etc.?

dosubot[bot] commented 1 month ago

To handle complex PDFs that contain images, tables, and other intricate elements, you can use the Smart PDF Loader provided by LlamaIndex. This loader parses the PDF's layout structure (nested sections, lists, paragraphs, and tables) and splits the document into short, layout-aware chunks suited to LLM context windows. Here is how you can use it:

Install the Smart PDF Loader:

```
pip install llama-index-readers-smart-pdf-loader
```

Ensure you have the llmsherpa library installed:
```
pip install llmsherpa
```

Use the following code to load your PDF documents:

```
from llama_index.readers.smart_pdf_loader import SmartPDFLoader

llmsherpa_api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
pdf_url = "https://arxiv.org/pdf/1910.13461.pdf"  # You can also use a file path, e.g., /home/downloads/xyz.pdf
pdf_loader = SmartPDFLoader(llmsherpa_api_url=llmsherpa_api_url)
documents = pdf_loader.load_data(pdf_url)
```

You can then use the loaded documents with other LlamaIndex components. For example, for retrieval augmented generation:

```
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

response = query_engine.query("list all the tasks that work with bart")
print(response)

response = query_engine.query("what is the bart performance score on squad")
print(response)
```
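To make the idea of layout-aware chunking concrete, here is a minimal, self-contained sketch. It is not the algorithm used by the Smart PDF Loader or llmsherpa; the `Block` type and the `chunk_with_section_context` function are hypothetical names invented for illustration. It only shows the general principle: each text chunk carries the headings it is nested under, so the chunk stays meaningful when retrieved on its own.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """One layout element (hypothetical type): a heading or a body-text block."""
    kind: str   # "heading" or "text"
    level: int  # heading depth (1 = top level); ignored for text blocks
    text: str


def chunk_with_section_context(blocks: List[Block]) -> List[str]:
    """Prefix every text block with the chain of headings it sits under."""
    path: List[Tuple[int, str]] = []  # stack of (level, heading) for the current position
    chunks: List[str] = []
    for block in blocks:
        if block.kind == "heading":
            # A new heading closes any open section at the same or a deeper level.
            while path and path[-1][0] >= block.level:
                path.pop()
            path.append((block.level, block.text))
        else:
            breadcrumb = " > ".join(title for _, title in path)
            chunks.append(f"{breadcrumb}\n{block.text}" if breadcrumb else block.text)
    return chunks


# Made-up layout blocks standing in for a parsed PDF.
blocks = [
    Block("heading", 1, "Results"),
    Block("heading", 2, "SQuAD"),
    Block("text", 0, "Scores are reported as EM/F1."),
    Block("heading", 2, "GLUE"),
    Block("text", 0, "Scores are reported per task."),
]

for chunk in chunk_with_section_context(blocks):
    print(chunk)
    print("---")
```

A query about SQuAD scores can then match the first chunk through its "Results > SQuAD" prefix even though the body text never mentions SQuAD.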

Additionally, for documents with hierarchical relationships, you can use the RecursiveRetriever module. This module not only explores the directly most relevant nodes but also explores node relationships to additional retrievers/query engines and executes them. For instance, a node may represent a concise summary of a structured table and link to a SQL/Pandas query engine over that structured table. Here is an example of how to use it:

Install the necessary libraries:

```
pip install llama-index-embeddings-openai
pip install llama-index-readers-file pymupdf
pip install llama-index-llms-openai
pip install llama-index-experimental
pip install camelot-py  # provides the camelot module used for table extraction
```

Use the following code to load your PDF and extract tables using camelot:

```
from typing import List

import camelot
from llama_index.core import VectorStoreIndex
from llama_index.experimental.query_engine import PandasQueryEngine
from llama_index.core.schema import IndexNode
from llama_index.llms.openai import OpenAI
from llama_index.readers.file import PyMuPDFReader

# Load the PDF text with PyMuPDF
file_path = "billionaires_page.pdf"
reader = PyMuPDFReader()
docs = reader.load(file_path)

def get_tables(path: str, pages: List[int]):
    """Extract the first table on each given page as a DataFrame."""
    table_dfs = []
    for page in pages:
        table_list = camelot.read_pdf(path, pages=str(page))
        # Take the first table detected on the page
        table_df = table_list[0].df
        # Promote the first row to column headers, then drop that row
        table_df = (
            table_df.rename(columns=table_df.iloc[0])
            .drop(table_df.index[0])
            .reset_index(drop=True)
        )
        table_dfs.append(table_df)
    return table_dfs

table_dfs = get_tables(file_path, pages=[3, 25])
```
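The `rename` / `drop` / `reset_index` chain in `get_tables` can be hard to read at first. The sketch below applies the same chain to a small hand-built DataFrame, so you can see what it does without a PDF or camelot. The table contents are made up for illustration; the assumption is that camelot returns tables with integer column labels and the real header in the first row.

```python
import pandas as pd

# Made-up stand-in for a camelot table: integer column labels,
# with the real header sitting in the first data row.
raw = pd.DataFrame(
    [
        ["Name", "Net worth (USD bn)"],
        ["Alice", "120"],
        ["Bob", "95"],
    ]
)

promoted = (
    raw.rename(columns=raw.iloc[0])  # map column labels 0, 1 to the first row's values
    .drop(raw.index[0])              # remove the row that held the header
    .reset_index(drop=True)          # renumber the remaining rows from 0
)

print(promoted)
```

Note that the cell values stay strings after this step, so numeric columns may need an explicit conversion (for example with `pd.to_numeric`) before numeric queries behave as expected.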

Create Pandas Query Engines for the extracted tables:

```
llm = OpenAI(model="gpt-4")
df_query_engines = [PandasQueryEngine(table_df, llm=llm) for table_df in table_dfs]
```

Build a vector index over the chunked document and additional IndexNode objects linked to the tables:

```
from llama_index.core import Settings

doc_nodes = Settings.node_parser.get_nodes_from_documents(docs)
summaries = [
    "This node provides information about the world's richest billionaires in 2023",
    "This node provides information on the number of billionaires and their combined net worth from 2000 to 2023."
]

df_nodes = [IndexNode(text=summary, index_id=f"pandas{idx}") for idx, summary in enumerate(summaries)]
df_id_query_engine_mapping = {f"pandas{idx}": df_query_engine for idx, df_query_engine in enumerate(df_query_engines)}

vector_index = VectorStoreIndex(doc_nodes + df_nodes)
vector_retriever = vector_index.as_retriever(similarity_top_k=1)
```

Use RecursiveRetriever in your RetrieverQueryEngine:

```
from llama_index.core.retrievers import RecursiveRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core import get_response_synthesizer

recursive_retriever = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": vector_retriever},
    query_engine_dict=df_id_query_engine_mapping,
    verbose=True,
)

response_synthesizer = get_response_synthesizer(response_mode="compact")
query_engine = RetrieverQueryEngine.from_args(
    recursive_retriever, response_synthesizer=response_synthesizer
)

response = query_engine.query("What's the net worth of the second richest billionaire in 2023?")
print(str(response))
```
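The routing behavior described above (retrieve the best node, then follow its link to a query engine if it has one) can be illustrated without any LlamaIndex dependency. The following is a simplified sketch, not the RecursiveRetriever implementation: `Node`, `word_overlap`, and `recursive_lookup` are hypothetical names, word overlap stands in for embedding similarity, and the "engine" is a plain function returning a canned string.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Node:
    """Hypothetical node: plain text, optionally linked to a sub-engine by id."""
    text: str
    index_id: Optional[str] = None  # set only for summary nodes that link to an engine


def word_overlap(query: str, text: str) -> int:
    """Toy relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def recursive_lookup(
    query: str,
    nodes: List[Node],
    engines: Dict[str, Callable[[str], str]],
) -> str:
    """Pick the best-matching node; if it links to an engine, run the query there."""
    best = max(nodes, key=lambda node: word_overlap(query, node.text))
    if best.index_id is not None and best.index_id in engines:
        return engines[best.index_id](query)
    return best.text


nodes = [
    Node("General introduction to the report."),
    Node("Table of the richest people and their net worth", index_id="pandas0"),
]
# Stand-in for a PandasQueryEngine: returns a fixed string instead of querying a table.
engines = {"pandas0": lambda q: "answer computed from table 0"}

print(recursive_lookup("net worth of the richest person", nodes, engines))
print(recursive_lookup("introduction to the report", nodes, engines))
```

The first query matches the table summary node and is forwarded to the linked engine; the second matches an ordinary text node, whose text is returned directly.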

Alternatively, you can use the Upstage Layout Analysis Reader for more advanced layout analysis. This reader can handle complex PDFs by extracting text, images, tables, and other elements using OCR if needed. Here is how you can use it:

Install the necessary libraries:
```
pip install llama-index-readers-upstage
```

Use the following code to load your PDF documents:

```
from llama_index.readers.upstage.base import UpstageLayoutAnalysisReader

reader = UpstageLayoutAnalysisReader(api_key="your_upstage_api_key", use_ocr=True)
docs = reader.load_data("path/to/file.pdf")
```

This approach ensures that the documents are broken down into a hierarchical semantic structure, providing a meaningful representation of the document's content, including headings, paragraphs, lists, tables, and other elements [1][2][3][4].


cxycxm commented 1 month ago

https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all What is this endpoint? Can I process complex PDFs locally?

run-llama / llama_index

[Question]: How to handle complex PDFs #15243
