run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: How to handle complex PDFs #15243

Open cxycxm opened 1 month ago

cxycxm commented 1 month ago

Question

How do I handle complex PDFs, such as PDFs with images, tables, etc.?

dosubot[bot] commented 1 month ago

To handle complex PDFs that contain images, tables, and other intricate elements, you can use the Smart PDF Loader provided by LlamaIndex. This loader parses the layout structure of a PDF (nested sections, lists, paragraphs, and tables) and splits the document into short, layout-aware chunks that fit well into LLM contexts. Here is how you can use it:

  1. Install the Smart PDF Loader:

    pip install llama-index-readers-smart-pdf-loader
  2. Ensure you have the llmsherpa library installed:

    pip install llmsherpa
  3. Use the following code to load your PDF documents:

    from llama_index.readers.smart_pdf_loader import SmartPDFLoader
    
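    # Endpoint of the llmsherpa parsing service; the loader sends the PDF there for layout analysis.
    llmsherpa_api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
    pdf_url = "https://arxiv.org/pdf/1910.13461.pdf"  # You can also use a file path, e.g., /home/downloads/xyz.pdf
    pdf_loader = SmartPDFLoader(llmsherpa_api_url=llmsherpa_api_url)
    # Returns a list of Document objects, one per layout-aware chunk.
    documents = pdf_loader.load_data(pdf_url)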
  4. You can then use the loaded documents with other LlamaIndex components. For example, for retrieval augmented generation:

    from llama_index.core import VectorStoreIndex
    
    index = VectorStoreIndex.from_documents(documents)
    query_engine = index.as_query_engine()
    
    response = query_engine.query("list all the tasks that work with bart")
    print(response)
    
    response = query_engine.query("what is the bart performance score on squad")
    print(response)
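
To make the idea of layout-aware chunking concrete, here is a minimal, standard-library-only sketch. It is not the algorithm llmsherpa uses; the `Block` type, the `chunk_by_section` helper, and the sample data are hypothetical and only illustrate grouping layout elements under their nearest heading so that each chunk stays within one section:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """One layout element (hypothetical type); kind is 'heading', 'paragraph', or 'table'."""
    kind: str
    text: str


def chunk_by_section(blocks: List[Block]) -> List[str]:
    """Group blocks under their nearest preceding heading, one chunk per section."""
    chunks: List[str] = []
    current: List[str] = []
    for block in blocks:
        # A heading closes the section collected so far and opens a new one.
        if block.kind == "heading" and current:
            chunks.append("\n".join(current))
            current = []
        current.append(block.text)
    # Flush the last open section.
    if current:
        chunks.append("\n".join(current))
    return chunks


# Toy input standing in for the parsed layout of a PDF.
blocks = [
    Block("heading", "1 Introduction"),
    Block("paragraph", "Intro paragraph."),
    Block("heading", "2 Results"),
    Block("table", "Model | Score"),
    Block("paragraph", "Discussion of the table."),
]
chunks = chunk_by_section(blocks)
for chunk in chunks:
    print(chunk)
    print("---")
```

Because the table and the paragraph that discusses it end up in the same chunk, a retriever is less likely to return one without the other.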

Additionally, for documents with hierarchical relationships, you can use the RecursiveRetriever module. This module retrieves the most relevant nodes and also follows links from those nodes to additional retrievers or query engines, which it then executes. For instance, a node may represent a concise summary of a structured table and link to a SQL/Pandas query engine over that structured table. Here is an example of how to use it:

  1. Install the necessary libraries:

    pip install llama-index-embeddings-openai
    pip install llama-index-readers-file pymupdf
    pip install llama-index-llms-openai
    pip install llama-index-experimental
    pip install camelot-py  # provides the camelot module used for table extraction
  2. Use the following code to load your PDF and extract tables using camelot:

    import camelot
    from llama_index.core import VectorStoreIndex
    from llama_index.experimental.query_engine import PandasQueryEngine
    from llama_index.core.schema import IndexNode
    from llama_index.llms.openai import OpenAI
    from llama_index.readers.file import PyMuPDFReader
    from typing import List
    
    file_path = "billionaires_page.pdf"
    reader = PyMuPDFReader()
    docs = reader.load(file_path)
    
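    def get_tables(path: str, pages: List[int]):
        """Extract the first table on each listed page as a DataFrame."""
        table_dfs = []
        for page in pages:
            table_list = camelot.read_pdf(path, pages=str(page))
            # Use the first table detected on the page.
            table_df = table_list[0].df
            # Promote the first row to column headers, then drop that row.
            table_df = (
                table_df.rename(columns=table_df.iloc[0])
                .drop(table_df.index[0])
                .reset_index(drop=True)
            )
            table_dfs.append(table_df)
        return table_dfs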
    
    table_dfs = get_tables(file_path, pages=[3, 25])
  3. Create Pandas Query Engines for the extracted tables:

    llm = OpenAI(model="gpt-4")
    df_query_engines = [PandasQueryEngine(table_df, llm=llm) for table_df in table_dfs]
  4. Build a vector index over the chunked document and additional IndexNode objects linked to the tables:

    from llama_index.core import Settings
    
    doc_nodes = Settings.node_parser.get_nodes_from_documents(docs)
    summaries = [
        "This node provides information about the world's richest billionaires in 2023",
        "This node provides information on the number of billionaires and their combined net worth from 2000 to 2023."
    ]
    
    df_nodes = [IndexNode(text=summary, index_id=f"pandas{idx}") for idx, summary in enumerate(summaries)]
    df_id_query_engine_mapping = {f"pandas{idx}": df_query_engine for idx, df_query_engine in enumerate(df_query_engines)}
    
    vector_index = VectorStoreIndex(doc_nodes + df_nodes)
    vector_retriever = vector_index.as_retriever(similarity_top_k=1)
  5. Use RecursiveRetriever in your RetrieverQueryEngine:

    from llama_index.core.retrievers import RecursiveRetriever
    from llama_index.core.query_engine import RetrieverQueryEngine
    from llama_index.core import get_response_synthesizer
    
    recursive_retriever = RecursiveRetriever(
        "vector",
        retriever_dict={"vector": vector_retriever},
        query_engine_dict=df_id_query_engine_mapping,
        verbose=True,
    )
    
    response_synthesizer = get_response_synthesizer(response_mode="compact")
    query_engine = RetrieverQueryEngine.from_args(
        recursive_retriever, response_synthesizer=response_synthesizer
    )
    
    response = query_engine.query("What's the net worth of the second richest billionaire in 2023?")
    print(str(response))
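
The routing idea behind the steps above can be sketched without any LlamaIndex dependency. This is a simplified illustration, not the RecursiveRetriever implementation: the `Node` class, the word-overlap score, and the `recursive_query` helper are hypothetical stand-ins for embedding-based retrieval and real query engines:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Node:
    """A retrievable chunk; index_id is set only when the node links to an engine."""
    text: str
    index_id: Optional[str] = None


def overlap(query: str, text: str) -> int:
    """Toy relevance score: number of lowercase words shared by query and text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def recursive_query(
    query: str, nodes: List[Node], engines: Dict[str, Callable[[str], str]]
) -> str:
    """Retrieve the best node; if it links to an engine, delegate the query to it."""
    best = max(nodes, key=lambda n: overlap(query, n.text))
    if best.index_id is not None and best.index_id in engines:
        # The node is only a summary; the linked engine produces the answer.
        return engines[best.index_id](query)
    # Plain text node: return its content directly.
    return best.text


nodes = [
    Node("The report opens with a general introduction"),
    Node("Summary of the table of richest billionaires", index_id="pandas0"),
]
# Stand-in for a table query engine: it just echoes the query it received.
engines = {"pandas0": lambda q: f"[table engine answered: {q}]"}

answer = recursive_query("richest billionaires table", nodes, engines)
print(answer)
```

The summary node acts as a pointer: it is what the retriever matches against, while the linked engine does the actual work over the table.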

Alternatively, you can use the Upstage Layout Analysis Reader for more advanced layout analysis. This reader can handle complex PDFs by extracting text, images, tables, and other elements using OCR if needed. Here is how you can use it:

  1. Install the necessary libraries:

    pip install llama-index-readers-upstage
  2. Use the following code to load your PDF documents:

    from llama_index.readers.upstage.base import UpstageLayoutAnalysisReader
    
    reader = UpstageLayoutAnalysisReader(api_key="your_upstage_api_key", use_ocr=True)
    docs = reader.load_data("path/to/file.pdf")

This approach ensures that the documents are broken down into a hierarchical semantic structure, providing a meaningful representation of the document's content, including headings, paragraphs, lists, tables, and other elements [1][2][3][4].

cxycxm commented 1 month ago

https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all What is this? Can I process complex PDFs locally?