[Suggestion] PDFReader with LlamaIndex BaseReader and insertion in Llama Hub

YanSte commented 6 months ago

Hi there,

I would like to suggest a PDFReader built on the BaseReader class from LlamaIndex.

The PDFReader class will allow users to read PDF files using the PyMuPDF library with LlamaIndex and your Markdown parser.

The PDFReader class includes the following features:

Loads a list of documents from a PDF file and accepts extra information in a dictionary format.
Option to use markdown format for the text of the documents.
Option to include metadata such as total number of pages and file path.
Asynchronous loading of documents using asyncio.
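
To make the metadata option concrete, here is a minimal sketch in plain Python, independent of PyMuPDF and LlamaIndex. `build_page_metadata` and its arguments are hypothetical names chosen for illustration. The sketch copies the caller's dictionary before adding per-page fields and drops empty values, similar to the cleanup in the prototype code further down.

```python
from pathlib import Path
from typing import Any, Dict, Optional, Union


def build_page_metadata(
    pdf_metadata: Dict[str, Any],
    file_path: Union[Path, str],
    page_index: int,
    total_pages: int,
    extra_info: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Return a fresh metadata dict for one page (hypothetical helper)."""
    # Copy first so the caller's dict is never modified.
    info: Dict[str, Any] = dict(extra_info or {})
    info.update(pdf_metadata)
    # 1-based page number, stored as an int here (the prototype stores a string).
    info["page_number"] = page_index + 1
    info["total_pages"] = total_pages
    info["file_path"] = str(file_path)
    # Drop keys whose value is None or an empty string.
    return {k: v for k, v in info.items() if v is not None and v != ""}


caller_info = {"source": "upload"}
pdf_meta = {"title": "Report", "author": "", "keywords": None}
pages = [
    build_page_metadata(pdf_meta, Path("report.pdf"), i, 2, caller_info)
    for i in range(2)
]
print(pages[0])
print(caller_info)  # unchanged, because each page works on its own copy
```

Returning a new dictionary per page keeps each document's metadata independent of the others and of the dictionary the caller passed in.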

Here is the prototype code:

import asyncio
from pathlib import Path
from typing import Any, Callable, Dict, List, Optional, Union

import fitz
import pymupdf4llm
from fitz import Document as FitzDocument
from llama_index.core.readers.base import BaseReader
from llama_index.core.schema import Document as LlamaIndexDocument
from pydantic.v1 import BaseModel

class PDFReader(BaseModel, BaseReader):
    """Read PDF files using PyMuPDF library."""

    use_format_markdown: bool = True
    use_meta: bool = True
    parse_metadata: Optional[Callable[[Dict[str, Any]], Dict[str, Any]]] = None

    def load_data(
        self,
        file_path: Union[Path, str],
        extra_info: Optional[Dict] = None,
        **load_kwargs: Any,
    ) -> List[LlamaIndexDocument]:
        """Loads list of documents from PDF file and also accepts extra information in dict format.

        Args:
            file_path (Union[Path, str]): The path to the PDF file.
            extra_info (Optional[Dict], optional): A dictionary containing extra information. Defaults to None.
            **load_kwargs (Any): Additional keyword arguments to be passed to the load method.

        Returns:
            List[LlamaIndexDocument]: A list of LlamaIndexDocument objects.
        """
        if not isinstance(file_path, str) and not isinstance(file_path, Path):
            raise TypeError("file_path must be a string or Path.")

        # Validate the type before defaulting, so falsy non-dict values are rejected too.
        if extra_info is None:
            extra_info = {}
        elif not isinstance(extra_info, dict):
            raise TypeError("extra_info must be a dictionary.")

        doc: FitzDocument = fitz.open(file_path)

        if self.use_format_markdown:
            docs = []
            for page in doc:
                docs.append(
                    self._process_doc_page(doc, extra_info, file_path, page.number)
                )
            return docs
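        else:
            # Plain-text mode: Document has no get_text(), so read each page
            # and return Documents (not bytes) to match the declared return type.
            return [
                LlamaIndexDocument(text=page.get_text(), extra_info=dict(extra_info))
                for page in doc
            ]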

    async def aload_data(
        self,
        file_path: Union[Path, str],
        extra_info: Optional[Dict] = None,
        **load_kwargs: Any,
    ) -> List[LlamaIndexDocument]:
        """Asynchronously loads list of documents from PDF file and also accepts extra information in dict format.

        Args:
            file_path (Union[Path, str]): The path to the PDF file.
            extra_info (Optional[Dict], optional): A dictionary containing extra information. Defaults to None.
            **load_kwargs (Any): Additional keyword arguments to be passed to the load method.

        Returns:
            List[LlamaIndexDocument]: A list of LlamaIndexDocument objects.
        """
        if not isinstance(file_path, str) and not isinstance(file_path, Path):
            raise TypeError("file_path must be a string or Path.")

        # Validate the type before defaulting, so falsy non-dict values are rejected too.
        if extra_info is None:
            extra_info = {}
        elif not isinstance(extra_info, dict):
            raise TypeError("extra_info must be a dictionary.")

        doc: FitzDocument = fitz.open(file_path)

        if self.use_format_markdown:
            tasks = []
            for page in doc:
                tasks.append(
                    self._aprocess_doc_page(doc, extra_info, file_path, page.number)
                )
            return await asyncio.gather(*tasks)
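        else:
            # Plain-text mode: Document has no get_text(), so read each page
            # and return Documents (not bytes) to match the declared return type.
            return [
                LlamaIndexDocument(text=page.get_text(), extra_info=dict(extra_info))
                for page in doc
            ]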

    # Helpers
    # ---
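    async def _aprocess_doc_page(
        self,
        doc: FitzDocument,
        extra_info: Dict[str, Any],
        file_path: Union[Path, str],
        page_number: int,
    ) -> LlamaIndexDocument:
        """Asynchronously processes a single page of a PDF document."""
        # Note: this calls the synchronous helper directly, so pages are
        # still converted one after another rather than concurrently.
        return self._process_doc_page(doc, extra_info, file_path, page_number)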

    def _process_doc_page(
        self,
        doc: FitzDocument,
        extra_info: Dict[str, Any],
        file_path: Union[Path, str],
        page_number: int,
    ) -> LlamaIndexDocument:
        """Processes a single page of a PDF document."""
        if self.use_meta:
            extra_info = self._process_meta(doc, file_path, page_number, extra_info)

        # Convert only this page (0-based page number) to Markdown.
        text = pymupdf4llm.to_markdown(doc, [page_number])
        return LlamaIndexDocument(text=text, extra_info=extra_info)

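    def _process_meta(
        self,
        doc: FitzDocument,
        file_path: Union[Path, str],
        page_number: int,
        extra_info: Optional[Dict[str, Any]] = None,
    ) -> Dict[str, Any]:
        """Builds the metadata for one page of a PDF document."""
        # Work on a copy so pages do not share (and overwrite) one dict,
        # and so the default of None does not fail on update().
        extra_info = dict(extra_info or {})
        extra_info.update(doc.metadata)
        extra_info["page_number"] = f"{page_number+1}"
        extra_info["total_pages"] = len(doc)
        extra_info["file_path"] = str(file_path)

        self._clean_dict_in_place(extra_info)

        if self.parse_metadata:
            # Use the callback's return value, as its type annotation promises.
            extra_info = self.parse_metadata(extra_info)

        return extra_info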

    def _clean_dict_in_place(self, d: Dict[str, Any]) -> None:
        """Removes keys whose value is None or an empty string."""
        for k in list(d.keys()):
            if d[k] is None or d[k] == "":
                del d[k]
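
A note on the asynchronous path, as a minimal sketch using only the standard library. `convert_page` is a hypothetical stand-in for the per-page conversion; nothing here calls PyMuPDF. `asyncio.gather` returns results in the order the awaitables were passed, so pages come back in page order. Because the coroutine below never awaits anything, the pages are still processed one after another, which is also what happens when a coroutine simply calls a synchronous helper.

```python
import asyncio
from typing import List

call_order: List[int] = []


def convert_page(page_number: int) -> str:
    """Hypothetical stand-in for a blocking per-page conversion."""
    call_order.append(page_number)
    return f"page {page_number + 1}"


async def aconvert_page(page_number: int) -> str:
    # No await in here: the event loop cannot switch tasks, so each
    # coroutine runs to completion before the next one starts.
    return convert_page(page_number)


async def aconvert_all(total_pages: int) -> List[str]:
    tasks = [aconvert_page(n) for n in range(total_pages)]
    # gather preserves the order of the awaitables in its result list.
    return await asyncio.gather(*tasks)


results = asyncio.run(aconvert_all(3))
print(results)
print(call_order)
```

Real overlap would need the per-page work to yield to the event loop or run in worker threads or processes; whether that is safe depends on the library doing the conversion, so it is left out of this sketch.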

If you are interested in including this new LlamaIndex-compatible reader, I would be happy to create a pull request with the necessary changes.

Additionally, if you would like to include this reader in LlamaHub, I can provide any necessary assistance to make that happen. https://llamahub.ai/

Let me know what you think.

PS: Thank you for your work, I am getting very good results and scores in my evaluations. ✌️

JorjMcKie commented 6 months ago

Thank you very much for this suggestion and your favorable feedback indeed!

We assure you that we will start working on this now!

JorjMcKie commented 6 months ago

@YanSte - I have had a look at your prototype code and would like to come back to your friendly offer to submit a PR with this.

We will give it our full attention. When you do, may I also ask you to add a comment confirming that you have read and accept Artifex's CLA (contribution license agreement)? Link: https://artifex.com/contributor/

Thank you very much!

YanSte commented 6 months ago

Hey @JorjMcKie, thank you.

I'll start a pull request soon and make sure I've reviewed the contribution license agreement.

YanSte commented 6 months ago

Done ✅ > https://github.com/pymupdf/RAG/pull/7

tahitimoon commented 6 days ago

There is a bug here: the load_kwargs parameter is accepted but never used afterwards. It should be passed on to the to_markdown function.
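
A minimal sketch of the fix described above, assuming the intended behavior is to forward load_kwargs unchanged to the per-page conversion. To keep it runnable without PyMuPDF, `fake_to_markdown` is a hypothetical stand-in for the Markdown converter, and the `heading_prefix` keyword is invented for this example only.

```python
from typing import Any, Callable, List


def fake_to_markdown(doc: List[str], pages: List[int], **kwargs: Any) -> str:
    """Hypothetical stand-in for the Markdown converter."""
    prefix = kwargs.get("heading_prefix", "")
    return "\n".join(prefix + doc[n] for n in pages)


def load_pages(
    doc: List[str],
    to_markdown: Callable[..., str] = fake_to_markdown,
    **load_kwargs: Any,
) -> List[str]:
    """Convert each page separately, forwarding load_kwargs to the converter."""
    return [to_markdown(doc, [n], **load_kwargs) for n in range(len(doc))]


doc = ["Intro", "Details"]
print(load_pages(doc))
print(load_pages(doc, heading_prefix="# "))
```

In the prototype, the same idea would mean threading load_kwargs from load_data and aload_data through _process_doc_page to the to_markdown call.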

[Suggestion] PDFReader with LlamaIndex BaseReader and insertion in Llama Hub #4