[PDFMardownReader] LlamaIndex Reader

YanSte commented 6 months ago

Title

PDF Reader with Markdown Feature for LlamaIndex

Description

This pull request introduces a new feature: a PDF Markdown reader for LlamaIndex. The new PDFMardownReader class extends LlamaIndex's BaseReader class and uses the PyMuPDF library to read PDF files and convert them to Markdown.

Key Features

Introduces a new PDFMardownReader class that reads PDF files using the PyMuPDF library.
Supports both synchronous and asynchronous loading of PDF documents.
Allows for the inclusion and filtering of metadata from the PDF files.
Converts each page of the PDF file to a separate Markdown document.
Raises appropriate errors for incorrect input types.
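
As a rough illustration of the per-page, metadata-aware behaviour listed above, here is a minimal sketch using only the standard library. It is not the code from this pull request: PageDocument is a hypothetical stand-in for a LlamaIndex document, load_pages and meta_filter are made-up names, and the page texts are passed in directly instead of being extracted with PyMuPDF.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Callable, Dict, List, Optional, Union


@dataclass
class PageDocument:
    """Hypothetical stand-in for a LlamaIndex document (text plus metadata)."""
    text: str
    extra_info: Dict[str, Any] = field(default_factory=dict)


def load_pages(
    file_path: Union[str, Path],
    page_texts: List[str],
    extra_info: Optional[Dict[str, Any]] = None,
    meta_filter: Optional[Callable[[Dict[str, Any]], Dict[str, Any]]] = None,
) -> List[PageDocument]:
    """Build one document per page, attaching (optionally filtered) metadata."""
    # Reject incorrect input types early, as the feature list describes.
    if not isinstance(file_path, (str, Path)):
        raise TypeError("file_path must be a string or Path.")
    if extra_info is not None and not isinstance(extra_info, dict):
        raise TypeError("extra_info must be a dictionary.")

    docs = []
    for number, text in enumerate(page_texts, start=1):
        # Start from the caller's metadata, then add per-page entries.
        meta = dict(extra_info or {})
        meta.update(file_path=str(file_path), page=number,
                    total_pages=len(page_texts))
        if meta_filter is not None:
            meta = meta_filter(meta)
        docs.append(PageDocument(text=text, extra_info=meta))
    return docs
```

In a real reader the page texts would come from the PDF itself; the sketch only shows how the type checks, the metadata filter, and the one-document-per-page split fit together.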

Installation

Please ensure you have 'llama_index' installed.

Testing

Added a test for loading a PDF.

Possible Improvements

Pytest needs to be set up for testing in the project. 👈
Improve the import check for LlamaIndex.
Possibly add dependency management if LlamaIndex is included.
Add more comprehensive tests for the new feature.
Improve error handling and validation checks.

Please review the code and let me know if there are any suggestions or improvements. I'm looking forward to your feedback!

Dependencies

PyMuPDF
llama_index

YanSte commented 6 months ago

Hello @JorjMcKie,

I added some tests to the project. Please be aware that Pytest is now a requirement for these to function correctly.

Additionally, I've encountered a few issues regarding the import process. I believe a type check import method could potentially resolve these errors.
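
One possible reading of a "type check import" is the typing.TYPE_CHECKING guard, which keeps an optional dependency out of the runtime import path while still allowing type annotations. The sketch below assumes that reading. The module path llama_index.core.schema and the helper names are assumptions for illustration, not code from this pull request.

```python
import importlib.util
from typing import TYPE_CHECKING, List

if TYPE_CHECKING:
    # Evaluated only by static type checkers, never at runtime, so the
    # module still imports cleanly when llama_index is not installed.
    # (Module path assumed; it may differ between llama_index versions.)
    from llama_index.core.schema import Document


def llama_index_available() -> bool:
    """Report whether llama_index can be imported, without importing it."""
    return importlib.util.find_spec("llama_index") is not None


def first_text(docs: "List[Document]") -> str:
    """Return the text of the first document, or an empty string."""
    # The annotation is a string, so Document is never looked up at runtime.
    return docs[0].text if docs else ""
```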

JorjMcKie commented 6 months ago

Hi @YanSte Thank you so much for this! I am going to merge it now. I am embarrassed to ask for an additional favor: could you please confirm that you accept Artifex' CLA as mentioned?

A few more comments:

Is there a reason to call the markdown reader PDFMardownReader instead of PDFMarkdownReader? Just a typo I assume 😉?
I will make a few minor adaptations. Among them is changing the import statement for PyMuPDF. Since the latest version, 1.24.3, was published this week, we can now use import pymupdf. The old import of "fitz" remains supported though.
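
For code that has to run against both older and newer PyMuPDF releases, the two import names mentioned above can be tried in order. This is a minimal sketch; the helper name import_first is made up for illustration.

```python
import importlib


def import_first(*names):
    """Return the first module in `names` that can be imported, else None."""
    for name in names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    return None


# Prefer the new module name and fall back to the legacy "fitz" name.
# The result is None when PyMuPDF is not installed at all.
pymupdf = import_first("pymupdf", "fitz")
```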

YanSte commented 6 months ago

Hi @JorjMcKie

You are welcome.

Regarding the Artifex CLA, I confirm acceptance and I have already sent an email.

As for the markdown reader, you're absolutely correct—it was indeed a typo. It should be PDFMarkdownReader. 😄

Perfect 👍🏾

It's a pleasure to share.

JorjMcKie commented 6 months ago

Other questions:

~1. Struggling to understand the use of "extra_info". I suppose pymupdf4llm should take precautions to understand / react to whatever keys are present in that dictionary, right?~

PyMuPDF does not support Python threading yet. I would therefore like to replace the asynchronous logic with using Python's multiprocessing module - which has better performance anyway. Do you see issues here?

Information only: For adequate identification of text headers, the markdown extraction performs a document-wide fast analysis before any page text extraction. As you are interested in separate Llama text documents per page, I have separated that header information scan from the to_markdown() method. The header info scan is now executed independently right after open, and its result is passed as a parameter to the to_markdown() method. This will ensure executing the scan only once per document.
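
The "scan once, then pass the result to every page conversion" idea can be sketched with stand-in data. This is not pymupdf4llm's implementation: the functions scan_headers and page_to_markdown, and the representation of a page as a list of (text, font_size) spans, are assumptions made purely for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical stand-in: a page is a list of (text, font_size) spans.
Page = List[Tuple[str, float]]


def scan_headers(pages: List[Page], max_levels: int = 3) -> Dict[float, str]:
    """Document-wide scan: map font sizes above the body size to header prefixes."""
    sizes: Counter = Counter()
    for page in pages:
        for text, size in page:
            sizes[size] += len(text)  # weight each size by character count
    if not sizes:
        return {}
    body_size = sizes.most_common(1)[0][0]  # most used size = body text
    larger = sorted((s for s in sizes if s > body_size), reverse=True)
    return {s: "#" * (lvl + 1) + " " for lvl, s in enumerate(larger[:max_levels])}


def page_to_markdown(page: Page, hdr_info: Dict[float, str]) -> str:
    """Convert one page, reusing the header info computed once per document."""
    return "\n".join(hdr_info.get(size, "") + text for text, size in page)


pages = [
    [("Title", 20.0), ("Body text here that is long", 10.0)],
    [("Section", 14.0), ("More body text, quite long too", 10.0)],
]
hdr_info = scan_headers(pages)  # executed once, right after "open"
markdown_pages = [page_to_markdown(p, hdr_info) for p in pages]
```

Scanning the whole document first matters for per-page output: if the second page were scanned in isolation, its 14.0 font would be the largest one seen and would be promoted to a top-level header.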

YanSte commented 6 months ago

Hi @JorjMcKie

My question is: will the parse-by-page functionality still be available? (It is important for the page index in the metadata.)

And if so, could we potentially introduce threading after the initial document parse, so that the pages are parsed separately in threads?

YanSte commented 6 months ago

Could you just give more details on why you removed the threading page?

I tried to understand.

JorjMcKie commented 6 months ago

> Could you just give more details on why you removed the threading page?
>
> I tried to understand.

PyMuPDF currently struggles with supporting Python's threading logic. The reason is MuPDF's internally maintained caching of large, frequently used objects. This cache (called "context area") is maintained inside MuPDF and its handling is not thread-safe. Multiple threads therefore would have to be supported by allocating a separate context area for each thread. This is possible technically (i.e. MuPDF does support context area duplication), but not yet implemented in PyMuPDF.

In contrast to threads, PyMuPDF has no problem with multiprocessing, as mentioned. It can be much faster than threading (parallelism is handled by the operating system and not by the inferior Python threading). But multiprocessing comes with a few complications of its own, such as joining the results of the processes elegantly and the requirement that the separate processes be started from the process running "__main__".
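
The two multiprocessing points above (joining the results, and starting the workers from "__main__") can be sketched with the standard library alone. The worker below is a placeholder that returns dummy strings; in real use, each process would open the document itself and extract its own page range. All function names and the file name are hypothetical.

```python
import multiprocessing as mp


def split_ranges(page_count, parts):
    """Split range(page_count) into up to `parts` contiguous (start, stop) chunks."""
    if page_count <= 0:
        return []
    parts = max(1, min(parts, page_count))
    size, extra = divmod(page_count, parts)
    ranges, start = [], 0
    for i in range(parts):
        stop = start + size + (1 if i < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def convert_range(task):
    """Worker: runs in a separate process and handles one page range."""
    filename, start, stop = task
    # Placeholder work. A real worker would open `filename` here, inside
    # the process, because open documents cannot be shared across processes.
    return [(pno, f"page {pno} of {filename}") for pno in range(start, stop)]


def convert_parallel(filename, page_count, processes=2):
    """Fan the page ranges out to a pool and join the results in page order."""
    tasks = [(filename, a, b) for a, b in split_ranges(page_count, processes)]
    with mp.Pool(processes) as pool:
        chunks = pool.map(convert_range, tasks)  # map preserves task order
    return [item for chunk in chunks for item in chunk]


if __name__ == "__main__":
    # The pool must be started from the main module's guarded section.
    for pno, text in convert_parallel("example.pdf", 5):
        print(pno, text)
```

Because pool.map returns the chunks in the order the tasks were submitted, flattening them is enough to restore page order.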

> My question is: will the parse-by-page functionality still be available? (It is important for the page index in the metadata.)

That in itself is no problem at all. I have modified the to_markdown() method accordingly. With this, page_list = MDReader.load_data(filename) returns a List[LlamaIndexDocument] with one document per page, where MDReader is a PDFMarkdownReader object. The performance is the same as that of the original to_markdown() method.

YanSte commented 6 months ago

Perfect, thank you for your response.
