pymupdf / RAG

RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF
https://pymupdf.readthedocs.io/en/latest/pymupdf4llm
GNU Affero General Public License v3.0

[PDFMardownReader] LlamaIndex Reader #7

Closed YanSte closed 6 months ago

YanSte commented 6 months ago

Title

PDF Reader with Markdown Feature for LlamaIndex

Description

This pull request introduces a new feature: a PDF Markdown reader for LlamaIndex. The new PDFMardownReader class extends the BaseReader class from LlamaIndex and uses the PyMuPDF library to read PDF files and convert them to Markdown format.

Key Features

Installation

Please ensure you have 'llama_index' version.

Testing

Added a test for loading a PDF.

Possible Improvements

Please review the code and let me know if there are any suggestions or improvements. I'm looking forward to your feedback!

Dependencies

YanSte commented 6 months ago

Hello @JorjMcKie,

I added some tests to the project. Please be aware that Pytest is now a requirement for these to function correctly.

Additionally, I've encountered a few issues regarding the import process. I believe a type check import method could potentially resolve these errors.

JorjMcKie commented 6 months ago

Hi @YanSte Thank you so much for this! I am going to merge it now. I am embarrassed to ask for an additional favor: could you please confirm that you accept Artifex' CLA as mentioned?

A few more comments:

YanSte commented 6 months ago

Hi @JorjMcKie

You are welcome.

Regarding the Artifex CLA, I confirm acceptance and I have already sent an email.

As for the markdown reader, you're absolutely correct—it was indeed a typo. It should be PDFMarkdownReader. 😄

Perfect 👍🏾

It's a pleasure to share.

JorjMcKie commented 6 months ago

Other questions:

~1. Struggling to understand the use of "extra_info". I suppose pymupdf4llm should take precautions to understand / react to whatever keys are present in that dictionary, right?~

  1. PyMuPDF does not support Python threading yet. I would therefore like to replace the asynchronous logic with Python's multiprocessing module, which has better performance anyway. Do you see any issues here?

Information only: For adequate identification of text headers, the markdown extraction performs a document-wide fast analysis before any page text extraction. As you are interested in separate Llama text documents per page, I have separated that header information scan from the to_markdown() method. The header info scan is now executed independently right after open, and its result is passed as a parameter to the to_markdown() method. This will ensure executing the scan only once per document.
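To illustrate the "scan once, convert per page" split described above, here is a minimal toy sketch. It is not the pymupdf4llm implementation: the function names `scan_header_sizes` and `page_to_markdown` are hypothetical, and each page is modeled as a plain list of `(font_size, text)` tuples instead of a real PDF page.

```python
from collections import Counter


def scan_header_sizes(pages, max_levels=3):
    """Document-wide scan (hypothetical): map large font sizes to header prefixes."""
    # Count how often each font size occurs across the whole document.
    counts = Counter(size for page in pages for size, _ in page)
    # Assume the most frequent size is body text.
    body_size = counts.most_common(1)[0][0]
    # Sizes larger than body text become header levels, largest first.
    larger = sorted((s for s in counts if s > body_size), reverse=True)
    return {s: "#" * (level + 1) + " " for level, s in enumerate(larger[:max_levels])}


def page_to_markdown(page, header_info):
    """Convert one page, reusing the header info computed for the whole document."""
    return "\n".join(header_info.get(size, "") + text for size, text in page)


pages = [
    [(20.0, "Title"), (11.0, "Intro text.")],
    [(14.0, "Section"), (11.0, "More text."), (11.0, "Even more.")],
]

header_info = scan_header_sizes(pages)  # executed only once per document
per_page_markdown = [page_to_markdown(page, header_info) for page in pages]
```

In this toy model, scanning the second page in isolation would rank "Section" as a top-level header, because the larger title size never appears on that page. Running the scan over the whole document first keeps header levels consistent across the per-page outputs.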

YanSte commented 6 months ago

Hi @JorjMcKie

My question is: will the parse-by-page functionality still be available? (It is important for the page metadata index.)

And if so, could we introduce threading after the initial document parse, so that pages are parsed in separate threads?

YanSte commented 6 months ago

Could you just give more details on why you removed the threading page?

I tried to understand.

JorjMcKie commented 6 months ago

> Could you just give more details on why you removed the threading page?
>
> I tried to understand.

PyMuPDF currently struggles with supporting Python's threading logic. The reason is MuPDF's internally maintained caching of large, frequently used objects. This cache (called "context area") is maintained inside MuPDF and its handling is not thread-safe. Multiple threads therefore would have to be supported by allocating a separate context area for each thread. This is possible technically (i.e. MuPDF does support context area duplication), but not yet implemented in PyMuPDF.

In contrast to threads, PyMuPDF has no problem with multiprocessing, as mentioned. It can be much faster than threading (parallelism is handled by the operating system and not by the inferior Python threading). But multiprocessing comes with a few complications of its own, such as joining the results of the processes elegantly and the requirement that the separate processes must be started from the process in "__main__".
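The two multiprocessing points above (joining results, starting from "__main__") can be sketched with the standard library alone. This is only an illustration under stated assumptions: `convert_page` and `convert_all` are hypothetical names, and the worker returns placeholder text instead of real Markdown. A real worker would presumably open the PDF itself inside the child process.

```python
import multiprocessing as mp


def convert_page(page_number: int) -> tuple[int, str]:
    """Hypothetical per-page worker; returns placeholder text, not real Markdown."""
    return page_number, f"## Page {page_number + 1}\n"


def convert_all(page_count: int, workers: int = 2) -> list[str]:
    """Convert pages in separate processes and join the results in page order."""
    with mp.Pool(processes=workers) as pool:
        # Pool.map returns results in input order, so no manual re-sorting is needed.
        results = pool.map(convert_page, range(page_count))
    return [text for _, text in results]


if __name__ == "__main__":
    # Worker processes are started only from the "__main__" process.
    for chunk in convert_all(page_count=4):
        print(chunk, end="")
```

The `if __name__ == "__main__":` guard matters because, depending on the start method, child processes re-import the main module. Without the guard, each child would try to start its own pool.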


> My question is: will the parse-by-page functionality still be available? (It is important for the page metadata index.)

That in itself is no problem at all. I have modified the to_markdown() method accordingly. With this, page_list = MDReader.load_data(filename) returns a List[LlamaIndexDocument] with one entry per page, where MDReader is a PDFMarkdownReader object. The performance is the same as that of the original to_markdown() method.

YanSte commented 6 months ago

Perfect, thank you for your response.