Closed YanSte closed 6 months ago
Hello @JorjMcKie,
I added some tests to the project. Please be aware that Pytest is now a requirement for these to function correctly.
Additionally, I've encountered a few issues regarding the import process. I believe a type check import method could potentially resolve these errors.
Hi @YanSte Thank you so much for this! I am going to merge it now. I am embarrassed to ask for an additional favor: could you please confirm that you accept Artifex' CLA as mentioned?
A few more comments:
1. `PDFMardownReader` instead of `PDFMarkdownReader`? Just a typo I assume 😉?
2. `import pymupdf`. The old import of "fitz" remains supported though.

Hi @JorjMcKie
You are welcome.
Regarding the Artifex CLA, I confirm acceptance and I have already sent an email.
As for the markdown reader, you're absolutely correct—it was indeed a typo. It should be PDFMarkdownReader. 😄
Perfect 👍🏾
Pleasure of sharing.
Other questions:
~1. Struggling to understand the use of "extra_info". I suppose pymupdf4llm should take precautions to understand / react to whatever keys are present in that dictionary, right?~
`multiprocessing` module - which has better performance anyway. Do you see issues here?

Information only: For adequate identification of text headers, the markdown extraction performs a fast document-wide analysis before any page text extraction. As you are interested in separate Llama text documents per page, I have separated that header information scan from the `to_markdown()` method. The header info scan is now executed independently right after open, and its result is passed as a parameter to the `to_markdown()` method. This ensures that the scan is executed only once per document.
Hi @JorjMcKie
My question is will the parse by page functionality still be available? (If important for the metadata index of page)
And if yes, could we potentially introduce threading after the initial document parse to enable separate page parsing with thread maybe ?
Could you just give more details on why you removed the threading page?
I tried to understand.
> Could you just give more details on why you removed the threading page?
> I tried to understand.
PyMuPDF currently struggles with supporting Python's threading logic. The reason is MuPDF's internally maintained caching of large, frequently used objects. This cache (called "context area") is maintained inside MuPDF and its handling is not thread-safe. Multiple threads therefore would have to be supported by allocating a separate context area for each thread. This is possible technically (i.e. MuPDF does support context area duplication), but not yet implemented in PyMuPDF.
In contrast to threads, PyMuPDF has no problem with multiprocessing as mentioned. It can be much faster than threading (parallelism is handled by the operating system and not by the inferior Python threading). But multiprocessing comes with a few other restrictions, like elegantly joining the results of the processes and the requirement that the separate processes must be started from the process in "__main__".
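A minimal sketch of the multiprocessing pattern described above, assuming the work is split by page ranges. The worker here only builds placeholder strings so the example stays self-contained; in a real setup each worker would open the PDF itself inside its own process and convert its pages. The names `split_pages`, `convert_chunk` and `convert_parallel` are hypothetical and not part of PyMuPDF.

```python
import multiprocessing as mp


def split_pages(page_count, n_chunks):
    """Split range(page_count) into at most n_chunks contiguous page ranges."""
    size, extra = divmod(page_count, n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        stop = start + size + (1 if i < extra else 0)
        if stop > start:  # skip empty ranges when there are few pages
            chunks.append(range(start, stop))
        start = stop
    return chunks


def convert_chunk(job):
    """Worker executed in a separate process for one page range."""
    filename, pages = job
    # Placeholder work (assumption): a real worker would open `filename`
    # here, inside this process, and extract text for each page number.
    return [f"{filename}: page {pno}" for pno in pages]


def convert_parallel(filename, page_count, n_procs=2):
    """Run one job per page range and join the results in page order."""
    jobs = [(filename, pages) for pages in split_pages(page_count, n_procs)]
    with mp.Pool(processes=n_procs) as pool:
        parts = pool.map(convert_chunk, jobs)  # map() keeps the job order
    return [text for part in parts for text in part]


if __name__ == "__main__":
    # Processes must be started from the "__main__" guard.
    print(convert_parallel("example.pdf", page_count=5))
```

Because `Pool.map()` returns results in the order of the submitted jobs, joining the per-process results back into page order is a simple flattening step.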
> My question is will the parse by page functionality still be available? (If important for the metadata index of page)
That in itself is no problem at all. I have modified the `to_markdown()` method accordingly.
With this, `page_list = MDReader.load_data(filename)` is a list of type `List[LlamaIndexDocument]`, where `MDReader` is a `PDFMarkdownReader` object.
The performance is the same as that of the original `to_markdown()` method.
Perfect, thank you for your response.
Title
PDF Reader with Markdown Feature for LlamaIndex
Description
This pull request introduces a new feature: a PDF Markdown reader for LlamaIndex. The new `PDFMardownReader` class extends the `BaseReader` class from LlamaIndex and uses the PyMuPDF library to read PDF files and convert them to Markdown format.
Key Features
`PDFMardownReader` class that reads PDF files using the PyMuPDF library.
Installation
Please ensure you have 'llama_index' version.
Testing
Added testing of loading PDF.
Possible Improvements
Please review the code and let me know if there are any suggestions or improvements. I'm looking forward to your feedback!
Dependencies