pymupdf / RAG

RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF
https://pymupdf.readthedocs.io/en/latest/pymupdf4llm
GNU Affero General Public License v3.0

Ignore the header and footer of PDF #6

Closed. difonjohaiv closed this issue 4 months ago.

difonjohaiv commented 4 months ago

Thanks for your great open-source work. Pymupdf4llm helped me a lot.

I found that when PDF files are converted to Markdown, the page headers and footers end up in the generated Markdown files. However, headers and footers are sometimes irrelevant noise.

Do you have any plans to add a parameter to to_markdown that specifies whether the headers and footers of the original PDF are retained in the generated Markdown file?

JorjMcKie commented 4 months ago

Thank you for your appreciation!

The methods behind .to_markdown use standard PyMuPDF functions for text and vector graphics extraction. Additional logic on top of these includes bringing table and non-table rectangles into the right sequence - plus some code to generate the MD output formatting, of course.

This part of the code is fairly new - I am aware of enhancement potential already and would welcome suggestions. What I think would not make sense is a full-blown document-wide page analysis before starting the actual conversion.

I am sure you are aware that the PDF specification knows nothing about things like headers or footers (well, at least these specifications are ignored by 99.9% of all PDF creators). It's all just text.

What we could do is offer a "margin" parameter or similar, maybe top=72, bot=36, to ignore 1 inch at the top and 0.5 inch at the bottom of each page (PDF coordinates are measured in points, 72 per inch).
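To make the margin idea concrete, here is a minimal sketch of how such a filter might work. It is not the code behind to_markdown: the function name drop_margin_blocks and the top / bot parameters are hypothetical, taken from the suggestion above. It assumes text blocks shaped like the leading fields of PyMuPDF's page.get_text("blocks") output (x0, y0, x1, y1, text), with y growing downwards from the top of the page.

```python
from typing import List, Tuple

# A text block as (x0, y0, x1, y1, text). This mirrors the leading fields of
# the tuples PyMuPDF returns from page.get_text("blocks"). Coordinates are in
# points (72 per inch), with y growing downwards from the top of the page.
Block = Tuple[float, float, float, float, str]


def drop_margin_blocks(blocks: List[Block], page_height: float,
                       top: float = 72, bot: float = 36) -> List[Block]:
    """Keep only blocks lying fully between the top and bottom margins.

    `top` and `bot` are hypothetical parameter names borrowed from the
    proposal in this thread, not a confirmed to_markdown option.
    """
    y_min = top                  # blocks starting above this count as header
    y_max = page_height - bot    # blocks ending below this count as footer
    return [b for b in blocks if b[1] >= y_min and b[3] <= y_max]


# Illustrative data only: a US Letter page is 612 x 792 points.
blocks = [
    (72, 30, 540, 45, "ACME Corp. Annual Report"),  # inside the top margin
    (72, 100, 540, 400, "Body text of the page."),  # page body
    (72, 770, 540, 785, "Page 3 of 12"),            # inside the bottom margin
]
kept = drop_margin_blocks(blocks, page_height=792)
print([b[4] for b in kept])
```

The filter only compares coordinates of the current page, so it needs no document-wide analysis. The trade-off is that a fixed margin cannot tell a real header from body text that happens to start high on the page.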

What do you think?

difonjohaiv commented 4 months ago

Thank you so much for sharing these details about the to_markdown method.

I think your idea of ignoring the header and footer areas by providing a "margin" parameter is a good solution, if it does not cause extra processing time.

Looking forward to your amazing work!