Page header and footer detection is wrong

pymupdf / RAG

RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF

https://pymupdf.readthedocs.io/en/latest/pymupdf4llm

GNU Affero General Public License v3.0

Page header and footer detection is wrong #145

Closed: bbfrog closed this issue 2 months ago

bbfrog commented 2 months ago

Here is the example pdf for this problem: AAD_ESK001.pdf

On page 2, the first sentence of the title is dropped. I think it may have been detected as a page header? And on page 5, the last line of the table is dropped. Maybe it was detected as a page footer?

Please let me know whether this is easy to fix, or whether there is an option to keep the page header and footer. Thanks very much in advance!

JorjMcKie commented 2 months ago

This script works and considers the full page:

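import pymupdf4llm
import sys
import pathlib

# PDF file to convert, passed as the first command-line argument
filename = sys.argv[1]

# margins=0 makes the conversion consider the full page
md = pymupdf4llm.to_markdown(filename, margins=0)
# write the Markdown next to the input file, e.g. "input.pdf.md"
pathlib.Path(filename + ".md").write_bytes(md.encode())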

Please be aware that the default is margins=(0, 50, 0, 50), which ignores a stripe of height 50 at the top and another at the bottom of each page. Disregarding those two stripes may not always be desirable, so choose your own values if in doubt.
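To make the effect of the margins concrete, here is a minimal sketch of how a margins value maps to the part of the page that is kept. It assumes the (left, top, right, bottom) order implied by the default above, and that a single number applies to all four sides while a pair means (top, bottom), as I understand the pymupdf4llm documentation. The helpers `normalize_margins`, `kept_area` and `convert` are hypothetical names for illustration and are not part of the library.

```python
import pathlib


def normalize_margins(margins):
    """Expand a margins value to (left, top, right, bottom).

    Assumption: one number applies to all four sides and a pair means
    (top, bottom). Illustrative helper, not part of pymupdf4llm.
    """
    if isinstance(margins, (int, float)):
        return (float(margins),) * 4
    values = tuple(float(m) for m in margins)
    if len(values) == 2:
        top, bottom = values
        return (0.0, top, 0.0, bottom)
    if len(values) == 4:
        return values
    raise ValueError("margins must be a number or a sequence of 2 or 4 numbers")


def kept_area(page_width, page_height, margins):
    """Return (x0, y0, x1, y1) of the region left after removing the margins."""
    left, top, right, bottom = normalize_margins(margins)
    return (left, top, page_width - right, page_height - bottom)


def convert(filename, margins=0):
    """Convert a PDF to Markdown with the given margins; write filename + '.md'."""
    import pymupdf4llm  # imported lazily so the helpers above work without it

    md = pymupdf4llm.to_markdown(filename, margins=margins)
    out = pathlib.Path(filename + ".md")
    out.write_bytes(md.encode())
    return out
```

For example, on a 612 x 792 page, `kept_area(612, 792, (0, 50, 0, 50))` gives a region from y = 50 to y = 742, so a title line or a last table row outside that band would be skipped. Something like `convert("AAD_ESK001.pdf", margins=(0, 20, 0, 20))` would be a middle ground between the default and `margins=0`; the value 20 is only a guess and needs checking against the actual document.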

bbfrog commented 2 months ago

Thanks very much!