Closed bbfrog closed 1 month ago
This script works and considers the full page:
import pymupdf4llm
import sys
import pathlib
filename = sys.argv[1]
md = pymupdf4llm.to_markdown(filename, margins=0)
pathlib.Path(filename + ".md").write_bytes(md.encode())
Please be aware that the default is margins=(0, 50, 0, 50). That setting disregards two stripes of height 50, one at the top and one at the bottom of each page, which may not always be desirable, so choose your own values if in doubt.
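To make the effect of the margins concrete, here is a minimal sketch in plain Python that computes which part of a page remains after the margins are cut away. It assumes the tuple order is (left, top, right, bottom) and that a single number applies to all four sides; kept_region is a hypothetical helper for illustration, not part of pymupdf4llm, and the 612 x 792 page size (US Letter in points) is just an example.

```python
def kept_region(page_width, page_height, margins=(0, 50, 0, 50)):
    """Return (x0, y0, x1, y1) of the page area left after removing margins.

    Assumption: margins are ordered (left, top, right, bottom); a single
    number is applied to all four sides.
    """
    if isinstance(margins, (int, float)):
        margins = (margins,) * 4
    left, top, right, bottom = margins
    return (left, top, page_width - right, page_height - bottom)


# With the default discussed above, the top and bottom 50 points are ignored.
print(kept_region(612, 792))
# With margins=0 the whole page is considered.
print(kept_region(612, 792, margins=0))
```

Text that sits inside the ignored stripes, such as a title near the top edge or the last table row near the bottom edge, would fall outside the returned rectangle under these assumptions.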
thanks very much!
Here is the example pdf for this problem: AAD_ESK001.pdf
On page 2, the first sentence of the title is dropped. I think it may be detected as a page header. On page 5, the last line of the table is dropped; maybe it is detected as a page footer?
Please let me know whether this is easy to fix, or whether there is an option to keep the page header and footer. Thanks very much in advance!