Closed hherb closed 3 months ago
Thank you for the example file.
The problem is caused by the last handful of pages, which contain a tremendous amount of vector graphics. Page 44 (0-based) in particular has over 50,000 drawings, which leads to extremely long response times while the drawing commands are clustered into larger graphics.
Even assuming this could be done fast enough, the gain in text content insight would be close to zero. So I suggest finding a threshold beyond which the page is ignored entirely, or perhaps replaced by its image.
If you change your extraction script to the following, you should avoid this type of problem at least for the time being:
import pymupdf, pymupdf4llm, pathlib

filename = "test.pdf"
doc = pymupdf.open(filename)
# create the header font size info as a separate step
hdr = pymupdf4llm.IdentifyHeaders(doc)
text = ""
for page in doc:
    paths = page.get_cdrawings()  # fast extraction of vector graphics
    if len(paths) > 5000:  # skip page with too many graphics
        print(f"Omitted {page.number=}")
        continue
    md_text = pymupdf4llm.to_markdown(doc, pages=[page.number], hdr_info=hdr)
    text += md_text
pathlib.Path("test.md").write_bytes(text.encode())
This extracts each page individually using the pages parameter. To avoid recomputing the header font size info on each invocation, we compute it once and pass it to each individual call.
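The "replaced by its image" alternative mentioned above could be sketched roughly as follows. This is only an illustration, not part of pymupdf4llm: the helper names choose_action and convert_with_image_fallback, the 5000 threshold, the 150 dpi value, and the image naming scheme are all assumptions. It also assumes that rendering a graphics-heavy page with page.get_pixmap() finishes in acceptable time, which has not been verified for the file in question.

```python
import pathlib


def choose_action(drawing_count, max_drawings=5000):
    """Hypothetical helper: decide how to handle a page by its drawing count."""
    return "image" if drawing_count > max_drawings else "markdown"


def convert_with_image_fallback(filename, max_drawings=5000, dpi=150):
    """Sketch: convert page by page, rendering graphics-heavy pages as PNG images."""
    # Imported here so that choose_action stays usable without PyMuPDF installed.
    import pymupdf
    import pymupdf4llm

    doc = pymupdf.open(filename)
    hdr = pymupdf4llm.IdentifyHeaders(doc)  # compute header info only once
    stem = pathlib.Path(filename).stem
    parts = []
    for page in doc:
        count = len(page.get_cdrawings())  # fast extraction of vector graphics
        if choose_action(count, max_drawings) == "image":
            # Assumed naming scheme for the rendered page image.
            image_name = f"{stem}-page-{page.number}.png"
            page.get_pixmap(dpi=dpi).save(image_name)
            parts.append(f"![page {page.number}]({image_name})\n\n")
        else:
            parts.append(
                pymupdf4llm.to_markdown(doc, pages=[page.number], hdr_info=hdr)
            )
    return "".join(parts)
```

Calling convert_with_image_fallback("test.pdf") would return the Markdown text, with an image link standing in for every page above the threshold. The right threshold probably depends on the documents at hand.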
Thank you!
I am converting a very large number of PDF documents. Amazing speed and quality, thanks guys!
However, every 20-100 documents, pymupdf4llm just hangs: no error code, no crash, it simply does not continue (even after hanging for several hours). I am using version 0.0.3, installed via "pip install -U pymupdf4llm".
logging.info(f"Calling pymupdf4llm with {pdf_filename} to markdown")
markdown_text = pymupdf4llm.to_markdown(pdf_filename)
# ---> never gets to here
logging.info(f"Finished converting {pdf_filename} to markdown")
The error is reproducible with e.g. the following PDF file: https://www.medrxiv.org/content/10.1101/2020.07.16.20153437v1.full.pdf but it happens with literally hundreds of such PDFs, which all open and display without any issues in any PDF viewer I tried. Sadly I know nothing about PDFs and how to address this issue.
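For batch jobs like this one, a possible stopgap is to run each conversion in a child process with a time limit, so that one hanging document cannot block the rest of the batch. The sketch below uses only the standard library's subprocess module. The names CHILD_CODE, run_with_timeout and convert_with_timeout, as well as the 300 second default, are made up for illustration and are not part of pymupdf4llm.

```python
import subprocess
import sys

# Code executed in the child process: convert one PDF and write the Markdown.
# With "python -c CODE a b", sys.argv[1] is "a" and sys.argv[2] is "b".
CHILD_CODE = (
    "import sys, pathlib, pymupdf4llm\n"
    "src, dst = sys.argv[1], sys.argv[2]\n"
    "pathlib.Path(dst).write_bytes(pymupdf4llm.to_markdown(src).encode())\n"
)


def run_with_timeout(cmd, timeout_s):
    """Run cmd in a child process; return False on timeout or non-zero exit."""
    try:
        # subprocess.run kills the child if the timeout expires.
        result = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def convert_with_timeout(pdf_filename, md_filename, timeout_s=300):
    """Hypothetical wrapper: convert one PDF, giving up after timeout_s seconds."""
    cmd = [sys.executable, "-c", CHILD_CODE, pdf_filename, md_filename]
    return run_with_timeout(cmd, timeout_s)
```

A batch loop could then log every file for which convert_with_timeout returns False and move on to the next one, which also yields a list of problem PDFs to report.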
This is the output when I interrupt the hanging process: