pymupdf / RAG

RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF
https://pymupdf.readthedocs.io/en/latest/pymupdf4llm
GNU Affero General Public License v3.0

pymupdf4llm. #50

Closed hherb closed 3 months ago

hherb commented 3 months ago

I am converting a very large number of PDF documents. Amazing speed and quality, thanks guys!

However, every 20-100 documents, pymupdf4llm just hangs - no error code, no crash, it just does not continue (even after several hours of hanging). I am using version 0.0.3 via "pip install -U pymupdf4llm".

logging.info(f"Calling pymupdf4llm with {pdf_filename} to markdown")
markdown_text = pymupdf4llm.to_markdown(pdf_filename)
# ---> never gets to here:
logging.info(f"Finished converting {pdf_filename} to markdown")

The error is reproducible with e.g. the following PDF file: https://www.medrxiv.org/content/10.1101/2020.07.16.20153437v1.full.pdf, but it happens with literally hundreds of such PDFs, all of which open and display without any issues in every PDF viewer I tried. Sadly, I know nothing about PDFs or how to address this issue.

This is the output when I interrupt the hanging process:

import pymupdf4llm as pymu
md = pymu.to_markdown("/Users/hherb/Downloads/2020.07.16.20153437v1.full.pdf")
^CTraceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/hherb/anaconda3/envs/medai/lib/python3.12/site-packages/pymupdf4llm/helpers/pymupdf_rag.py", line 544, in to_markdown
    page_output, images, tables, graphics = get_page_output(
                                            ^^^^^^^^^^^^^^^^
  File "/Users/hherb/anaconda3/envs/medai/lib/python3.12/site-packages/pymupdf4llm/helpers/pymupdf_rag.py", line 516, in get_page_output
    md_string += write_text(
                 ^^^^^^^^^^^
  File "/Users/hherb/anaconda3/envs/medai/lib/python3.12/site-packages/pymupdf4llm/helpers/pymupdf_rag.py", line 214, in write_text
    nlines = get_raw_lines(textpage, clip=clip, tolerance=3)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/hherb/anaconda3/envs/medai/lib/python3.12/site-packages/pymupdf4llm/helpers/get_text_lines.py", line 62, in get_raw_lines
    for b in textpage.extractDICT()["blocks"]
             ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/hherb/anaconda3/envs/medai/lib/python3.12/site-packages/pymupdf/__init__.py", line 12502, in extractDICT
    val = self._textpage_dict(raw=False)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/hherb/anaconda3/envs/medai/lib/python3.12/site-packages/pymupdf/__init__.py", line 12442, in _textpage_dict
    self._getNewBlockList(page_dict, raw)
  File "/Users/hherb/anaconda3/envs/medai/lib/python3.12/site-packages/pymupdf/__init__.py", line 12438, in _getNewBlockList
    JM_make_textpage_dict(self.this, page_dict, raw)
  File "/Users/hherb/anaconda3/envs/medai/lib/python3.12/site-packages/pymupdf/__init__.py", line 16471, in JM_make_textpage_dict
    return extra.JM_make_textpage_dict(tp.m_internal, page_dict, raw)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/hherb/anaconda3/envs/medai/lib/python3.12/site-packages/pymupdf/extra.py", line 189, in JM_make_textpage_dict
    return _extra.JM_make_textpage_dict(tp, page_dict, raw)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyboardInterrupt

JorjMcKie commented 3 months ago

Thank you for the example file.

The problem is caused by the last handful of pages. They contain a tremendous amount of vector graphics. Especially page 44 (0-based) - there are over 50,000 drawings on this page, which leads to extremely long response times in the effort to cluster the drawing commands into larger graphics.

Even assuming this could be done fast enough, the gain in terms of text content would be close to zero. So I suggest finding a threshold beyond which the page is ignored entirely, or maybe replaced by its image.

If you change your extraction script to the following, you should avoid this type of problem at least for the time being:

import pymupdf, pymupdf4llm, pathlib

filename = "test.pdf"
doc = pymupdf.open(filename)

# create the header font size info as a separate step
hdr = pymupdf4llm.IdentifyHeaders(doc)

text = ""
for page in doc:
    paths = page.get_cdrawings()  # fast extraction of vector graphics
    if len(paths) > 5000:  # skip page with too many graphics
        print(f"Omitted {page.number=}")
        continue
    md_text = pymupdf4llm.to_markdown(doc, pages=[page.number], hdr_info=hdr)
    text += md_text

pathlib.Path("test.md").write_bytes(text.encode())

So this extracts the document one page at a time via the pages parameter. To avoid the cost of recomputing the header font size info on every invocation, we compute it once up front and pass it to each single-page call via hdr_info.

hherb commented 3 months ago

Thank you!