jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

find_tables cost several minutes for one page #807

Closed: buptyyf closed this issue 1 year ago

buptyyf commented 1 year ago

Describe the bug

page.find_tables() takes several minutes and uses 100% CPU.

import pdfplumber

pdfplumber_doc = pdfplumber.open(pdf_path)
pdfplumber_pages = pdfplumber_doc.pages
tables = pdfplumber_pages[20].find_tables()

PDF file

CB_7ZM3m03nICqS6eS6cf1hxEKk.pdf

Please attach any PDFs necessary to reproduce the problem.

jsvine commented 1 year ago

Hi @buptyyf, and thanks for your interest in this library. It seems that the PDF is malformed; if you repair it with Ghostscript, the code above executes nearly instantaneously.
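For reference, a common way to repair a malformed PDF with Ghostscript is to rewrite it through the `pdfwrite` device, which re-emits the file with a clean cross-reference table and object structure (the file names here are placeholders):

```shell
# Rewrite the malformed PDF through Ghostscript's pdfwrite device.
# -o sets the output file; -sDEVICE=pdfwrite selects PDF output.
gs -o repaired.pdf -sDEVICE=pdfwrite input.pdf
```

After this, opening `repaired.pdf` with pdfplumber should behave normally.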

buptyyf commented 1 year ago

> Hi @buptyyf, and thanks for your interest in this library. It seems that the PDF is malformed; if you repair it with Ghostscript, the code above executes nearly instantaneously.

Thanks a lot. How can I check pdf is malformed before I repair it?

jsvine commented 1 year ago

I don't have a definitive answer, unfortunately, but a web-search (e.g., for "validate pdf ghostscript") returns some results that make me think it's possible.
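One lightweight heuristic (a sketch, not a definitive validator, and written by hand here rather than taken from any library) is to check for the structural markers every well-formed PDF should carry: the `%PDF-` header at the start, a `startxref` pointer, and a `%%EOF` marker near the end of the file. A file can pass these surface checks and still be internally malformed, but a failure is a strong hint that repair is needed:

```python
def pdf_sanity_check(path):
    """Heuristic check for basic PDF file structure.

    Returns a list of problems found; an empty list means the file
    passes these surface checks (it may still be malformed internally).
    """
    with open(path, "rb") as f:
        data = f.read()

    problems = []
    if not data.startswith(b"%PDF-"):
        problems.append("missing %PDF- header")
    if b"startxref" not in data:
        problems.append("missing startxref pointer")
    # The %%EOF marker should appear within the last 1 KB of the file.
    if b"%%EOF" not in data[-1024:]:
        problems.append("missing %%EOF near end of file")
    return problems
```

For a stricter check, running the file through Ghostscript (e.g. rendering it to a null output device) and inspecting the exit code and error messages will catch many problems that this surface scan cannot.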

buptyyf commented 1 year ago

> I don't have a definitive answer, unfortunately, but a web-search (e.g., for "validate pdf ghostscript") returns some results that make me think it's possible.

I also used another PDF-parsing library, PyMuPDF, to parse this PDF. Parsing is very fast, and it doesn't report that the file is malformed.

jsvine commented 1 year ago

Yes, different PDF-parsing programs have different tolerances for malformed documents. In this case, pdfplumber depends on pdfminer.six for extracting the locations of PDF text and graphics, and pdfminer.six handles malformed PDFs differently than MuPDF (which is what PyMuPDF uses).