pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

This PDF causes a stack overflow exception #3596

Closed: xsank closed this issue 1 week ago

xsank commented 1 week ago

Description of the bug

2008.07542.pdf

Extracting text from page 4 (index i = 3) raises the error below:

  File "/Users/xsank/Work/code/chatos/pdf-extracter/core/parsers/pdf/pax_parser.py", line 52, in parse
    block_dict[i] = doc[i].get_text("dict")["blocks"]
  File "/Users/xsank/opt/anaconda3/envs/pdf-extracter/lib/python3.9/site-packages/fitz_new/utils.py", line 802, in get_text
    tp = page.get_textpage(clip=clip, flags=flags)
  File "/Users/xsank/opt/anaconda3/envs/pdf-extracter/lib/python3.9/site-packages/fitz_new/__init__.py", line 8959, in get_textpage
    textpage = self._get_textpage(clip, flags=flags, matrix=matrix)
  File "/Users/xsank/opt/anaconda3/envs/pdf-extracter/lib/python3.9/site-packages/fitz_new/__init__.py", line 7663, in _get_textpage
    ll_tpage = extra.page_get_textpage(self.this, clip, flags, matrix)
  File "/Users/xsank/opt/anaconda3/envs/pdf-extracter/lib/python3.9/site-packages/fitz_new/extra.py", line 192, in page_get_textpage
    return _extra.page_get_textpage(_self, clip, flags, matrix)
RuntimeError: code=2: exception stack overflow!

How to reproduce the bug

doc[3].get_text("dict")["blocks"]

PyMuPDF version

1.23.x or earlier

Operating system

Linux

Python version

3.9

JorjMcKie commented 1 week ago

This PDF is corrupt. The object at xref 11422 is invalid, which prevents text extraction. Check pymupdf.TOOLS.mupdf_warnings() to confirm. Rendering the page is still possible, although my Adobe Acrobat refuses to even show this page.
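
A rough sketch of the check suggested above: extract blocks page by page, catch the RuntimeError seen in the traceback, and record whatever MuPDF logged for the failing page. The names extract_blocks, extract_from_file and the get_warnings parameter are made up for illustration. The import fallback is an assumption about which module name a given install provides, and other PyMuPDF versions may raise a different exception type.

```python
def extract_blocks(doc, get_warnings=None):
    """Extract text blocks per page, skipping pages that fail.

    `doc` is anything indexable by page number whose pages offer
    get_text("dict"), e.g. an opened PyMuPDF document.
    `get_warnings` is an optional zero-argument callable returning the
    accumulated MuPDF warnings as a string (hypothetical parameter).
    """
    blocks, failures = {}, {}
    for i in range(len(doc)):
        try:
            blocks[i] = doc[i].get_text("dict")["blocks"]
        except RuntimeError as exc:  # exception type seen in the traceback
            detail = get_warnings() if get_warnings else ""
            failures[i] = (str(exc), detail)
    return blocks, failures


def extract_from_file(path):
    """Open `path` with PyMuPDF and run extract_blocks on it."""
    try:
        import pymupdf  # module name in newer releases
    except ImportError:
        import fitz as pymupdf  # assumption: older releases use this name
    doc = pymupdf.open(path)
    return extract_blocks(doc, pymupdf.TOOLS.mupdf_warnings)
```

With the attached file, extract_from_file("2008.07542.pdf") would be expected to list index 3 under failures along with the stored warnings, though that has not been verified here.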

xsank commented 1 week ago

But both macOS Preview and Chrome can display the page.

JorjMcKie commented 1 week ago

> But both macOS Preview and Chrome can display the page.

That is what I wrote: you can still render the page!
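
For reference, rendering a single page to an image might look like the sketch below. The helper name render_page, the output path and the default of 150 dpi are illustrative choices, not anything prescribed in this thread. It assumes `doc` is an already opened PyMuPDF document.

```python
def render_page(doc, index, out_path, dpi=150):
    """Rasterize one page and save it as an image file.

    `doc` is an opened PyMuPDF document; `index` is 0-based.
    Returns the (width, height) of the rendered pixmap in pixels.
    """
    page = doc[index]
    pix = page.get_pixmap(dpi=dpi)  # rasterizes the page; no text page is built
    pix.save(out_path)              # image format is taken from the file extension
    return pix.width, pix.height
```

For the page discussed here, a call such as render_page(doc, 3, "page4.png") would write the rendered page to disk.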