pdfminer / pdfminer.six

Community maintained fork of pdfminer - we fathom PDF
https://pdfminersix.readthedocs.io
MIT License

TypeError: 'PDFObjRef' object is not iterable #1004

Open corobin opened 1 month ago

corobin commented 1 month ago

After updating to version 20240706, calling extract_text() on a PDF raises TypeError: 'PDFObjRef' object is not iterable.

This did not occur with the previous version, 20231228.

Python 3.12.4 (tags/v3.12.4:8e8a4ba, Jun  6 2024, 19:30:16) [MSC v.1940 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license()" for more information.
>>> from pdfminer.high_level import extract_text
>>> text = extract_text("Working.pdf")
>>> text = extract_text("Error.pdf")
Traceback (most recent call last):
  File "<pyshell#21>", line 1, in <module>
    text = extract_text(path)
  File "C:\Program Files\Python312\Lib\site-packages\pdfminer\high_level.py", line 169, in extract_text
    for page in PDFPage.get_pages(
  File "C:\Program Files\Python312\Lib\site-packages\pdfminer\pdfpage.py", line 171, in get_pages
    for (pageno, page) in enumerate(cls.create_pages(doc)):
  File "C:\Program Files\Python312\Lib\site-packages\pdfminer\pdfpage.py", line 127, in create_pages
    yield cls(document, objid, tree, next(page_labels))
  File "C:\Program Files\Python312\Lib\site-packages\pdfminer\pdfpage.py", line 63, in __init__
    mediabox_params: List[Any] = [
TypeError: 'PDFObjRef' object is not iterable
>>>

Working.pdf - a newly created blank page made with Acrobat.

Error.pdf - downloaded; I cannot change how it is created. I deleted all visible text on the page, which did not appear to affect the error.

felixxm commented 1 month ago

We hit the same issue with next(high_level.extract_pages(pdf_page_path)) calls.

myhloli commented 1 month ago

Same error reported here: https://github.com/opendatalab/MinerU/issues/198

dhdaines commented 1 month ago

Probably need to call resolve1 on self.attrs["MediaBox"] as well... it's indirect objects all the way down...

MarcoPellicciottaOSF commented 1 month ago

> Probably need to call resolve1 on self.attrs["MediaBox"] as well... it's indirect objects all the way down...

I had the same error; using resolve1 fixed it for me.
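
For anyone following along, here is a minimal sketch of what the resolve1 suggestion above amounts to. It does not import pdfminer: ObjRef and this resolve1 are simplified stand-ins for pdfminer's PDFObjRef and resolve1, and mediabox_values is a hypothetical helper, not the actual PDFPage code. The assumption is that the MediaBox entry can itself be an indirect reference, so it has to be resolved before iterating over it, and each element resolved afterwards.

```python
from typing import Any, Dict, List


class ObjRef:
    """Stand-in for an indirect object reference (not pdfminer's class)."""

    def __init__(self, target: Any) -> None:
        self.target = target

    def resolve(self) -> Any:
        return self.target


def resolve1(x: Any) -> Any:
    """Follow indirect references until a direct object is reached."""
    while isinstance(x, ObjRef):
        x = x.resolve()
    return x


def mediabox_values(attrs: Dict[str, Any]) -> List[Any]:
    """Hypothetical helper: return the MediaBox numbers as direct values."""
    # Resolve the array itself first; it may be an indirect reference.
    box = resolve1(attrs["MediaBox"])
    # Then resolve each element, since any of them may be indirect too.
    return [resolve1(value) for value in box]


# A page whose MediaBox is a direct array, and one where the array
# and one of its elements are indirect references.
direct = {"MediaBox": [0, 0, 612, 792]}
indirect = {"MediaBox": ObjRef([0, 0, ObjRef(612), 792])}

print(mediabox_values(direct))
print(mediabox_values(indirect))
```

Iterating over indirect["MediaBox"] without resolving it first raises a TypeError saying the object is not iterable, which matches the shape of the failure in the traceback.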