jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Mysterious overflow error #1191

Closed by andersfylling 1 month ago

andersfylling commented 1 month ago

Describe the bug

Opening a PDF file with pdfplumber causes an overflow error. Unfortunately, I cannot share the file, as it no longer exists.

Expected behavior

I'm not certain what would be a fair expectation here.

Actual behavior

Traceback (most recent call last):
  File ...
    with pdfplumber.open(file) as pdf:
  File "/usr/local/lib/python3.10/site-packages/pdfplumber/pdf.py", line 135, in __exit__
    self.close()
  File "/usr/local/lib/python3.10/site-packages/pdfplumber/pdf.py", line 120, in close
    for page in self.pages:
  File "/usr/local/lib/python3.10/site-packages/pdfplumber/pdf.py", line 145, in pages
    for i, page in enumerate(PDFPage.create_pages(self.doc)):
  File "/usr/local/lib/python3.10/site-packages/pdfminer/pdfpage.py", line 133, in create_pages
    yield cls(document, objid, obj, next(page_labels))
  File "/usr/local/lib/python3.10/site-packages/pdfminer/pdfpage.py", line 72, in __init__
    contents = resolve1(self.attrs["Contents"])
  File "/usr/local/lib/python3.10/site-packages/pdfminer/pdftypes.py", line 118, in resolve1
    x = x.resolve(default=default)
  File "/usr/local/lib/python3.10/site-packages/pdfminer/pdftypes.py", line 106, in resolve
    return self.doc.getobj(self.objid)
  File "/usr/local/lib/python3.10/site-packages/pdfminer/pdfdocument.py", line 866, in getobj
    obj = self._getobj_parse(index, objid)
  File "/usr/local/lib/python3.10/site-packages/pdfminer/pdfdocument.py", line 840, in _getobj_parse
    (_, obj) = self._parser.nextobject()
  File "/usr/local/lib/python3.10/site-packages/pdfminer/psparser.py", line 656, in nextobject
    self.do_keyword(pos, token)
  File "/usr/local/lib/python3.10/site-packages/pdfminer/pdfparser.py", line 105, in do_keyword
    data = bytearray(self.fp.read(objlen))
OverflowError: cannot fit 'int' into an index-sized integer

Environment

jsvine commented 1 month ago

Based on the stack trace, the error you encountered appears to stem from pdfminer.six, pdfplumber's main dependency. Without access to the PDF file, it will be difficult to investigate further. Thank you, regardless, for your interest in pdfplumber.
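
For anyone who lands here with the same traceback: the last frame calls `self.fp.read(objlen)`. One possible cause is a stream length recorded in the PDF that is too large for Python to use as a read size. That is only a guess, since the file is unavailable. The sketch below shows that failure mode using the standard library alone. `read_declared_length` and `try_read` are hypothetical helpers written for illustration; they are not part of pdfminer.six or pdfplumber.

```python
import io


def read_declared_length(fp, objlen):
    """Mimic the shape of the failing call: read `objlen` bytes into a bytearray."""
    return bytearray(fp.read(objlen))


def try_read(fp, objlen):
    """Return the bytes read, or the OverflowError if the length is unusable."""
    try:
        return read_declared_length(fp, objlen)
    except OverflowError as exc:
        # Raised by read() itself when the size does not fit in an
        # index-sized integer, before any data is read.
        return exc


fp = io.BytesIO(b"stream data")

# A sane length reads normally.
ok = try_read(fp, 6)

# A length beyond the platform's index range (2**64 here, chosen only
# for illustration) makes read() raise OverflowError.
err = try_read(fp, 2**64)

print(ok)
print(type(err).__name__)
```

If this is what happened, the PDF itself was likely malformed or corrupted, and catching `OverflowError` around the `with pdfplumber.open(...)` block would at least let a batch job skip the file instead of crashing.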