jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License
6.1k stars 625 forks source link

TypeError: argument of type 'PDFObjRef' is not iterable #935

Open caolf opened 1 year ago

caolf commented 1 year ago

Describe the bug

Calling extract_tables(table_settings=table_settings) on page 3 raises TypeError: argument of type 'PDFObjRef' is not iterable, while pages 1 and 2 work fine.
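
One way to narrow down which pages trigger the error is to extract page by page and record failures instead of letting the first exception abort the run. This is a minimal sketch, assuming each page object exposes extract_tables(table_settings=...) as pdfplumber pages do. The helper name extract_tables_per_page is hypothetical and not part of pdfplumber.

```python
from typing import Any, Dict, Iterable, List, Optional, Tuple


def extract_tables_per_page(
    pages: Iterable[Any],
    table_settings: Optional[Dict[str, Any]] = None,
) -> Tuple[Dict[int, List[Any]], Dict[int, str]]:
    """Hypothetical helper: collect tables per page and note pages that raise.

    `pages` is any iterable of objects with an
    `extract_tables(table_settings=...)` method, for example `pdf.pages`
    from `pdfplumber.open(...)`.
    """
    tables: Dict[int, List[Any]] = {}
    failures: Dict[int, str] = {}
    for number, page in enumerate(pages, start=1):
        try:
            tables[number] = page.extract_tables(table_settings=table_settings)
        except TypeError as exc:
            # Record the failing page (1-based) instead of aborting the run.
            failures[number] = str(exc)
    return tables, failures
```

With pdfplumber this could be called as extract_tables_per_page(pdf.pages, table_settings) inside a with pdfplumber.open(...) block. The failures mapping then shows which page numbers hit the TypeError.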

Code to reproduce the problem

image

PDF file

Please attach any PDFs necessary to reproduce the problem.

If you need to redact text in a sensitive PDF, you can run it through JoshData/pdf-redactor.

Screenshots

pdfplumberlib.py:293:


/Users/caolf/Library/Caches/pypoetry/virtualenvs/python-demo-8KG8_SfQ-py3.11/lib/python3.11/site-packages/pdfplumber/page.py:300: in extract_tables
    tables = self.find_tables(tset)
/Users/caolf/Library/Caches/pypoetry/virtualenvs/python-demo-8KG8_SfQ-py3.11/lib/python3.11/site-packages/pdfplumber/page.py:294: in find_tables
    return TableFinder(self, tset).tables
/Users/caolf/Library/Caches/pypoetry/virtualenvs/python-demo-8KG8_SfQ-py3.11/lib/python3.11/site-packages/pdfplumber/table.py:570: in __init__
    self.edges = self.get_edges()
/Users/caolf/Library/Caches/pypoetry/virtualenvs/python-demo-8KG8_SfQ-py3.11/lib/python3.11/site-packages/pdfplumber/table.py:600: in get_edges
    words = self.page.extract_words(**(settings.text_settings or {}))
/Users/caolf/Library/Caches/pypoetry/virtualenvs/python-demo-8KG8_SfQ-py3.11/lib/python3.11/site-packages/pdfplumber/page.py:356: in extract_words
    return utils.extract_words(self.chars, **kwargs)
/Users/caolf/Library/Caches/pypoetry/virtualenvs/python-demo-8KG8_SfQ-py3.11/lib/python3.11/site-packages/pdfplumber/container.py:50: in chars
    return self.objects.get("char", [])
/Users/caolf/Library/Caches/pypoetry/virtualenvs/python-demo-8KG8_SfQ-py3.11/lib/python3.11/site-packages/pdfplumber/page.py:215: in objects
    self._objects: Dict[str, T_obj_list] = self.parse_objects()
/Users/caolf/Library/Caches/pypoetry/virtualenvs/python-demo-8KG8_SfQ-py3.11/lib/python3.11/site-packages/pdfplumber/page.py:275: in parse_objects
    for obj in self.iter_layout_objects(self.layout._objs):
/Users/caolf/Library/Caches/pypoetry/virtualenvs/python-demo-8KG8_SfQ-py3.11/lib/python3.11/site-packages/pdfplumber/page.py:161: in layout
    interpreter.process_page(self.page_obj)
/Users/caolf/Library/Caches/pypoetry/virtualenvs/python-demo-8KG8_SfQ-py3.11/lib/python3.11/site-packages/pdfminer/pdfinterp.py:997: in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
/Users/caolf/Library/Caches/pypoetry/virtualenvs/python-demo-8KG8_SfQ-py3.11/lib/python3.11/site-packages/pdfminer/pdfinterp.py:1014: in render_contents
    self.init_resources(resources)
/Users/caolf/Library/Caches/pypoetry/virtualenvs/python-demo-8KG8_SfQ-py3.11/lib/python3.11/site-packages/pdfminer/pdfinterp.py:384: in init_resources
    self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
/Users/caolf/Library/Caches/pypoetry/virtualenvs/python-demo-8KG8_SfQ-py3.11/lib/python3.11/site-packages/pdfminer/pdfinterp.py:234: in get_font
    font = self.get_font(None, subspec)
/Users/caolf/Library/Caches/pypoetry/virtualenvs/python-demo-8KG8_SfQ-py3.11/lib/python3.11/site-packages/pdfminer/pdfinterp.py:225: in get_font
    font = PDFCIDFont(self, spec)
/Users/caolf/Library/Caches/pypoetry/virtualenvs/python-demo-8KG8_SfQ-py3.11/lib/python3.11/site-packages/pdfminer/pdffont.py:1072: in __init__
    ttf = TrueTypeFont(self.basefont, BytesIO(self.fontfile.get_data()))
/Users/caolf/Library/Caches/pypoetry/virtualenvs/python-demo-8KG8_SfQ-py3.11/lib/python3.11/site-packages/pdfminer/pdftypes.py:396: in get_data
    self.decode()


self = <PDFStream(119): raw=64251, {'Length': 64251, 'Filter': /'FlateDecode', 'DecodeParms': , 'Length1': 214528}>

def decode(self) -> None:
    assert self.data is None and self.rawdata is not None, str(
        (self.data, self.rawdata)
    )
    data = self.rawdata
    if self.decipher:
        # Handle encryption
        assert self.objid is not None
        assert self.genno is not None
        data = self.decipher(self.objid, self.genno, data, self.attrs)
    filters = self.get_filters()
    if not filters:
        self.data = data
        self.rawdata = None
        return
    for (f, params) in filters:
        if f in LITERALS_FLATE_DECODE:
            # will get errors if the document is encrypted.
            try:
                data = zlib.decompress(data)

            except zlib.error as e:
                if settings.STRICT:
                    error_msg = "Invalid zlib bytes: {!r}, {!r}".format(e, data)
                    raise PDFException(error_msg)

                try:
                    data = decompress_corrupted(data)
                except zlib.error:
                    data = b""

        elif f in LITERALS_LZW_DECODE:
            data = lzwdecode(data)
        elif f in LITERALS_ASCII85_DECODE:
            data = ascii85decode(data)
        elif f in LITERALS_ASCIIHEX_DECODE:
            data = asciihexdecode(data)
        elif f in LITERALS_RUNLENGTH_DECODE:
            data = rldecode(data)
        elif f in LITERALS_CCITTFAX_DECODE:
            data = ccittfaxdecode(data, params)
        elif f in LITERALS_DCT_DECODE:
            # This is probably a JPG stream
            # it does not need to be decoded twice.
            # Just return the stream to the user.
            pass
        elif f in LITERALS_JBIG2_DECODE:
            pass
        elif f in LITERALS_JPX_DECODE:
            pass
        elif f == LITERAL_CRYPT:
            # not yet..
            raise PDFNotImplementedError("/Crypt filter is unsupported")
        else:
            raise PDFNotImplementedError("Unsupported filter: %r" % f)
        # apply predictors
        if params and "Predictor" in params:

E TypeError: argument of type 'PDFObjRef' is not iterable

Environment

looking forward to your help! Thanks

caolf commented 1 year ago

image

caolf commented 1 year ago

@jsvine looking forward to your help! Thanks

cmdlineluser commented 1 year ago

Hi @caolf

Just thought I'd add some info:

This seems to have come up before in #316.

Although it seems in this case, the exception is coming from the underlying pdfminer library:

/Users/caolf/Library/Caches/pypoetry/virtualenvs/python-demo-8KG8_SfQ-py3.11
/lib/python3.11/site-packages/pdfminer/pdftypes.py:396:
                              ^^^^^^^^

https://github.com/pdfminer/pdfminer.six/issues/495 seems to be the same bug.

samkit-jain commented 1 year ago

Thanks for reporting, @caolf. Could you please also share the PDF that triggers the issue?

caolf commented 1 year ago

@samkit-jain I'm sorry, this is an internal document and cannot be made public!

samkit-jain commented 1 year ago

@caolf Okay, see if you can redact the sensitive information and attach the redacted file here. Without the PDF, it will be difficult to properly debug and fix this (if it is a pdfplumber issue at all).

cmdlineluser commented 1 year ago

Hi @samkit-jain

There is an example PDF from https://github.com/pdfminer/pdfminer.six/issues/495#issuecomment-1594322455 which raises the same exception if you're interested:

https://github.com/pdfminer/pdfminer.six/files/11768084/pdfminer_testpart.pdf

I don't really know anything about PDF internals, but the issue seems to be that a PDFObjRef object is ending up in DecodeParms unresolved, when it shouldn't?

https://github.com/pdfminer/pdfminer.six/blob/master/pdfminer/pdfparser.py#L88

dic={'Length': 4065, 'Length1': 8964, 'Filter': /'FlateDecode', 'DecodeParms': <PDFObjRef:49>}
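
To illustrate the suspected failure mode: Python's in operator raises a TypeError when the right-hand operand defines none of __contains__, __iter__ or __getitem__, which is what would happen if params were still an unresolved indirect reference rather than a dict. The sketch below uses a stand-in class. FakeObjRef, resolve_params and has_predictor are hypothetical names for illustration only, and the dictionary values are made up. pdfminer.six ships its own resolve1 helper in pdfminer.pdftypes for resolving real references.

```python
from typing import Any


class FakeObjRef:
    """Stand-in for an unresolved indirect reference (not pdfminer's class)."""

    def __init__(self, target: Any) -> None:
        self._target = target

    def resolve(self) -> Any:
        return self._target


def resolve_params(params: Any) -> Any:
    """Follow stand-in references until a plain object is reached."""
    while isinstance(params, FakeObjRef):
        params = params.resolve()
    return params


def has_predictor(params: Any) -> bool:
    """Mirror the failing check, but resolve the reference first."""
    params = resolve_params(params)
    return bool(params) and "Predictor" in params


# Illustrative values only; the real DecodeParms contents are unknown.
ref = FakeObjRef({"Predictor": 12, "Columns": 4})

raised_type_error = False
try:
    "Predictor" in ref  # membership test on the raw, unresolved reference
except TypeError:
    raised_type_error = True
```

Resolving before the membership test is only one possible approach. Whether the proper fix belongs in the parser or in the stream decoder is a question for pdfminer.six.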

samkit-jain commented 1 year ago

Thanks for the PDF, @cmdlineluser. I'll see if there's something we can do.