jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

TypeError: argument of type 'PDFObjRef' is not iterable #1120

Closed ibecav closed 2 months ago

ibecav commented 2 months ago

Describe the bug

Like several others (for example, #935), I have encountered this error when using the module. I hit it while running an exact copy of your example script for extracting form values (https://github.com/jsvine/pdfplumber?tab=readme-ov-file#extracting-form-values) against the example PDF I am enclosing.

Have you tried repairing the PDF?

Yes. The result is below (I had to laugh, because it really is a PDF file and it certainly renders correctly on screen):

Traceback (most recent call last):
  File "C:\Users\PowellCh\Desktop\RProjs\production_hai\clogged_pdf_toilet.py", line 4, in <module>
    pdf = pdfplumber.open("example.pdf", repair=True)
  File "C:\Users\PowellCh\AppData\Roaming\Python\Python312\site-packages\pdfplumber\pdf.py", line 95, in open
    return cls(
  File "C:\Users\PowellCh\AppData\Roaming\Python\Python312\site-packages\pdfplumber\pdf.py", line 45, in __init__
    self.doc = PDFDocument(PDFParser(stream), password=password or "")
  File "C:\Users\PowellCh\AppData\Roaming\Python\Python312\site-packages\pdfminer\pdfdocument.py", line 752, in __init__
    raise PDFSyntaxError("No /Root object! - Is this really a PDF?")
pdfminer.pdfparser.PDFSyntaxError: No /Root object! - Is this really a PDF?

Code to reproduce the problem

As stated above, this is a simple copy of one of your examples, run against the example PDF.

import pdfplumber
from pdfplumber.utils.pdfinternals import resolve_and_decode, resolve

pdf = pdfplumber.open("example.pdf", repair=True)

def parse_field_helper(form_data, field, prefix=None):
    """ appends any PDF AcroForm field/value pairs in `field` to provided `form_data` list

        if `field` has child fields, those will be parsed recursively.
    """
    resolved_field = field.resolve()
    field_name = '.'.join(filter(lambda x: x, [prefix, resolve_and_decode(resolved_field.get("T"))]))
    if "Kids" in resolved_field:
        for kid_field in resolved_field["Kids"]:
            parse_field_helper(form_data, kid_field, prefix=field_name)
    if "T" in resolved_field or "TU" in resolved_field:
        # "T" is a field-name, but it's sometimes absent.
        # "TU" is the "alternate field name" and is often more human-readable
        # your PDF may have one, the other, or both.
        alternate_field_name  = resolve_and_decode(resolved_field.get("TU")) if resolved_field.get("TU") else None
        field_value = resolve_and_decode(resolved_field["V"]) if 'V' in resolved_field else None
        form_data.append([field_name, alternate_field_name, field_value])

form_data = []
fields = resolve(pdf.doc.catalog["AcroForm"])["Fields"]
for field in fields:
    parse_field_helper(form_data, field)

PDF file

FWIW, it's a fillable-form PDF created by the CDC and saved locally after being filled in.

example.pdf

Expected behavior

I expected it to work the same way your example code does. The code does work on other pdf files that aren't of this type.

Actual behavior

Traceback (most recent call last):
  File "C:\Users\PowellCh\Desktop\RProjs\production_hai\clogged_pdf_toilet.py", line 27, in <module>
    for field in fields:
TypeError: 'PDFObjRef' object is not iterable

Screenshots

I can't think of any that would be helpful, but please let me know if you need some.

Environment

Additional context

My apologies in advance if I forgot any details in this issue. I'm new to Python and to your excellent module, but I have experience in other languages. My current hypothesis, based on reading other issues, is that there is something nonstandard about the PDF itself, but I am hopeful there is a workaround.

jeremybmerrill commented 2 months ago

Looks like calling resolve() on fields fixes the problem.

Replace fields = resolve(pdf.doc.catalog["AcroForm"])["Fields"] with

fields = resolve(resolve(pdf.doc.catalog["AcroForm"])["Fields"])

and it looks like it works. I think we could modify the example code to do this.
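A minimal sketch of why the extra resolve() helps, assuming the "Fields" entry in this PDF is itself stored as an indirect reference (which is what the "'PDFObjRef' object is not iterable" message suggests). To keep it runnable without pdfplumber or the PDF, it uses a hypothetical stand-in class, FakeObjRef, and a local resolve() written to mimic the behavior relied on here; neither is pdfplumber's actual implementation.

```python
# Minimal stand-ins so the sketch runs without pdfplumber or a PDF file.
# FakeObjRef is hypothetical; it only mimics the one trait that matters here:
# an indirect reference has a .resolve() method but cannot be iterated.
class FakeObjRef:
    def __init__(self, target):
        self._target = target

    def resolve(self):
        return self._target


def resolve(obj):
    """Return the referenced object if `obj` is a reference, else `obj` unchanged.

    Assumption: this mirrors how the real resolve() is being used in the thread.
    """
    return obj.resolve() if isinstance(obj, FakeObjRef) else obj


# A catalog shaped like the problem PDF: both "AcroForm" and its "Fields"
# entry are stored as indirect references.
field_a = FakeObjRef({"T": "first_name", "V": "Ada"})
field_b = FakeObjRef({"T": "last_name", "V": "Lovelace"})
catalog = {"AcroForm": FakeObjRef({"Fields": FakeObjRef([field_a, field_b])})}

# Original pattern: only the outer object is resolved, so this is still a reference.
unresolved_fields = resolve(catalog["AcroForm"])["Fields"]
try:
    iter(unresolved_fields)
    iteration_failed = False
except TypeError:
    iteration_failed = True

# Fixed pattern: resolve the "Fields" entry as well before looping over it.
fields = resolve(resolve(catalog["AcroForm"])["Fields"])
field_names = [resolve(field)["T"] for field in fields]

# With this stand-in, resolve() leaves a plain list untouched, so the fixed
# pattern also covers catalogs that store "Fields" directly.
plain_catalog = {"AcroForm": {"Fields": [field_a, field_b]}}
plain_fields = resolve(resolve(plain_catalog["AcroForm"])["Fields"])
```

If the real resolve() likewise passes non-reference values through unchanged, the doubled call should be harmless for PDFs where the original example already worked.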

ibecav commented 2 months ago

Thank you. I'll try this fix in a little bit. As for changing the example, I'll leave that to your discretion. I'm by no means an expert, but my understanding is that PDFs can be fickle and, as I noted, your example does work as-is on some PDFs.

ibecav commented 2 months ago

Thank you, that does indeed seem to resolve the error.

jsvine commented 2 months ago

Thanks @jeremybmerrill for the solution, and @ibecav for flagging. I've now updated the example code in the README.

jeremybmerrill commented 2 months ago

Great! I'm by no means an expert either -- all standards-compliant PDFs are alike, but all weird PDFs are weird in their own unique way -- but I do know that calling resolve() at every opportunity seems to make problems disappear.