jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

TypeError: 'PDFObjRef' object is not iterable #316

Closed by Abdur-rahmaanJ 3 years ago

Abdur-rahmaanJ commented 3 years ago

Describe the bug

Opening a PDF fails with the following traceback:

Traceback (most recent call last):
  File "main.py", line 5, in <module>
    with pdfplumber.open("<stripped_path>") as pdf:
  File "<stripped_path>\venv\lib\site-packages\pdfplumber\pdf.py", line 46, in open
    return cls(open(path_or_fp, "rb"), **kwargs)
  File "<stripped_path>\venv\lib\site-packages\pdfplumber\pdf.py", line 33, in __init__
    self.metadata[k] = list(map(decode_text, v))
  File "<stripped_path>\pdfres\venv\lib\site-packages\pdfplumber\utils.py", line 77, in decode_text
    ords = (ord(c) if type(c) == str else c for c in s)
TypeError: 'PDFObjRef' object is not iterable

Code to reproduce the problem

import pdfplumber

with pdfplumber.open("target.pdf") as pdf:
    first_page = pdf.pages[0]
    print(first_page.chars[0])

Environment

samkit-jain commented 3 years ago

Hi @Abdur-rahmaanJ, I appreciate your interest in the library. Would it be possible for you to share a PDF that demonstrates this issue? It will help us reproduce and fix it. Please remove any sensitive information from the PDF before sharing it here.

Abdur-rahmaanJ commented 3 years ago

Try it on any ResearchGate PDF. If you don't get the error on Windows, I'll send you the exact PDF.

samkit-jain commented 3 years ago

I chose this PDF and it ran fine for me. I got the following output:

{'fontname': 'SourceSansPro-Regular', 'adv': Decimal('4.803'), 'upright': True, 'x0': Decimal('39.870'), 'y0': Decimal('711.959'), 'x1': Decimal('43.061'), 'y1': Decimal('717.939'), 'width': Decimal('3.192'), 'height': Decimal('5.980'), 'size': Decimal('5.980'), 'object_type': 'char', 'page_number': 1, 'stroking_color': (0, 0, 0), 'non_stroking_color': (0, 0, 0), 'text': 'S', 'top': Decimal('74.061'), 'bottom': Decimal('80.041'), 'doctop': Decimal('74.061')}

One thing to note is that I am using Ubuntu, not Windows. If you see the same error with the PDF I used, the problem might be OS-specific. If not, it is probably specific to your PDF, in which case I would request you to share the PDF you used.

Abdur-rahmaanJ commented 3 years ago

Try checking this one

samkit-jain commented 3 years ago

Thank you for sharing the PDF, @Abdur-rahmaanJ. The issue occurs because the PDF has a metadata field named Changes whose value is a list of PDFObjRef objects. The metadata loader passes each item of that list to decode_text, which tries to iterate over it, and a PDFObjRef is not iterable. I am not sure whether that is allowed by the PDF specification (linking https://github.com/jsvine/pdfplumber/issues/297#issuecomment-718862330), but nonetheless it is something that can be handled in the code. I shall raise a PR for it soon.
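To illustrate the kind of handling that could work here, below is a minimal sketch, not the actual pdfplumber fix. It assumes that an indirect reference exposes a resolve() method, as pdfminer's PDFObjRef does. StubObjRef, resolve_all, safe_decode, and clean_metadata are hypothetical names invented for this example, and the decoding is simplified to UTF-8, whereas real PDF strings may use other encodings.

```python
class StubObjRef:
    """Hypothetical stand-in for an indirect reference such as PDFObjRef."""

    def __init__(self, target):
        self._target = target

    def resolve(self):
        # Return the object this reference points to.
        return self._target


def resolve_all(value):
    """Recursively replace indirect references with the objects they point to."""
    # Follow a chain of references until a concrete value is reached.
    # Assumption: there are no circular references.
    while hasattr(value, "resolve"):
        value = value.resolve()
    if isinstance(value, (list, tuple)):
        return [resolve_all(item) for item in value]
    if isinstance(value, dict):
        return {key: resolve_all(item) for key, item in value.items()}
    return value


def safe_decode(value):
    """Decode bytes to str; leave every other type untouched (simplified)."""
    if isinstance(value, bytes):
        return value.decode("utf-8", errors="replace")
    return value


def clean_metadata(info):
    """Resolve references first, then decode each metadata value."""
    cleaned = {}
    for key, raw in info.items():
        value = resolve_all(raw)
        if isinstance(value, list):
            cleaned[key] = [safe_decode(item) for item in value]
        else:
            cleaned[key] = safe_decode(value)
    return cleaned


# Metadata shaped like the problematic PDF: "Changes" is a list of references.
info = {
    "Title": b"Example",
    "Changes": [StubObjRef(b"rev 1"), StubObjRef(b"rev 2")],
}
print(clean_metadata(info))
# → {'Title': 'Example', 'Changes': ['rev 1', 'rev 2']}
```

The point of the sketch is the ordering: references are resolved before anything tries to iterate over or decode the value, so a list of references no longer reaches the decoding step unresolved.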

Abdur-rahmaanJ commented 3 years ago

XD Since this was the first PDF I checked, I assumed pdfplumber was broken!