jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Decode Integer Metadata #297

Closed: prgx-csmith01 closed this issue 3 years ago

prgx-csmith01 commented 3 years ago

I have received this error message for a PDF file:

Traceback (most recent call last):
  File "####", line 145, in ####
    pdf = pdfplumber.load( #### )
  File "/opt/venv-app-microservice/lib/python3.7/site-packages/pdfplumber/__init__.py", line 11, in load
    return PDF(file_or_buffer, **kwargs)
  File "/opt/venv-app-microservice/lib/python3.7/site-packages/pdfplumber/pdf.py", line 42, in __init__
    self.metadata[k] = decode_text(v)
  File "/opt/venv-app-microservice/lib/python3.7/site-packages/pdfplumber/utils.py", line 70, in decode_text
    ords = (ord(c) if type(c) == str else c for c in s)
TypeError: 'int' object is not iterable

It seems that the __init__ in pdf.py has no handling for integer metadata values.

A similar bug was previously raised for boolean objects in #67.

I cannot provide the PDF that caused this error, as it is client data. The metadata of the file contains { ... , "Copies" : 0 }.
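The failure can be reproduced without the client file. The sketch below copies only the expression shown in the last traceback frame into a hypothetical helper, ords_of (the rest of pdfplumber's decode_text is not reproduced here), and applies it to a made-up metadata dictionary that includes the reported "Copies": 0 entry. The other key and value are assumptions for illustration.

```python
def ords_of(s):
    # Same expression as the last frame of the traceback (utils.py, line 70).
    # The outermost iterable of a generator expression is evaluated right
    # away, so a non-iterable value fails on this very line.
    ords = (ord(c) if type(c) == str else c for c in s)
    return list(ords)


def try_ords(value):
    # Hypothetical wrapper: return the code points, or the error text.
    try:
        return ords_of(value)
    except TypeError as exc:
        return f"TypeError: {exc}"


# Made-up metadata; only the "Copies": 0 entry comes from the report.
metadata = {"Title": "Report", "Copies": 0}
results = {key: try_ords(value) for key, value in metadata.items()}

for key, outcome in results.items():
    print(key, "->", outcome)
```

Strings and bytes are iterable, so they pass through this line; an int is not, which is why the integer entry alone triggers the error.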

samkit-jain commented 3 years ago

Hi @prgx-csmith01, would it be possible for you to redact everything from the PDF and then share it, so that it can be added as a test to PR #298?

prgx-csmith01 commented 3 years ago

Hi @samkit-jain, I can't share the PDF, but we have created a test file for you with an example of the metadata issue. I hope this helps. Thanks!

test_int_metadata.pdf

samkit-jain commented 3 years ago

Many thanks, @prgx-csmith01. I have updated PR #298 with the test case.

mkl-public commented 3 years ago

As an aside: the integer value of the Copies entry is invalid.

According to the specification:

14.3.3 Document Information Dictionary

...

The value associated with any key not specifically mentioned in Table 317 shall be a text string.

(ISO 32000-1)

... and Table 317 contains neither a Copies entry nor any other entry with a numeric type, merely text strings, dates, and names.

Thus, strictly speaking, this issue is not a bug (as currently labeled) but a request to support one more type of invalid PDF.
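The quoted rule can be turned into a rough check. The sketch below is an illustration only, not part of pdfplumber: the function name and the sample dictionary are assumptions, and it presumes that text strings arrive from the parser as Python str or bytes. It flags any entry whose key is not listed in Table 317 and whose value is not a string.

```python
# Keys defined for the document information dictionary in ISO 32000-1, Table 317.
TABLE_317_KEYS = {
    "Title", "Author", "Subject", "Keywords", "Creator",
    "Producer", "CreationDate", "ModDate", "Trapped",
}


def nonconforming_entries(info):
    """Return entries outside Table 317 whose value is not a text string."""
    return {
        key: value
        for key, value in info.items()
        if key not in TABLE_317_KEYS and not isinstance(value, (str, bytes))
    }


# Made-up example: "Department" is a custom key with a valid string value,
# while "Copies" carries the integer described in this issue.
sample = {"Title": "Report", "Department": "QA", "Copies": 0}
flagged = nonconforming_entries(sample)
print(flagged)
```

Custom keys are allowed by the rule as long as their values are text strings, which is why only the integer entry is flagged here.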

jsvine commented 3 years ago

@mkl-public That's a good point, and thank you for raising it. I think your diagnosis is correct. I certainly don't want to slide down the slippery slope of trying to handle all malformed PDFs. In this case, however, @samkit-jain has PR'ed an efficient solution — it's a simple adjustment, and one that hopefully will accommodate a few other classes of invalid metadata entries in the future (without becoming a burden on the processing of valid PDFs).
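For readers wondering what such an adjustment might look like, here is a minimal sketch under stated assumptions. It is not the code from PR #298: decode_text below is a simplified stand-in (the real function in pdfplumber does more), and tolerant_metadata is a hypothetical name. The idea shown is to decode only values that are actually strings or bytes and to pass anything else through unchanged.

```python
def decode_text(value):
    # Simplified stand-in for a text decoder (assumption): bytes are decoded
    # as Latin-1, str values are returned unchanged.
    if isinstance(value, bytes):
        return value.decode("latin-1")
    return value


def tolerant_metadata(info):
    """Decode string-like metadata values; keep other types as they are."""
    decoded = {}
    for key, value in info.items():
        if isinstance(value, (str, bytes)):
            decoded[key] = decode_text(value)
        else:
            # Integers, booleans and other non-text values are not decoded,
            # so invalid entries such as "Copies": 0 no longer raise.
            decoded[key] = value
    return decoded


# Made-up input mixing valid and invalid entries.
cleaned = tolerant_metadata({"Title": b"Report", "Copies": 0, "Flag": True})
print(cleaned)
```

Valid PDFs take the same path as before (one isinstance check per entry), so the extra tolerance costs little for conforming files.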

jsvine commented 3 years ago

Closed via https://github.com/jsvine/pdfplumber/pull/298; now available in develop and will appear in the next release.