prgx-csmith01 closed this issue 3 years ago
Hi @prgx-csmith01, would it be possible for you to redact everything from the PDF and then share it, so that it can be added as a test to PR #298?
Hi @samkit-jain, I can't share the PDF, but we have created a test file for you with an example of the metadata issue. I hope this helps. Thanks!
Many thanks, @prgx-csmith01. I have updated PR #298 with the test case.
As an aside: the integer value of the Copies entry is invalid.
According to the specification:
14.3.3 Document Information Dictionary
...
The value associated with any key not specifically mentioned in Table 317 shall be a text string.
(ISO 32000-1)
... and Table 317 contains neither a Copies entry nor any entry with a numeric type, only text strings, dates, and names.
Thus, strictly speaking, this issue is not a bug (as currently labeled) but a request to support one more type of invalid PDF.
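To make the spec point concrete, here is a minimal Python sketch that flags Info-dictionary entries of the kind described. It is not part of pdfplumber or of any PDF library: the function name `find_invalid_info_entries` is hypothetical, and it assumes the dictionary has already been parsed into plain Python values, with `str`/`bytes` standing in for text strings, dates, and names alike.

```python
def find_invalid_info_entries(info):
    """Return {key: type name} for entries whose value is not string-like.

    Simplifying assumption: text strings, dates, and names all arrive
    as str or bytes, so any other Python type (int, bool, float, ...)
    is treated as invalid for the document information dictionary.
    """
    invalid = {}
    for key, value in info.items():
        if not isinstance(value, (str, bytes)):
            invalid[key] = type(value).__name__
    return invalid


# Hypothetical metadata shaped like the one reported in this issue.
report = find_invalid_info_entries({"Title": "Invoice", "Copies": 0})
```

With the sample input, only the Copies entry is reported, since its value is an integer rather than a string.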
@mkl-public That's a good point, and thank you for raising it. I think your diagnosis is correct. I certainly don't want to slide down the slippery slope of trying to handle all malformed PDFs. In this case, however, @samkit-jain has PR'ed an efficient solution — it's a simple adjustment, and one that hopefully will accommodate a few other classes of invalid metadata entries in the future (without becoming a burden on the processing of valid PDFs).
Closed via https://github.com/jsvine/pdfplumber/pull/298; now available in develop and will appear in the next release.
I have received this error message for a PDF file:
It seems that integer metadata values are not handled in the init of pdf.py.
A similar bug, #67, was raised previously for boolean objects.
I cannot provide the PDF used that caused this error as it is client data. The metadata of the file contains { ... , "Copies" : 0 }.
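For illustration, a tolerant way to read such metadata might look like the sketch below. This is not the code from PR #298 or from pdf.py. The helper names `normalize_metadata_value` and `normalize_metadata` are hypothetical, and the sketch assumes raw string values arrive as `bytes` while other values arrive as plain Python objects.

```python
def normalize_metadata_value(value, encoding="utf-8"):
    """Decode bytes to str; pass other types through unchanged.

    Assumption: only bytes need decoding. Integers, booleans, and
    other non-string values are kept as-is instead of raising.
    """
    if isinstance(value, bytes):
        return value.decode(encoding, errors="replace")
    if isinstance(value, (list, tuple)):
        # Normalize each element of a nested sequence.
        return [normalize_metadata_value(v, encoding) for v in value]
    return value


def normalize_metadata(info):
    """Apply normalize_metadata_value to every entry of a metadata dict."""
    return {key: normalize_metadata_value(value) for key, value in info.items()}


# Hypothetical input resembling the metadata described above.
cleaned = normalize_metadata({"Title": b"Report", "Copies": 0})
```

The design choice here is to branch on the value's type before decoding, so a non-string entry such as "Copies" : 0 is preserved rather than triggering an error.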