ssimms / pdfapi2

Create, modify, and examine PDF files in Perl

Improve handling of files with additional text in header #22

Closed dasyurid closed 1 year ago

dasyurid commented 4 years ago

Some devices - notably Sharp MFDs - add extra text after the PDF version in the header line. While the standard doesn't forbid this, it doesn't explicitly allow it either. Nonetheless, most readers will open these files without comment. PDF::API2::Basic::PDF::File rejects these files as invalid.

This change accepts extra text on the header line so that PDF::API2 matches the generally accepted behavior. I've been using this patch in production for some time now with no problems.

coveralls commented 4 years ago

Coverage decreased (-0.008%) to 56.834% when pulling 5a654d8c1632052e56a04db72e0fde26e84bfb11 on dasyurid:master into fcc73b15b2e1b837a42689e1294be6868107e8b8 on ssimms:master.

PhilterPaper commented 4 years ago

Take a look at ticket RT 106020 (rejected in PDF::API2, fixed in PDF::Builder). It sounds like the same thing.

ssimms commented 1 year ago

This is indeed the same issue as RT 106020. The current PDF specification explicitly forbids extra characters on the first line of the PDF.

2.0:

The file header shall consist of “%PDF-1.n” or “%PDF-2.n” followed by a single EOL marker, where ‘n’ is a single digit number between 0 (30h) and 9 (39h).

Compare 1.7 (which in my opinion is also clear, but not quite as explicit):

The first line of a PDF file shall be a header consisting of the 5 characters %PDF- followed by a version number of the form 1.N, where N is a digit between 0 and 7.

Because of this, I'm opting to have PDF::API2 continue to treat these files as invalid PDFs, on the thinking that generators that don't follow the easy parts of the spec are likely to have issues with the harder parts too. That said, the workaround given in the RT ticket should still be valid if you want to take that chance. You can also write a script to turn the invalid PDF into a maybe-valid one by replacing the ninth and tenth characters with \n% (inserting the required newline and turning whatever else was on that line into a comment), as long as there are at least ten characters before the first newline. Overwriting two characters instead of inserting one keeps the file length unchanged, so the byte offsets recorded in the cross-reference table stay correct.
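
For anyone who wants to try the script approach, here is a minimal sketch in Python (not part of PDF::API2, and not tested against real Sharp output). The names `has_strict_header` and `patch_header` are hypothetical helpers made up for illustration. The sketch uses `%` as the second replacement byte, since `%` is the character that introduces a comment in PDF syntax, and it overwrites bytes in place so the file length does not change. The result is only a "maybe-valid" PDF, as noted above.

```python
import re

# Strict header, following the PDF 2.0 wording quoted above:
# "%PDF-1.n" or "%PDF-2.n" followed directly by an EOL marker.
STRICT_HEADER = re.compile(rb"%PDF-[12]\.[0-9](\r\n|\r|\n)")

# Just the version prefix, without requiring the EOL marker.
VERSION_PREFIX = re.compile(rb"%PDF-[12]\.[0-9]")


def has_strict_header(data: bytes) -> bool:
    """Return True if the data starts with a spec-conforming header line."""
    return STRICT_HEADER.match(data) is not None


def patch_header(data: bytes) -> bytes:
    """Overwrite the 9th and 10th bytes with a newline and a '%'.

    Hypothetical helper: the extra header text becomes a comment line,
    and the total length is unchanged so existing byte offsets still hold.
    """
    if has_strict_header(data):
        return data  # nothing to fix
    if VERSION_PREFIX.match(data) is None:
        raise ValueError("data does not start with a %PDF-x.y header")

    # Locate the first EOL character; the patch is only safe when at
    # least ten bytes precede it, otherwise it would clobber the EOL.
    eol = re.search(rb"[\r\n]", data)
    first_eol = eol.start() if eol else len(data)
    if first_eol < 10:
        raise ValueError("fewer than ten bytes before the first newline")

    return data[:8] + b"\n%" + data[10:]
```

To use it on a file, read the input in binary mode (`open(path, "rb")`), pass the bytes to `patch_header`, and write the result to a new file in binary mode (`"wb"`). Note that the tenth byte of the original header line is lost, which should not matter because that text was not valid PDF content in the first place.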