pdf-association / pdf-issues

Industry-based resolutions for issues and errata reported against any PDF-related specification
https://pdf-issues.pdfa.org/

Additional explicit requirements for EOLs around startxref, xref and trailer keywords #112

Open petervwyatt opened 3 years ago

petervwyatt commented 3 years ago

This discussion has arisen in the ISO TC 171 SC 2 WG 8 "Securing PDF" discussion group and was suggested to be raised here.

ISO 32000-2:2020 does not state many explicit requirements around EOLs before and after the keywords startxref, xref and trailer. It might be argued that existing language (e.g. via the use of the word "line") might imply EOLs before and/or after some of these keywords, but we can and should do better and be explicit.

Proposals:

petervwyatt commented 2 years ago

ISO 32000-1:2008 Annex C (normative) also has this statement with "should" requirements:

When a conforming reader reads a PDF file with a damaged or missing cross-reference table, it may attempt to rebuild the table by scanning all the objects in the file. However, the generation numbers of deleted entries are lost if the cross-reference table is missing or severely damaged. To facilitate such reconstruction, object identifiers, the endobj keyword, and the endstream keyword should appear at the start of a line. Also, the data within a stream should not contain a line beginning with the word endstream, aside from the required endstream that delimits the end of the stream.
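To illustrate why the "start of a line" recommendation helps reconstruction, here is a minimal sketch of a recovery scan. It is not taken from any particular PDF library. The function name `rebuild_offsets` and the regular expression are assumptions made for this example, and a real reader would also have to skip stream data, which this sketch does not do.

```python
import re

# Matches "<object number> <generation> obj" only at the start of a line.
# PDF accepts CR, LF, or CRLF as EOL, so "start of a line" is taken here as
# the start of the data or the position right after a CR or LF byte.
OBJ_AT_LINE_START = re.compile(
    rb"(?:^|(?<=[\r\n]))(\d+)[ \t]+(\d+)[ \t]+obj\b"
)


def rebuild_offsets(data: bytes) -> dict[tuple[int, int], int]:
    """Map (object number, generation) to the byte offset of the object.

    Hypothetical helper for illustration. Later definitions overwrite
    earlier ones, mirroring how incremental updates supersede older objects.
    Caveat: stream contents are not skipped, so a stream containing a line
    that looks like an object header would be picked up by mistake.
    """
    offsets: dict[tuple[int, int], int] = {}
    for match in OBJ_AT_LINE_START.finditer(data):
        key = (int(match.group(1)), int(match.group(2)))
        offsets[key] = match.start(1)
    return offsets
```

An object identifier that does not begin a line is invisible to a scan of this kind, which is the practical reason for the "should" wording quoted above.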

GitHubRulesOK commented 6 months ago

For consistency it would be best if all those cases were treated equally. That is, the end-of-file marker is causing many real-world problems because it does NOT have the same status at the real EOF as it has anywhere else in a file.

To explain what I mean: many PDF writers consider that the %%EOF at the end of a /Linearized section MUST follow the rule that ALL lines are terminated, so that %%EOF is automatically followed by an implied mandatory EOL. HOWEVER, when it comes to the "current" final %%EOF, the old guides and current standards only mention an "optional" EOL after it and, worst of all, encourage having no EOL through this statement in 7.5.5:

...The last line of the file shall contain only the end-of-file marker, %%EOF....

Hence a file that is to be signed or edited often has no REAL physical end-of-line after its EOF marker; its last byte is simply the "F" of %%EOF.

This matters because the next appendage often starts without an EOL, so we constantly find real-world examples such as %%EOF1000 0 Obj<EOL>, which is clearly a problem for parsing signatures and annotations.

Signature failure and annotation failure are often undetected until a later stage.
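A minimal sketch of how such files could be flagged early, assuming the whole file is available as bytes. The name `find_fused_eof` is hypothetical and used only for this example. The check is naive: it does not distinguish a marker inside stream data from a real one.

```python
import re

# "%%EOF" immediately followed by a byte that is neither CR nor LF, i.e. the
# marker is fused with whatever a later writer appended after it.
FUSED_EOF = re.compile(rb"%%EOF(?=[^\r\n])")


def find_fused_eof(data: bytes) -> list[int]:
    """Return the byte offsets of every %%EOF not followed by an EOL.

    Hypothetical helper for illustration. A %%EOF that is the very last
    bytes of the data is not reported, because nothing has been appended
    after it yet.
    """
    return [match.start() for match in FUSED_EOF.finditer(data)]
```

A non-empty result means at least one appended section started without an EOL, so it could be checked before signing rather than discovered afterwards.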

Proposal remove the optional use of EOL and ensure

mkl-public commented 6 months ago

This matters because the next appendage often starts without an EOL, so we constantly find real-world examples such as %%EOF1000 0 Obj<EOL>, which is clearly a problem for parsing signatures and annotations.

PDF processors can prevent creating such problems by starting an incremental update with an EOL, at least if they find no EOL at the current end of the PDF.
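A minimal sketch of that approach, assuming the existing file and the new section are both held as bytes. The function name `start_incremental_update` and its `eol` parameter are assumptions for this example, not any library's API.

```python
def start_incremental_update(
    existing: bytes, update: bytes, eol: bytes = b"\n"
) -> bytes:
    """Append `update` to `existing`, inserting an EOL first if needed.

    Hypothetical helper for illustration. If the existing file already ends
    with CR or LF, nothing is inserted. The EOL to insert is a parameter
    because the choice is left to the writer.
    """
    if not existing.endswith((b"\r", b"\n")):
        existing += eol
    return existing + update
```

Any byte offsets written into the new cross-reference section have to be computed after the EOL is inserted, since the inserted bytes shift everything that follows.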

Admittedly, depending on your choice of EOL, this can make Adobe Acrobat unhappy when validating changes in incremental updates to signed PDFs; see this Stack Overflow Q&A. But on the one hand this can be considered an Acrobat bug, and on the other there is a clear workaround: using a different EOL.

Thus, I'd consider clarifying this merely a nice-to-have.

Also I'd propose to implement the clarification not by a specific requirement for %%EOF but instead by defining the term line (as in line in the file, not as in line in a path). If I recall correctly, there is no real definition thereof in the spec.

GitHubRulesOK commented 6 months ago

Hi @mkl-public (KJ here), the example you cite was one of several I encountered more recently, but the reason I was prompted to raise the issue here is:

https://stackoverflow.com/questions/78177339/merge-annotations-of-pdf-files-using-python-borb-library

Borb does not add an EOL, and Evince does not start appendages with an EOL (there have been other similar cases in the past). Adding more annotations compounds the problem until the file is corrected, which could be after signing.

mkl-public commented 6 months ago

Actually, I would recommend pointing out to the borb and Evince developers that a PDF may currently end without an EOL, and asking them to add one when appending to a PDF, at least if there is none yet.

Clarifying this in the spec of course would be nice to have, but even then there will still be PDFs without an EOL at the end for many years to come.