Be more explicit about whether whitespace is acceptable after `startxref`

jrmuizel commented 2 months ago

The spec says:

"The last line of the file shall contain only the end-of-file marker, %%EOF. The two preceding lines shall contain, one per line and in order, the keyword startxref and the byte offset in the decoded stream from the beginning of the PDF file to the beginning of the xref keyword in the last cross-reference section. The startxref line shall be preceded by the trailer dictionary"

lopdf interprets this to mean that only line endings are allowed after startxref (https://github.com/J-F-Liu/lopdf/issues/318), but most other PDF readers seem to allow any whitespace, for example pdf.js (https://github.com/mozilla/pdf.js/blob/0676ea19cf17023ec8c2d6ad69a859c345c01dc1/src/core/document.js#L994), poppler (https://gitlab.freedesktop.org/poppler/poppler/-/blob/e23bd900b7c5f3262c3b6c5fb20d7569ca5193db/poppler/PDFDoc.cc?page=3#L2091) and mupdf (https://github.com/ArtifexSoftware/mupdf/blob/1d58f734a2a0302d9e3b7406509a30f587ada791/source/pdf/pdf-xref.c#L964).

There are lots of PDFs produced by Crystal Reports that have [space]EOL instead of just EOL, for example http://spasummerclassic.alkamelsystems.com/Results/03_2021/01_Spa%20Summer%20Classic/320_Spa%203%20Hours/202106251945_Qualifying/15_EventMaxiumSpeed_Qualifying.PDF

It would be nice to have more clarity about what whitespace is allowed between these tokens.
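To make the two readings concrete, here is a minimal Python sketch. It is not lopdf's or any other reader's actual code; the names `STRICT`, `LENIENT` and `find_startxref` are made up for illustration. The strict pattern accepts exactly one EOL after each token, while the lenient one accepts any run of PDF white-space.

```python
import re

# One end-of-line marker: CRLF, CR, or LF.
EOL = rb"(?:\r\n|\r|\n)"
# The six PDF white-space bytes: NUL, HT, LF, FF, CR, SP.
WS = rb"[\x00\t\n\x0c\r ]"

# Strict reading (hypothetical): nothing but a single EOL after each token.
STRICT = re.compile(
    rb"startxref" + EOL + rb"(\d+)" + EOL + rb"%%EOF" + EOL + rb"?\Z"
)
# Lenient reading (hypothetical): any run of white-space between the tokens.
LENIENT = re.compile(
    rb"startxref" + WS + rb"+(\d+)" + WS + rb"+%%EOF" + WS + rb"*\Z"
)


def find_startxref(tail: bytes, pattern: re.Pattern):
    """Return the byte offset that follows startxref, or None if no match."""
    m = pattern.search(tail)
    return int(m.group(1)) if m else None


clean = b"startxref\r\n116\r\n%%EOF\r\n"
space_before_eol = b"startxref \r\n116\r\n%%EOF\r\n"  # the [space]EOL case

print(find_startxref(clean, STRICT))               # strict accepts this
print(find_startxref(space_before_eol, STRICT))    # strict rejects this
print(find_startxref(space_before_eol, LENIENT))   # lenient accepts this
```

With the strict pattern, a single space before the EOL is enough to make the whole file tail unparseable, which is the behaviour reported in the lopdf issue.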

petervwyatt commented 2 months ago

I agree that the legacy clauses defining the lexical rules for various keywords are not great as they were written in a time before ISO-ese. We are also moving to use EBNF railroad diagrams to clarify such things. However, there can be differences between what the spec formally says and what (older?) software actually did (e.g. you may encounter PDFs that have comments near this location with their name and version).

Clause 7.2.3 states that "PDF syntax treats any sequence of consecutive white-space characters, not inside of a string or stream, as one character. The characters that are considered white-space characters are shown in "Table 1 — White-space characters"", and Table 1 includes the EOL bytes. Comments are also effectively treated as whitespace, so whitespace can occur before the `startxref` keyword, and whitespace and/or a comment can occur after it. Furthermore, comments cannot occur in a traditional cross-reference section from the `xref` keyword up to the `trailer` keyword, but additional whitespace can occur on those lines on either side of those keywords (since the xref is located by a byte offset, not by line parsing).
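As a rough illustration of that reading, the following Python sketch skips any mix of white-space and comments after `startxref` before reading the offset. The function names are hypothetical and this is only a sketch under the interpretation above, not a reference implementation. It treats every `%` as the start of a comment, so it only extracts the offset and does not validate the `%%EOF` marker.

```python
# The six white-space bytes from Table 1: NUL, HT, LF, FF, CR, SP.
PDF_WHITESPACE = b"\x00\t\n\x0c\r "


def skip_ws_and_comments(data: bytes, pos: int) -> int:
    """Advance pos past white-space and %-comments (which run to end of line)."""
    while pos < len(data):
        if data[pos] in PDF_WHITESPACE:
            pos += 1
        elif data[pos:pos + 1] == b"%":
            # A comment extends up to, but not including, the next CR or LF.
            while pos < len(data) and data[pos] not in b"\r\n":
                pos += 1
        else:
            break
    return pos


def read_startxref_offset(data: bytes):
    """Find the last startxref keyword and return the integer after it."""
    idx = data.rfind(b"startxref")
    if idx < 0:
        return None
    pos = skip_ws_and_comments(data, idx + len(b"startxref"))
    end = pos
    while end < len(data) and data[end:end + 1].isdigit():
        end += 1
    return int(data[pos:end]) if end > pos else None


# A tail with a trailing space and a producer comment after the keyword.
tail = b"startxref % ExampleWriter 1.0\r\n 116\r\n%%EOF\r\n"
print(read_startxref_offset(tail))
```

Because the xref is located by the byte offset, the reader does not need to know on which line the number sits, only that it is the next token after the keyword.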

Various ISO subset specifications (like PDF/A) also explicitly prohibit some of this and validators will reject such PDFs.

See also Errata #202 and Errata #363 for other similar discussions regarding lexical rules in content streams and between keywords and operators.

mkl-public commented 2 months ago

Indeed, the clause Peter quoted creates the impression that everywhere in a PDF (outside a string or stream) an end-of-line is equivalent to any non-empty combination of white-space characters and comments. In particular, adding a space character before an end-of-line marker should be ok.

While this is true in many regions of a PDF, it cannot apply to a region that is defined by certain "lines with specific contents", otherwise that region definition would become meaningless. So here the specific definition of the region overrides the general whitespace rule.

Thus, lopdf simply implements a strict interpretation of the spec: If the general whitespace rule cannot be applied fully here, it shall not be applied at all.

Other PDF processors may interpret this situation a bit differently: they do not allow the number of end-of-line markers to change, but they do tolerate additional white-space characters other than EOL.
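A minimal Python sketch of that middle-ground interpretation might look as follows. The helper names are hypothetical and this does not reproduce any particular processor's code. It keeps the "three lines" structure from the spec text but strips non-EOL white-space from each line before comparing.

```python
import re

# Non-EOL white-space bytes: NUL, HT, FF, SP.
NON_EOL_WS = b"\x00\t\x0c "


def last_lines(tail: bytes, n: int):
    """Split on CRLF, CR, or LF and return the last n lines."""
    lines = re.split(rb"\r\n|\r|\n", tail)
    if lines and lines[-1] == b"":
        lines.pop()  # drop the empty piece after a final EOL
    return lines[-n:]


def startxref_line_based(tail: bytes):
    """Require startxref / offset / %%EOF on three consecutive lines,
    ignoring leading and trailing non-EOL white-space on each line."""
    lines = [ln.strip(NON_EOL_WS) for ln in last_lines(tail, 3)]
    if (
        len(lines) == 3
        and lines[0] == b"startxref"
        and lines[1].isdigit()
        and lines[2] == b"%%EOF"
    ):
        return int(lines[1])
    return None


print(startxref_line_based(b"startxref \n116\n%%EOF\n"))   # extra space tolerated
print(startxref_line_based(b"startxref 116\n%%EOF\n"))     # line structure changed
```

Under this reading a trailing space is harmless, while moving the offset onto the `startxref` line or inserting a blank line is still rejected.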

pdf-association / pdf-issues

Be more explicit about whether whitespace is acceptable after `startxref` #464