Open jrmuizel opened 1 month ago
I agree that the legacy clauses defining the lexical rules for various keywords are not great as they were written in a time before ISO-ese. We are also moving to use EBNF railroad diagrams to clarify such things. However, there can be differences between what the spec formally says and what (older?) software actually did (e.g. you may encounter PDFs that have comments near this location with their name and version).
Clause 7.2.3 states that "PDF syntax treats any sequence of consecutive white-space characters, not inside of a string or stream, as one character. The characters that are considered white-space characters are shown in "Table 1 — White-space characters"", and Table 1 includes the EOL bytes. Comments are also effectively treated as whitespace, so whitespace can occur before the keyword `startxref`, and whitespace and/or a comment can occur after the `startxref` keyword. Furthermore, comments cannot occur in a traditional cross-reference section between the `xref` keyword and the `trailer` keyword, but additional whitespace can occur on those lines on either side of those keywords (since the `xref` is located by a byte offset, not by line parsing).
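To make the general rule concrete, here is a minimal sketch of a tokenizer helper that skips whitespace and comments as one unit. The function name `skip_whitespace_and_comments` and the constants are hypothetical, chosen only for illustration, and the byte set is taken from Table 1 (NUL, HT, LF, FF, CR, SP). It is not taken from any existing PDF library.

```python
# White-space bytes from Table 1 of the PDF spec: NUL, HT, LF, FF, CR, SP.
PDF_WHITESPACE = b"\x00\t\n\x0c\r "
# Bytes that terminate a comment (the EOL bytes).
EOL_BYTES = b"\r\n"


def skip_whitespace_and_comments(data: bytes, pos: int) -> int:
    """Return the index of the first byte at or after `pos` that is
    neither white-space nor part of a comment (hypothetical helper)."""
    n = len(data)
    while pos < n:
        if data[pos] in PDF_WHITESPACE:
            pos += 1
        elif data[pos] == ord("%"):
            # A comment runs up to, but not including, the next EOL byte.
            while pos < n and data[pos] not in EOL_BYTES:
                pos += 1
        else:
            break
    return pos


# Usage: a producer comment after the keyword is skipped like whitespace.
sample = b"startxref % SomeProducer 1.0\r\n1234"
offset_start = skip_whitespace_and_comments(sample, len(b"startxref"))
```

Note that a helper like this would also swallow `%%EOF`, since it starts with `%`; a real parser has to recognize the end-of-file marker before applying the generic comment rule.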
Various ISO subset specifications (like PDF/A) also explicitly prohibit some of this and validators will reject such PDFs.
See also Errata #202 and Errata #363 for other similar discussions regarding lexical rules in content streams and between keywords and operators.
Indeed, the clause Peter quoted creates the impression that everywhere in a PDF (outside string or stream) an end-of-line is equivalent to any non-empty combination of whitespaces and comments. In particular, adding a space character before an end-of-line marker should be ok.
While this is true in many regions of a PDF, it cannot apply to a region that is defined by certain "lines with specific contents", otherwise that region definition would become meaningless. So here the specific definition of the region overrides the general whitespace rule.
Thus, lopdf simply implements a strict interpretation of the spec: If the general whitespace rule cannot be applied fully here, it shall not be applied at all.
Other PDF processors may interpret this situation a bit differently: they do not allow the number of end-of-line markers to change, but they do allow other extra whitespace.
The spec says:
"The last line of the file shall contain only the end-of-file marker, %%EOF. The two preceding lines shall contain, one per line and in order, the keyword startxref and the byte offset in the decoded stream from the beginning of the PDF file to the beginning of the xref keyword in the last cross-reference section. The startxref line shall be preceded by the trailer dictionary"
lopdf interprets this to mean that only line endings are allowed after startxref (https://github.com/J-F-Liu/lopdf/issues/318), but most other PDF readers seem to allow any whitespace: https://github.com/mozilla/pdf.js/blob/0676ea19cf17023ec8c2d6ad69a859c345c01dc1/src/core/document.js#L994 https://gitlab.freedesktop.org/poppler/poppler/-/blob/e23bd900b7c5f3262c3b6c5fb20d7569ca5193db/poppler/PDFDoc.cc?page=3#L2091 https://github.com/ArtifexSoftware/mupdf/blob/1d58f734a2a0302d9e3b7406509a30f587ada791/source/pdf/pdf-xref.c#L964
There are lots of PDFs produced by Crystal Reports that have `[space]eol` instead of just `eol`, for example http://spasummerclassic.alkamelsystems.com/Results/03_2021/01_Spa%20Summer%20Classic/320_Spa%203%20Hours/202106251945_Qualifying/15_EventMaxiumSpeed_Qualifying.PDF.

It would be nice to have more clarity about what whitespace is allowed between these tokens.
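The three readings discussed in this thread can be compared side by side with a rough sketch. The pattern names `STRICT`, `LINE_PRESERVING` and `LENIENT` and the helper `find_startxref` are hypothetical and only illustrate the interpretations; they are not how lopdf, pdf.js, poppler or mupdf are actually implemented. Comments after `startxref` are not handled here.

```python
import re
from typing import Optional

WS = rb"[\x00\t\n\x0c\r ]"    # Table 1 white-space bytes
HSPACE = rb"[\x00\t\x0c ]"    # the same set without CR and LF
EOL = rb"(?:\r\n|\r|\n)"      # a single end-of-line marker

# Strict reading: each line contains only the keyword or the offset.
STRICT = re.compile(rb"startxref" + EOL + rb"(\d+)" + EOL + rb"%%EOF")

# Intermediate reading: extra non-EOL whitespace is tolerated,
# but the number of end-of-line markers must not change.
LINE_PRESERVING = re.compile(
    rb"startxref" + HSPACE + rb"*" + EOL
    + HSPACE + rb"*(\d+)" + HSPACE + rb"*" + EOL
    + HSPACE + rb"*%%EOF"
)

# Lenient reading: any run of whitespace may separate the tokens.
LENIENT = re.compile(rb"startxref" + WS + rb"+(\d+)" + WS + rb"*%%EOF")


def find_startxref(tail: bytes, pattern: re.Pattern) -> Optional[int]:
    """Return the offset from the last match in `tail`, or None."""
    matches = list(pattern.finditer(tail))
    return int(matches[-1].group(1)) if matches else None


# A tail with the `[space]eol` shape described above.
tail = b"trailer\n<< /Size 5 >>\nstartxref \r\n1234\r\n%%EOF\r\n"
results = {
    "strict": find_startxref(tail, STRICT),
    "line_preserving": find_startxref(tail, LINE_PRESERVING),
    "lenient": find_startxref(tail, LENIENT),
}
```

With this input only the strict pattern fails to find the offset, which matches the behaviour reported for lopdf in the linked issue.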