pdf-association / pdf-issues

Industry-based resolutions for issues and errata reported against any PDF-related specification
https://pdf-issues.pdfa.org/

Comments in stream data #273

Open LegionMammal978 opened 1 year ago

LegionMammal978 commented 1 year ago

In ISO 32000-2:2020, 7.2.4 states,

Any occurrence of the PERCENT SIGN (25h) outside a string or inside a content stream (see 7.8.2, "Content streams") introduces a comment.

This implies that a percent sign inside a stream always introduces a comment. 7.2.3 states, "The rules defined in this subclause apply to all characters in the file except within strings, streams, and comments," but the statement in 7.2.4 does not appeal to the classification of characters defined in 7.2.3 and is not bound by its limitations. This is a breaking change from PDF-1.7, which instead states "outside a string or stream" in 7.2.3.

If a percent sign inside a non-content stream does not always introduce a comment, then there is still the question of whether a percent sign within the decoded data of an object stream can introduce a comment. 7.5.7 states that "the N objects are stored consecutively" in an object stream following the list of byte offsets. Does 7.2.4 apply to parsing these objects after decoding the stream data? 7.2.1 suggests that objects as syntactic entities are formed from tokenized bytes, using the ordinary syntax rules which accept comments:

At the most fundamental level, a PDF file is a sequence of bytes. These bytes can be grouped into tokens according to the syntax rules described in subclauses 7.2.2, "Representation" through 7.2.4, "Comments". One or more tokens are assembled to form higher-level syntactic entities, principally objects, which are the basic data values from which a PDF file is constructed.

However, if comments are permitted in object streams, then further clarification is needed in 7.5.7. In particular, what if one object in an object stream is followed by a comment with no EOL marker, and the next byte offset points into that comment? For instance,

1 0 obj
<< /Type /ObjStm
   /Length 17
   /N 2
   /First 8
>>
stream
2 0 3 6
123 % 456
endstream
endobj

7.5.7 Note 7 suggests that "processing of each object in an object stream starts at the specified byte offset in the decompressed stream and ends prior to the byte offset of the next object or when the end of stream is encountered", which would permit this. But attempting to parse the list of objects in one go would skip object 3.
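The difference between the two readings can be sketched in Python. This is a minimal illustration under stated assumptions, not a conforming parser: `tokens` and `segment_object_stream` are hypothetical helper names, the stream data is assumed to be already decoded, strings are not handled, and `bytes.split()` only approximates the PDF white-space set.

```python
import re

def tokens(data: bytes) -> list[bytes]:
    """Rough tokenizer (assumption): drop % comments up to the next CR/LF, then split."""
    without_comments = re.sub(rb"%[^\r\n]*", b" ", data)
    return without_comments.split()

def segment_object_stream(decoded: bytes, n: int, first: int) -> dict[int, bytes]:
    """Cut the decoded data at the byte offsets BEFORE any parsing happens."""
    header = decoded[:first].split()
    # N pairs of integers: object number, byte offset relative to /First.
    pairs = [(int(header[2 * i]), int(header[2 * i + 1])) for i in range(n)]
    body = decoded[first:]
    # Each object ends where the next one starts, or at the end of the stream.
    ends = [offset for _, offset in pairs[1:]] + [len(body)]
    return {num: body[offset:end] for (num, offset), end in zip(pairs, ends)}

decoded = b"2 0 3 6\n123 % 456"  # the 17 bytes of stream data from the example

# Reading 1: segment at the offsets first, then tokenize each piece.
segments = segment_object_stream(decoded, n=2, first=8)
per_object = {num: tokens(chunk) for num, chunk in segments.items()}

# Reading 2: tokenize everything after /First in one go.
one_go = tokens(decoded[8:])
```

With the first reading, objects 2 and 3 each yield one integer token; with the second, the comment swallows `456` and only one token remains.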

(There's also another question, of whether an object stream can begin with white-space, since the wording only explicitly permits white-space separating the integers specifying the byte offsets. But this may be adequately implied already by the ordinary syntax rules.)

datalogics-pgallot commented 1 year ago

If you are placing comments in Object streams, you've kind of missed the point of the purpose of object streams.

The purpose of object streams is to compress collections of objects. The more similar the objects are to each other, the more compressible the stream is. Conversely, adding dissimilar content to the object stream reduces the compression savings. Comments won't be visible because the entire stream will be compressed so they would just be useless bloat within the object stream.

datalogics-pgallot commented 1 year ago

Content streams (7.8.2) are stream objects (7.3.8), but not all stream objects are content streams. A 25h that is within a non-content-stream stream object's stream and endstream delimiters is not a token delimiter (see table 2), it's just data.

LegionMammal978 commented 1 year ago

If you are placing comments in Object streams, you've kind of missed the point of the purpose of object streams.

The purpose of object streams is to compress collections of objects. The more similar the objects are to each other, the more compressible the stream is. Conversely, adding dissimilar content to the object stream reduces the compression savings. Comments won't be visible because the entire stream will be compressed so they would just be useless bloat within the object stream.

My purpose isn't to suggest that putting comments in an object stream would be a good idea, it's to clarify the boundaries of what a conforming PDF processor must be prepared to accept.

A 25h that is within a non-content-stream stream object's stream and endstream delimiters is not a token delimiter (see table 2), it's just data.

As I said, the wording at the start of 7.2.4 ("Any occurrence of the PERCENT SIGN (25h)...") does not appeal to the classification of characters into regular, delimiter, and white-space characters, so the classification is irrelevant. If the classification is intended to be relevant, then the wording should be modified to reflect that.

petervwyatt commented 1 year ago

I agree that clarification is needed as to whether PDF comments are valid in object streams.

Current wording would imply they are not since they are not "content streams", but this may be an oversight in clause 7.2.4 rather than a result of any explicit decision. However, pragmatically, I can imagine comments getting included (e.g. for debugging or educational purposes or when converted from traditional body objects) as well as the majority of implementations simply reusing their "standard" PDF object lexer/parsers that must cope with comments. Having to special case object stream parsing seems an unnecessary complexity most would not do which then makes it de facto supported.

PS. The classification of specific bytes is already clearly defined in the subclauses of clause 7.2, including Tables 1 and 2, as well as subclauses in clause 7.3.

Will raise this in the next PDF TWG Meeting to gain consensus.

LegionMammal978 commented 1 year ago

Having to special case object stream parsing seems an unnecessary complexity most would not do which then makes it de facto supported.

7.5.7 Note 7 can be read as suggesting that extra complexity is already present, i.e., that when object processing restarts at each byte offset, this resets the lexical state, terminating a trailing comment with no EOL necessary. (If comments are permitted, even if two strings each represent a syntactically valid object, their concatenation does not necessarily represent a sequence of two objects.) So if it is not intended that the byte offset boundary can terminate a comment, the note should be clarified.

(Also, speaking of Note 7, it states that earlier PDF standards require white-space between objects in the sequence, but I do not see any wording in PDF-1.5 through PDF-1.7 that implies such a requirement.)

PS. The classification of specific bytes is already clearly defined in the subclauses of clause 7.2, including Tables 1 and 2, as well as subclauses in clause 7.3.

Sure, the classification is well-defined, the question is over where it is applicable. Most of the syntactic rules refer to objects, keywords, etc., which are explicitly defined in terms of tokens, so the classification is almost always applicable. But it's not immediately clear that the classification is applicable to 7.2.4, which just refers to byte 25h without referring to the byte in its role as a delimiter. (Although such a comment inside a stream would be very strange, since the terminating EOL marker wouldn't occur until after the end of the stream data.)

petervwyatt commented 1 year ago

The key concept in Note 7 is that the object stream is segmented BEFORE parsing - it is NOT segmented as a result of parsing.

Note 7 is effectively the same as encountering an end-of-stream for normal streams - it won't change the lexical analysis if the current token is within a comment and the stream ends prior to the EOL. Technically this is an error and you'd be right to fail such PDFs, but your customers and other implementations will likely silently recover. Of course, the following object in the object stream (if one exists) may then be a complete mess but that is separate...!

The note re whitespace refers to the first edition of PDF 2.0 (ISO 32000-2:2017) which is now withdrawn and replaced by the 2020 edition.

LegionMammal978 commented 1 year ago

Technically this is an error and you'd be right to fail such PDFs, but your customers and other implementations will likely silently recover.

Alright, so should Note 7 be read as permitting PDF readers to delimit the objects at the byte offsets, without permitting conforming PDF files to contain a sequence of byte offsets for which this segmentation would change the results of parsing, relative to a simple parse of the entire sequence of objects? (E.g., segmenting 123 % 456 into 123 % and 456 would turn one object into two.)

petervwyatt commented 1 year ago

Yes - it also means that 123456 is technically valid and can be processed as 123 and 456 with appropriate byte offsets for 2 objects from the first line of the object stream (and not as 123456 and 456). However, given the 2017 edition this is probably not advisable to write.

LegionMammal978 commented 1 year ago

I'm afraid I don't quite understand what you mean: is this pattern inadvisable, or is it forbidden? Is a PDF processor conforming to ISO 32000-2:2020 required to accept this pattern and correctly produce 123 and 456?

If so, Note 7 feels like a bit of an understatement: an object stream doesn't really contain a "sequence of indirect objects" that would make an array if you surrounded it with [ ], but instead a sequence of substrings delimited by the byte offsets instead of by the ordinary token boundaries.

petervwyatt commented 1 year ago

It is NOT forbidden as this is not stated - only potentially inadvisable because the 2017 edition stated a requirement that was never intended (missed during reviews), so any SW that has not been maintained up to the 2020 edition will work differently. Existing PDFs created against the 2017 edition will work with SW supporting either edition.

And object streams are, after the "N pairs of integers", definitely a sequence of indirect objects. See 7.3.10 for the definition of an indirect object. Within every indirect object in the object stream the standard PDF lexical delimiter rules apply - these are the same lexical rules as if each object were an indirect object in a conventional PDF body section: the object ends just prior to the endobj token. What the requirements effectively state is that the byte offsets into the object stream of indirect objects are precise, non-overlapping, and obeyed by all processors - hence why I said above "object stream is segmented BEFORE parsing - it is NOT segmented as a result of parsing."

petervwyatt commented 1 year ago

I think this discussion has mostly reduced down to this point from the OP:

(There's also another question, of whether an object stream can begin with white-space, since the wording only explicitly permits white-space separating the integers specifying the byte offsets. But this may be adequately implied already by the ordinary syntax rules.)

Later discussion asked if this also applies to comments and before the start of an object in an object stream. I believe the answer is yes to all, but a note in the object stream subclause would not hurt. Trying to patch this into the first few subclauses in 7 about lexical rules has the potential to make things more confusing.

petervwyatt commented 1 year ago

PDF TWG would like to research popular implementations.

MPBailey commented 1 year ago

While looking at this ...

Any occurrence of the PERCENT SIGN (25h) outside a string or inside a content stream (see 7.8.2, "Content streams") introduces a comment.

That would appear to say that a % character inside a string in a content stream introduces a comment, which is clearly not intended.

Also, please be careful not to mess up something like an image stream by treating % as special. I suspect that's what led to the addition of 'content' in the text.

petervwyatt commented 1 year ago

Yes, that phrasing is really quite bad!

Further problems when you consider that a comment is also whitespace:

I'd suggest the only way to correct this is to replace the sentence with an expanded bullet list...

LegionMammal978 commented 1 year ago

Indeed, the interpretation can be further extended into absurdity. Such a comment in a stream would continue "up to but not including the end-of-the-line marker" (7.2.4). However, character classification is explicitly not applicable to characters inside streams (7.2.3), including classification as an EOL marker. Therefore, the next EOL marker can only come after the stream data, and the comment would gobble up everything up to the endstream!

That aside, I noticed that it's called an "end-of-the-line marker" in 7.2.4, but an "end-of-line marker" everywhere else; it might be good to replace it there with the more common (and formally defined) term.

mkl-public commented 10 months ago

To prevent this issue from being forgotten, here is a first proposal for changing the start of 7.2.4, which essentially reverts the sentence to its ISO 32000-1 form and then adds (decoded) content streams as another place where comments may occur.

I.e. replace

Any occurrence of the PERCENT SIGN (25h) outside a string or inside a content stream (see 7.8.2, "Content streams") introduces a comment. The comment consists of all characters after the PERCENT SIGN and up to but not including the end-of-the-line marker.

by

Any occurrence of the PERCENT SIGN (25h) outside a string or stream introduces a comment. Furthermore, any occurrence thereof inside a decoded content stream (see 7.8.2, "Content streams") (except in string arguments of instructions or inlined image data) also introduces a comment. The comment consists of all characters after the PERCENT SIGN and up to but not including the next end-of-line marker or (in case of comments in content streams) end of stream.
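As a rough illustration of the content-stream part of this wording, the following Python sketch collects comments from decoded content stream data while skipping literal strings. The function name `find_comments` is hypothetical, and the sketch deliberately ignores hexadecimal strings and inline image data, both of which a real scanner would need to handle.

```python
def find_comments(decoded: bytes) -> list[bytes]:
    """Return comment bodies (text after '%') found outside literal strings."""
    comments = []
    depth = 0  # nesting depth of balanced parentheses in a literal string
    i, n = 0, len(decoded)
    while i < n:
        c = decoded[i:i + 1]
        if depth > 0:
            if c == b"\\":
                i += 2  # skip the escaped character inside the string
                continue
            if c == b"(":
                depth += 1
            elif c == b")":
                depth -= 1
        elif c == b"(":
            depth = 1
        elif c == b"%":
            # The comment runs up to, but not including, the next
            # end-of-line marker, or to the end of the stream data.
            end = i + 1
            while end < n and decoded[end:end + 1] not in (b"\r", b"\n"):
                end += 1
            comments.append(decoded[i + 1:end])
            i = end
            continue
        i += 1
    return comments

stream = b"BT (50% done \\) still) Tj % show text\nET % trailing"
found = find_comments(stream)
```

Here the `%` inside the string argument is treated as data, while the final comment is closed by the end of the stream rather than by an EOL marker.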

petervwyatt commented 10 months ago

I like where you are going - but the words need massaging...

PendantsRUs: Are % comments also allowed in Type 4 PostScript functions? Are they therefore PS comments or PDF comments?

mkl-public commented 10 months ago

Some massaging I expected... ;)

IMO here (in PDF / Syntax / Lexical conventions) we should address only genuine PDF comments, not parts of data that are processed according to their own external specifications, like PS or JavaScript, and are considered comments there.

Actually one could reduce the text here even more and use the up-to-ISO 32000-1 formulation (outside a string or stream) and only add that additional comments in more specific contexts (e.g. decoded content streams) may be specified elsewhere in this spec. Then one would add a small section on comments to the Content Streams section 7.8.

And maybe to the object stream section for the stream content after the First offset if that's desired. (Needs some fine tuning.)

But most likely not to the xref stream section.

petervwyatt commented 8 months ago

For PDF TWG review: see @mkl-public comment of Aug 30

MPBailey commented 8 months ago

Just to note that a PostScript calculator function is a PDF construct that happens to borrow a few bits from PostScript; it's not defined in an external specification. And 7.10.5.1 explicitly notes that the function can include comments, without any details of how they are delimited. FWIW mkl-public's Aug 30 suggestion would work well here; add those details somewhere in 7.10.5 rather than complicating some other section.

petervwyatt commented 8 months ago

  • 1st line of an object stream: MAY have a comment (whitespace)
  • between tokens of "N-pairs" of an object stream: MAY include a comment
  • CANNOT have comments in cross-ref streams (effectively binary data based on W)
  • Inside a PS Type4 function: MAY include a comment (noting @MPBailey comment above)

Review where to insert appropriate wording changes (normative or informative)

petervwyatt commented 8 months ago

Proposed solutions for the above cases, all done informatively via notes:

This still does not address the core problems with the first sentence of 7.2.4 Comments as identified by the OP:

"Any occurrence of the PERCENT SIGN (25h) outside a string or inside a content stream (see 7.8.2, "Content streams") introduces a comment."

which needs to be reworded (a bulleted list aids readability):

The occurrence of a PERCENT SIGN (25h) delimiter introduces a comment in the following situations:

  • inside a content stream (see 7.8.2, "Content streams") when not inside a literal string object
  • inside an object stream (see 7.5.7 "Object streams") when not inside a literal string object
  • outside a literal string object (see 7.3.4.2 Literal strings)

NOTE: Type 4 function streams may contain comments according to the semantics of the PostScript Language Reference - see 7.10.5.2, "Operators and operands"

mkl-public commented 8 months ago

Concerning the proposed rewording at the end:

I'm not sure but is there a need to stress that by "inside a xxx stream" we mean "inside a xxx stream after applying all filters," i.e. in the uncompressed, unencrypted stream?

Concerning item 3, "outside a literal string object (see 7.3.4.2 Literal strings)," we IMO need to extend this to "outside streams and literal string objects (see 7.3.4.2 Literal strings)", otherwise item 3 would allow comments in arbitrary streams.

Furthermore, you stress literal strings. Does this mean you would allow comments in hexadecimal strings like this:

<48% "H"
45% "E"
58% "X"
>

petervwyatt commented 8 months ago

Thanks for the feedback.

Yes - definitely after applying filters. Arguably "7.5.7 Object streams" and "7.8 Content streams and resources" both already account for this by the way things are described in those clauses. What is slightly more confusing is that if those streams were encoded with AsciiHex or Ascii85 then whitespace is also ignored on the compressed (unfiltered) data for those specific filters, and comments are treated as whitespace...

Re item 3: my wording with "delimiter" in the opening sentence was an attempt to avoid arbitrary stream data (either filtered or unfiltered) that is not expressed in PDF tokens. A % sign in a JPEG or FLATE stream is definitely NOT a comment!

To answer your last question: yes, because hex strings explicitly state "White-space characters (see “Table 1 — White-space characters”) shall be ignored." - and comments are treated as whitespace.

Or at least I don't see any statements that contradict those interpretations - and more than a few implementations appear to work this way.

PS. This also means that literal strings and hex strings (i.e. all strings) are the only PDF tokens that can span multiple lines.
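Under this reading, the hexadecimal string from the question would decode as sketched below. This is only an illustration of that interpretation, not a statement of what the specification requires: `decode_hex_string` is a hypothetical helper, the token is assumed to be already isolated including its `<` and `>`, and `bytes.split()` only approximates the PDF white-space set.

```python
import re

def decode_hex_string(token: bytes) -> bytes:
    """Decode a hex string, treating % comments like white-space (assumed reading)."""
    if not (token.startswith(b"<") and token.endswith(b">")):
        raise ValueError("not a hexadecimal string token")
    inner = token[1:-1]
    # Drop each comment: from '%' up to, but not including, the next CR or LF.
    inner = re.sub(rb"%[^\r\n]*", b"", inner)
    # White-space inside a hexadecimal string is ignored.
    digits = b"".join(inner.split())
    # An odd number of digits is completed with a trailing 0.
    if len(digits) % 2:
        digits += b"0"
    return bytes.fromhex(digits.decode("ascii"))

sample = b'<48% "H"\n45% "E"\n58% "X"\n>'
decoded_text = decode_hex_string(sample)
```

A processor that does not treat comments as white-space here would instead hit `%` as an invalid hex digit, which is exactly the ambiguity being discussed.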

mkl-public commented 8 months ago

Re item 3: my wording with "delimiter" in the opening sentence was an attempt to avoid arbitrary stream data (either filtered or unfiltered) that is not expressed in PDF tokens. A % sign in a JPEG or FLATE stream is definitely NOT a comment!

Hmm, I'd consider that too indirect, too easy to miss.

Also in the section on delimiter characters there is "The delimiter characters (, ), <, >, [, ], /, and % are special ... . They delimit syntactic entities such as arrays, names, and comments." Thus, the special thing about a '%' delimiter character is that it delimits a comment.

If you now try to make use of the '%' being a delimiter character when specifying where comments are allowed, you have something like a circular specification: A '%' is a delimiter character where it delimits a comment and a comment may occur where the '%' is a delimiter.

LegionMammal978 commented 8 months ago

If you now try to make use of the '%' being a delimiter character when specifying where comments are allowed, you have something like a circular specification: A '%' is a delimiter character where it delimits a comment and a comment may occur where the '%' is a delimiter.

I don't read 7.2.3 as being circular in that way. It starts off by defining the three character classes:

The PDF character set is divided into three classes referred to as regular, delimiter, and white-space characters. This classification enables the grouping of characters into tokens including separating syntactic constructs such as names and numbers from each other. The rules defined in this subclause apply to all characters in the file except within strings, streams, and comments.

The paragraph you quoted starts off by naming % as one of the delimiter characters, and it ends by referring to Table 2 for a list of delimiter characters, and Table 2 similarly contains the percent sign. So I'd say it pretty unambiguously says that the percent sign is a delimiter character, in all contexts where the character classification applies.

Then, given all the delimiter characters, the paragraph you quoted defines that they, alongside the "double character constructs" << and >>, delimit syntactic entities. Further, a bit implicitly, it seems to call a delimiter any sequence of characters that delimits syntactic entities.

(Though I do have one nitpick, aside from the implicit definition: the paragraph calls { and } "additional delimiter characters" within Type 4 functions, but Table 2 lists the two alongside the rest of the unconditional delimiter characters. Could the unqualified statement "'Table 2 — Delimiter characters' shows the delimiter characters." be read as implying that { and } are delimiter characters and not regular characters even outside of Type 4 functions?)

petervwyatt commented 8 months ago

If you see { and } in the delimiter sentence then you are using an OUTDATED PDF SPECIFICATION! This was fixed in ISO 32000-2:2020! Please update immediately!

LegionMammal978 commented 8 months ago

If you see { and } in the delimiter sentence then you are using an OUTDATED PDF SPECIFICATION! This was fixed in ISO 32000-2:2020! Please update immediately!

I am viewing "International Standard ISO 32000-2:2020 (PDF 2.0), Second edition: 2020-12, Includes errata from ISO 32000-2:2020/Amd 1". Within this document, subsection 7.2.3 says:

The delimiter characters (, ), <, >, [, ], /, and % are special ([...]). They delimit syntactic entities such as arrays, names, and comments. The delimiter characters { and } ([...]) are additional delimiter characters within Type 4 PostScript calculator functions (see 7.10.5 "Type 4 (PostScript calculator) functions"). [...] "Table 2 — Delimiter characters" shows the delimiter characters.

And "Table 2 — Delimiter characters" contains two entries for { and } alongside the other delimiter characters:

Glyph Decimal Hexadecimal Octal Name
{ 123 7B 173 LEFT CURLY BRACKET
} 125 7D 175 RIGHT CURLY BRACKET

I only mean to say that the final sentence, saying that Table 2 shows the delimiter characters, could be construed as expanding the scope of { and } to become unconditional delimiter characters, since they are listed in the table; Table 1 acts as a source of truth for the set of white-space characters, so Table 2 might be seen as a parallel.

mkl-public commented 8 months ago

Well, maybe it's not a circular definition as I summarized it. (I still have a gut feeling of some circularity there somewhere, though.)

Nonetheless, I still think that Peter's attempt to avoid arbitrary stream data (either filtered or unfiltered) that is not expressed in PDF tokens by his wording with "delimiter" in the opening sentence would make this very easy to misinterpret and an added "and outside streams" would prevent that.

petervwyatt commented 7 months ago

If we're being pedantic then the stream extent dictionary of a stream object can contain PDF comments so "stream object" is not correct - it's only the defiltered stream data between stream and endstream keywords that doesn't. Comments also cannot appear between xref and trailer keywords (see existing NOTE 2 in 7.5.4).

Circling back to this. Second attempt at clarifying the first sentence of 7.2.4 Comments (as identified in the OP):

The occurrence of a PERCENT SIGN (25h) delimiter introduces a comment in the following situations:

  • inside a content stream (see 7.8.2, "Content streams") after applying all filters and when not inside a literal string object
  • inside an object stream (see 7.5.7 "Object streams") after applying all filters and when not inside a literal string object
  • outside literal string objects (see 7.3.4.2 Literal strings), outside stream data after applying all filters (see 7.3.8 Stream objects) and not between xref and trailer keywords (see 7.5.4 Cross-reference table).

NOTE: Type 4 function streams may contain comments according to the semantics of the PostScript Language Reference - see 7.10.5.2, "Operators and operands".

But I don't like the wording of that last bullet... maybe this is easier done as references to these other sections where other comment-related requirements exist rather than trying to cram fully-aligned complexity here??

mkl-public commented 7 months ago

Well, you can drop the "after applying all filters" from item 3, this item does not refer to percent signs in stream data before applying filters either.

Alternatively one can express this along these lines which to me look a bit less crammed:

Any occurrence of the PERCENT SIGN (25h) outside literal string objects or stream data and not between xref and trailer keywords introduces a comment. Furthermore, a PERCENT SIGN (25h) also introduces a comment:

  • inside a content stream (see 7.8.2, "Content streams") after applying all filters and when not inside a literal string object, and
  • inside an object stream (see 7.5.7 "Object streams") after applying all filters and when not inside a literal string object.

This would first cover everything outside streams, the only case originally covered here. And as an add-on it would then cover the additional cases in stream content after filtering.

lrosenthol commented 4 weeks ago

I don't see a final proposal here. Remove the label?

petervwyatt commented 3 weeks ago

Based on @mkl-public proposal from Nov 21, 2023 but trying to clarify the order of precedence and sprinkling in some ISO "shall"s:

Any occurrence of the PERCENT SIGN (25h) outside literal string objects and outside stream data, and not between xref and trailer keywords, shall introduce a comment. Furthermore, a PERCENT SIGN (25h) shall also introduce a comment when:

  • inside a content stream (see 7.8.2, "Content streams") after applying all filters and when not inside a literal string object, or
  • inside an object stream (see 7.5.7 "Object streams") after applying all filters and when not inside a literal string object.
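The proposed wording can be restated as a small decision function. The Python sketch below is only a paraphrase of the proposal for illustration: the `Context` names and `percent_introduces_comment` are hypothetical, and the caller is assumed to already know which lexical context a given `%` byte sits in.

```python
from enum import Enum, auto

class Context(Enum):
    FILE_BODY = auto()       # ordinary file syntax, outside any stream data
    XREF_TABLE = auto()      # between the xref and trailer keywords
    CONTENT_STREAM = auto()  # content stream data after applying all filters
    OBJECT_STREAM = auto()   # object stream data after applying all filters
    OTHER_STREAM = auto()    # any other stream data (images, xref streams, ...)

def percent_introduces_comment(context: Context, in_literal_string: bool) -> bool:
    """Return True if a PERCENT SIGN (25h) starts a comment under the proposal."""
    if in_literal_string:
        return False  # never inside a literal string object
    return context in {
        Context.FILE_BODY,
        Context.CONTENT_STREAM,
        Context.OBJECT_STREAM,
    }
```

Note that the sketch follows the proposal in mentioning only literal strings; how hexadecimal strings are treated is left open, as in the discussion above.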