pdf-association / pdf-issues

Industry-based resolutions for issues and errata reported against any PDF-related specification
https://pdf-issues.pdfa.org/

Syntax of content stream operators between BX/EX #202

Open pesco opened 2 years ago

pesco commented 2 years ago

The syntax for content stream operators is not explicitly defined. This is a problem with compatibility sections as introduced by the last paragraph of 7.8.2 "Content streams":

They bracket a compatibility section, a portion of a content stream within which unrecognised operators shall be ignored without error.

It is not clear if this means that the following would be allowed:

BX 1 2 \o EX

Clearly 1 and 2 are objects and thus operands, followed by \o, a sequence of regular characters that is not an object. Paragraph 5 of 7.8.2 notes:

An operator keyword shall be distinguished from a name object by the absence of an initial SOLIDUS character (2Fh) (/).

This sounds like operator keywords might be meant to follow the same syntax as name objects after the initial solidus but stops short of saying so.

lrosenthol commented 2 years ago

Depends on what you mean by "would be allowed".

Is that fragment BX 1 2 \o EX allowed to be present in a content stream? Yes, it is. Why? Because nothing specifically prohibits it. It's conceptually the same as having this fragment

 1 2 \o (A String) Tj

It's just a series of valid PDF objects followed by an operator.

With respect to processing such content - that is a separate question which is not entirely spelled out by ISO 32000... on purpose, because it would force specific implementation decisions (e.g. whether the content stream parser is stack-based or not).
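For illustration only, here is a minimal sketch of one possible stack-based reading, written in Python. It is not what ISO 32000 mandates: the function names, the tiny `KNOWN_OPERATORS` set, the numbers-only operand test, and the choice to raise an error outside a compatibility section are all assumptions of this sketch. Input is assumed to be already split into tokens.

```python
import re

# Hypothetical, tiny subset of operators this sketch "recognises".
KNOWN_OPERATORS = {"m", "l", "S", "Tj"}

# Simplified numeric syntax; the sketch treats only numbers as operands.
NUMBER = re.compile(r"[+-]?([0-9]+\.?[0-9]*|\.[0-9]+)")


def is_operand(token: str) -> bool:
    return NUMBER.fullmatch(token) is not None


def run(tokens):
    """Process pre-split tokens; return (executed, ignored) operator lists."""
    operands, executed, ignored = [], [], []
    depth = 0  # BX/EX nesting depth, tracked with a simple counter
    for tok in tokens:
        if is_operand(tok):
            operands.append(tok)
            continue
        # Anything that is not an operand is treated as an operator.
        if tok == "BX":
            depth += 1
        elif tok == "EX":
            if depth == 0:
                raise ValueError("EX without matching BX")
            depth -= 1
        elif tok in KNOWN_OPERATORS:
            executed.append((tok, list(operands)))
        elif depth > 0:
            # Unrecognised operator inside BX/EX: drop it and its operands.
            ignored.append((tok, list(operands)))
        else:
            raise ValueError(f"unrecognised operator {tok!r} outside BX/EX")
        operands.clear()
    return executed, ignored
```

Under these assumptions, `run(["BX", "1", "2", "\\o", "EX"])` discards `\o` together with its two operands, while the same tokens without `BX`/`EX` raise an error. A parser that is not stack-based could reach a different but equally defensible result.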

pesco commented 2 years ago

What kind of object is \o? Unless I am missing something, it is not a valid object, so it cannot be an argument. But can it be an operator? What else can be an operator?

The issue is that the spec does not specify what operators look like.

petervwyatt commented 2 years ago

I agree the spec doesn't specify exactly what an operator looks like.

We leave it to inference from the lexical conventions, the whitespace list, the token delimiter list, the definition of basic PDF objects, etc. what "all other things" might be from a parsing/tokenization PoV. Effectively an operator is the same kind of token, but operators only occur inside content streams (as the content stream dialect also prohibits indirect references, it is slightly different from the syntax outside content streams).

So in your BX/EX example, \o should parse as a single token and would be seen as an operator with 2 integer operands on the stack (1 and 2). Because it's between BX/EX, this unknown thing (presumably an operator, since it didn't parse as anything else) should be skipped over...

But a lot of this is unstated directly...
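The inference described in this comment can be sketched roughly as follows in Python. The whitespace and delimiter sets are the ones listed in the spec's lexical conventions. Everything else is an assumption of the sketch: the function names, the simplified number syntax, and the fact that it only handles runs of regular characters (no strings, names, arrays, dictionaries or comments).

```python
import re

# A run of "regular" characters: anything that is neither PDF whitespace
# (NUL, HT, LF, FF, CR, SP) nor a delimiter ( ) < > [ ] { } / %.
REGULAR_RUN = re.compile(rb"[^\x00\t\n\x0c\r ()<>\[\]{}/%]+")

# Simplified numeric syntax, good enough for this illustration.
NUMBER = re.compile(rb"[+-]?(\d+\.?\d*|\.\d+)")


def classify(token: bytes) -> str:
    """Label a regular-character token by elimination."""
    if NUMBER.fullmatch(token):
        return "number"
    if token in (b"true", b"false", b"null"):
        return "object keyword"
    # Not recognisable as an object, so by inference it is an operator.
    return "operator candidate"


def regular_tokens(data: bytes):
    """Return (token, label) pairs for each run of regular characters."""
    return [(m.group(), classify(m.group())) for m in REGULAR_RUN.finditer(data)]
```

Applied to `b"BX 1 2 \\o EX"`, this yields `BX`, `\o` and `EX` as operator candidates and `1` and `2` as numbers, since the backslash is a regular character and so `\o` stays a single token.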

pesco commented 2 years ago

So what's the intent? An operator could be any token that is not something else? A more direct definition might be helpful, even if it were more restrictive (like the syntax of names sans solidus).

lrosenthol commented 2 years ago

A more direct definition might be helpful

It might be - but you are almost 30 years too late for doing so.

even if it were more restrictive

Why would we do that? That would make existing PDF files no longer compliant and that isn't good business...

pesco commented 2 years ago

I am not trying to be argumentative. I have no idea what the "in the wild" reality of this syntax is, so I was considering it entirely possible that the reality had never been anywhere near as generic as "any token at all that isn't otherwise recognized" but was rather closer to what's usually used for keywords (which operators are defined to be), namely something like alphanumeric strings, or strings of regular characters if you wanted to be generous.

In any case, as Peter wrote, the spec leaves this implicit and it should probably spell it out.
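To make the two candidate definitions mentioned here concrete, the following Python sketch compares them on a few tokens. Both patterns and the function name are assumptions made for illustration, not proposed spec text. The "regular" pattern does not exclude tokens that would parse as objects, such as numbers.

```python
import re

# Candidate 1 (hypothetical): operators are purely alphanumeric.
ALNUM = re.compile(r"[A-Za-z0-9]+")

# Candidate 2 (hypothetical): operators are any run of regular characters,
# i.e. no PDF whitespace and no delimiter characters.
REGULAR = re.compile(r"[^\x00\t\n\x0c\r ()<>\[\]{}/%]+")


def accepted_by(token: str) -> dict:
    """Report which candidate definition would accept the token."""
    return {
        "alphanumeric": ALNUM.fullmatch(token) is not None,
        "regular": REGULAR.fullmatch(token) is not None,
    }
```

With these patterns, `Tj` is accepted by both candidates, whereas `\o` is accepted only by the regular-character one. The same is true of existing operators such as `T*`, `'` and `"`, so a purely alphanumeric rule would exclude operators the spec already defines.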

petervwyatt commented 2 years ago

This and other related/interlinked issues (#194 #199 #201 #208 #209) are being discussed in the "Securing PDF" DG of ISO TC 171 SC 2 WG 8 and will be labelled as "Parked" here in GitHub until such time as a set of solutions can be proposed.

petervwyatt commented 3 months ago

See Errata #363 - I think that formally defining "PDF keywords" at a lexical level flows through and then also resolves this issue.

petervwyatt commented 1 month ago

Will be resolved as part of the resolution to Errata #363.