pdf-association / pdf-issues

Industry-based resolutions for issues and errata reported against any PDF-related specification
https://pdf-issues.pdfa.org/
63 stars 2 forks source link

Cyclic vs acyclic linked list chains via dictionary keys is ambiguous #102

Open petervwyatt opened 3 years ago

petervwyatt commented 3 years ago

Many keys in PDF dictionaries are designed to create "linked lists" of objects of the same dictionary type. However, ISO 32000-2:2020 is commonly written such that the clarity of when cycles in such linked lists are and are not allowed is unclear. In many cases, this can only be inferred from what the dictionary data structure represents and whether a cycle makes any logical sense in that context. Cycles are also a common cause of crashes so clearly this is ambiguous and not commonly understood.

The word "acyclic" is used just once (Table 16, Extends key) to explicitly prohibit such cycles:

A given collection consists of a set of streams whose Extends links form a directed acyclic graph.

The phrase "infinite cycle" is used just once, in the NOTE under 12.6.4.4:

NOTE It is an error for a target dictionary to have an infinite cycle (for example, one where a target dictionary refers to itself). Interactive PDF processors need to attempt to detect such cases and refuse to execute the action if one is found.

It would be good if PDF 2.0 used consistent terminology/requirements for when cycles are and are not permitted. This may be as simple as adding simple statements such as "Cycles are permitted", "Cycles shall not be present"/"Data shall be acyclic" or similar...

This issue is to identify and gather those keys/dictionaries which need improved clarity regarding their cyclic / acyclic requirements.

petervwyatt commented 3 years ago

Table 172, Markup Annotations, IRT key (In Reply To): linked list of annotations. AFAICT makes no logical sense if cycles are permitted so should be acyclic.

lrosenthol commented 3 years ago

There are also scenarios where cycles are permitted and logical but implementers forget to address that particular scenario in that implementation (walking dictionaries with /Parent or /P keys being the most common).

petervwyatt commented 3 years ago

Yeap - but to me "back pointers" and "page pointers" are obvious places where a multitude of cycles will get introduced. They are mostly through a variety of keys and dictionaries (so the heterogeneous case) but, as you say, people still get this stuff incorrect so being more explicit won't hurt.

I was really focused on the less obvious homogeneous cases where lists of the same dict should or should not cause cycles. e.g. IRT in markup annots.

bdoubrov commented 2 years ago

I think this question primarily concerns tree-like structures in PDF. I found eight (!) kinds of them:

What we would like to ensure is that these structures do not contain cycles in the terminology of graphs.

The exact entries used in these structures to define parent-child relationship may differ. So, I would suggest to word this generically rather than adding some specific requirements for some entry values.

pesco commented 2 years ago

I have been looking for any wording that would forbid infinite cyclic references such as:

8 0 obj
8 0 R
endobj

What should be the value of such an object?

petervwyatt commented 2 years ago

That is already covered in a few places:

petervwyatt commented 2 years ago

@pesco so what you suggest is perfectly valid as an indirect object but only in objects in a traditional file body (no object streams). And the note warns developers to potentially expect cycles which in a number of features are explicitly required (e.g. outlines) or prohibited (e.g. object stream Extends)

pesco commented 2 years ago

Right, so cyclic/infinite data structures are valid. Is the degenerate case of an object defined completely as itself also valid? If so, what is its type and value, in the same semantic sense that an indirect reference to, say, a number object is considered equivalent to that number object? Is it the null object? If so, is a reference to such an object valid wherever a null object is valid?

lrosenthol commented 2 years ago

Is the degenerate case of an object defined completely as itself also valid?

Do you mean just a simple direct object, such as 1 or (foo)?

or an indirect object version of each of those, such as:

1 0 obj
(foo)
endobj

in the same semantic sense that an indirect reference to, say, a number object is considered equivalent to that number object?

"equivalent" in what context? In the file format, they are not the same thing - there are direct objects and indirect objects and references. Only in application implementations - which are undefined - is there the consideration that references to indirect objects might be "equivalent", but that is out of scope.

petervwyatt commented 2 years ago

7.3.10 explicitly states "Except where documented to the contrary, any object value may be a direct or an indirect reference; the semantics are equivalent."

And by degenerate did you mean the following?:

10 0 obj
endobj

I cannot find explicit words covering this "empty object", but I think this would reduce to an implicit null object if it were the value of a dictionary key (for example). You cannot assume it is a particular type of PDF object so null seems the only logical conclusion.

pesco commented 2 years ago

Yes, I meant equivalent in the sense of exactly the sentence you quoted, Peter, and I was talking about the example I gave of an (indirect) object being defined as a reference to itself. I called that a degenerate case of a cyclic data structure in that there is no data there.

My inner computer scientist says that the semantic value of this object should either be undefined, i.e. the object invalid, or null, i.e. a reference to nothing.

There is already this sentence in 7.3.10:

An indirect reference to an undefined object [...] shall be treated as a reference to the null object.

Which could be amended with something like "An indirect object cyclically defined (via indirect references) as itself shall be treated as the null object."

lrosenthol commented 2 years ago

@pesco - I haven't validated this as such, but I would consider your example below as invalid

8 0 obj
8 0 R
endobj

Not because it refers back to itself, but because there is no object there. I would consider a more normal version also problematic/invalid...

8 0 obj
9 0 R
endobj

9 0 obj
(string)
endobj
petervwyatt commented 2 years ago

I think the difference here is between something that is lexically/syntactically valid (all of @lrosenthol examples and my degenerate "empty object" example) and those which are semantically valid (e.g. the last 2 examples from @lrosenthol). But this runs very close to making certain assumptions about when a processor/parser makes such determinations - and we avoid this in PDF standards.

So although I agree with @lrosenthol conclusions, pointing to specific language in the PDF spec is a different matter - as is creating new wording that doesn't imply specific implementation or design choices.

pesco commented 2 years ago

@lrosenthol If indeed

8 0 obj
8 0 R
endobj

is invalid, which was my first intuition, the spec does not make this clear. Your second example (an acyclic chain of references), however, is quite explicitly allowed by one of the statements Peter quoted earlier ("PDF syntax thus permits chains of [object references]").

@petervwyatt Your "empty object" is certainly not syntactically valid, unless your definition of syntax is extremely bare. Quoting 7.3.10 "Indirect objects":

The definition of an indirect object in a PDF file shall consist of its object number and generation number (separated by white-space), followed by the value of the object bracketed between the keywords obj and endobj.

Emphasis mine. "The definition ... shall consist of ... (separated by white-space) ... followed by ..." is clearly a syntactic rule in the common sense of the word.

But you are right that the original question (concerning a purely cyclic object definition) is not a syntactic one. Hence my asking what the (semantic) value and/or type of an object like 8 0 in the example above should be.

lrosenthol commented 2 years ago

Your second example (an acyclic chain of references), however, is quite explicitly allowed by one of the statements Peter quoted earlier ("PDF syntax thus permits chains of [object references]").

It's not - because the first object (8) is invalid in the same way that other is. The fact that it is referring to itself is NOT the invalid part - it's the reference doing nothing...

PDF doesn't really have this type of referencing - what I might call "aliasing".

FWIW: I am trying to confirm how some "well known implementations" react to this scenario...

lrosenthol commented 2 years ago

OK - I put together a quick sample file (attached) and tried a few known implementations.

lrosenthol commented 2 years ago

So why does Acrobat fail on the file...

Our parser is designed to only expect to find indirect references in certain places (i.e. as entries in an array and as the values of dictionary entries). An indirect reference like this is therefore unexpected. I will note that this behavior was added to our parser 20 years ago (2002, if anyone cares) in order to address bad files coming out of a popular producer of the time.

If anyone cares about more specifics, using the example PDF, the parser reads the value as the integer object 7 (the objID in the 7 0 R). BUT then the parser sees the next "token" in the stream (the 0) and it is not something it expects, and so it (correctly) throws an error.

petervwyatt commented 2 years ago

Replying to @lrosenthol (my definition of works is that it displays "Hello World"):

petervwyatt commented 2 years ago

Replying to @pesco: yes, my meaning was meant in a very "bare" sense, as you put it, since the PDF file format spec cannot assume anything about lexer/parser design: breadth-first, depth-first, lazy, contextual vs context-free, etc.

I think we can definitely improve the wording around what the "value" part of an object means, what is valid/invalid, and how this may or may not "reduce" to an implicit null object.

petervwyatt commented 2 years ago

Correcting for wrong comment in #208: Follow up from PDF TWG discussions where we noted that some PDF software don't read /Length keys (they instead work it out for themselves): I made a new version of the test PDF HelloWorld_v2.pdf but with the chain of indirect references for the page content stream.

Results from my testing are: