pdf-association / pdf-issues

Industry-based resolutions for issues and errata reported against any PDF-related specification
https://pdf-issues.pdfa.org/

Dictionary entries with null values #199

Open pesco opened 2 years ago

pesco commented 2 years ago

Description

7.3.7 "Dictionary objects" makes the following two statements:

(A)

A dictionary entry whose value is null (see 7.3.9, "Null object") shall be treated the same as if the entry does not exist.

(B)

Multiple entries in the same dictionary shall not have the same key.

This does not clearly define the semantics of something like the following:

<< /Foo null /Foo null >>

Is this forbidden by rule (B) or allowed (equivalent to << >>) due to rule (A)?
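The two readings can be made concrete with a minimal sketch. It assumes the dictionary has already been parsed into a list of (key, value) pairs and that Python's None stands in for the PDF null object; the function names are hypothetical and not taken from any PDF library.

```python
from collections import Counter

NULL = None  # assumption: None models the PDF null object


def duplicate_keys(entries):
    """Return the keys that occur more than once in a list of (key, value) pairs."""
    counts = Counter(key for key, _ in entries)
    return sorted(key for key, n in counts.items() if n > 1)


def violates_b_as_written(entries):
    """Reading 1: rule (B) is checked against the keys exactly as written."""
    return bool(duplicate_keys(entries))


def violates_b_after_a(entries):
    """Reading 2: rule (A) first drops null-valued entries, then rule (B) is checked."""
    surviving = [(key, value) for key, value in entries if value is not NULL]
    return bool(duplicate_keys(surviving))


example = [("Foo", NULL), ("Foo", NULL)]  # << /Foo null /Foo null >>
print(violates_b_as_written(example))  # True
print(violates_b_after_a(example))     # False
```

The same input is rejected under the first reading and accepted under the second, which is the ambiguity in question.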

Additional Context

Note that null values can occur implicitly due to unresolvable references, so if the above example were allowed as << >>, rule (B) could not be verified in general without full cross-reference resolution. In addition, unknown "cross-reference stream entry types" (cf. #194) would mean that the following could be valid or invalid depending on which version of PDF the reader supports.

<< /Foo 666 0 R /Foo (bar) >>
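A rough sketch of why the verdict would then depend on the reader: here the cross-reference table is modelled as a plain Python dict, an unresolvable reference resolves to None (standing in for null), and the check is done after dropping null entries. The Ref class, the resolve helper and the two example tables are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """Hypothetical stand-in for an indirect reference such as 666 0 R."""
    num: int
    gen: int


def resolve(value, xref):
    """Resolve an indirect reference; an unknown object yields None (null)."""
    if isinstance(value, Ref):
        return xref.get((value.num, value.gen))
    return value


def has_duplicates_after_null_removal(entries, xref):
    """Apply rule (A) first (skip null values), then look for repeated keys."""
    seen = set()
    for key, value in entries:
        if resolve(value, xref) is None:
            continue  # treated as if the entry does not exist
        if key in seen:
            return True
        seen.add(key)
    return False


entries = [("Foo", Ref(666, 0)), ("Foo", "bar")]  # << /Foo 666 0 R /Foo (bar) >>

reader_without_object = {}              # this reader cannot resolve 666 0 R
reader_with_object = {(666, 0): 42}     # this reader can

print(has_duplicates_after_null_removal(entries, reader_without_object))  # False
print(has_duplicates_after_null_removal(entries, reader_with_object))     # True
```

Under this reading, one and the same dictionary passes for the first reader and fails for the second, and neither can decide without consulting the cross-reference data.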

Suggestion

Clearly define rule (B) as taking precedence over rule (A) by moving it into the first paragraph of section 7.3.7 and wording it more directly.

A dictionary object is an associative table containing pairs of objects, known as the dictionary's entries. The first element of each entry is the key and the second element is the value. The key shall be a name (...) and each key shall occur only once.

In addition, rule (A) might be augmented to explicitly rule out any confusion.

A dictionary entry whose value is null (see 7.3.9, "Null object") shall be treated semantically the same as if the entry does not exist. Note that multiple entries for the same key shall still not occur, even with null values.

petervwyatt commented 2 years ago

ISO 32000 is a file format specification to define what valid PDFs must ("shall"!) be. So as soon as a PDF violates a mandated "shall" requirement (such as 2 keys in a dictionary with the same name) then how that PDF is to be interpreted is beyond the scope of ISO 32000. Note also the use of the term "key" so this requirement is clearly not inclusive of the value of the key as described by T&D 3.16 for "dictionary object". And there is no conceptual 'hierarchy' of "shall" requirements - all requirements need to be met.

So because your example dict has 2 /Foo keys it fails a mandated requirement for PDF and thus is not a valid PDF. End of story. 😀

PS. Although it is not a formally defined term, we also try to use the term "entry" to describe a key/value pair, but that may not be consistent everywhere.

pesco commented 2 years ago

> So because your example dict has 2 /Foo keys

Does it? Rule (A) clearly states that I should treat these entries as though they did not exist, so what I wrote is clearly equivalent to the empty dictionary which does not have any duplicate keys.

As I read it, the spec is not clear that rule (B) should be applied to the keys as written rather than the list of keys that remains after the application of rule (A).

It doesn't help that rule (B) is mentioned after the semantic description, so it's easy to misread the context.

petervwyatt commented 2 years ago

This topic is now being discussed in the ISO TC 171 SC 2 WG 8 "Securing PDF" discussion group. This discussion group will make recommendations back to this errata at the conclusion of those and other related discussions.

We agree that it is not clear that the vast majority of ISO 32000-2 requirements apply as file format requirements for valid files (i.e. bytes in a valid PDF file) and not processor requirements (i.e. not a DOM) - and certainly not under error recovery conditions. We do also very much appreciate that the majority of PDF processors will support such PDFs, even without an official position! So flagging this issue as parked for now.

@pesco Hopefully you have enough information not to hold you up.

petervwyatt commented 2 years ago

This and other related/interlinked issues (#194 #201 #202 #208 #209) are being discussed in the "Securing PDF" DG of ISO TC 171 SC 2 WG 8 and will be labelled as "Parked" here in GitHub until such time as a set of solutions can be proposed.

dudasl commented 1 year ago

What I get from this discussion is that @pesco argues that << /Foo null /Foo null >> does not violate the PDF specification, because a processor reading the dictionary should ignore the first /Foo null and then read the second /Foo null (and again ignore it 😊). If we apply rule A first and rule B afterwards, then we must say it is equivalent to << >> and does not violate the PDF specification. But the writers of the PDF specification probably never thought about it this way. They probably assumed that both rules have the same weight and that both rules A and B apply to the syntax of the dictionary. Here is a table describing all the possibilities:

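| Dictionary | A equal B | A then B | B then A (note 2) |
|---|---|---|---|
| << /Foo null /Foo null >> | (note 1) | ✔️ (same as << >>) | |
| << /Foo (bar) /Foo null >> | | | |
| << /Foo null /Foo (bar) >> | (note 1) | ✔️ (same as << /Foo (bar) >>) | |
| << /Foo (bar) /Foo (foo) >> | | | |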

Notes to table:

  1. Questionable, because we (as humans) think of both rules as applying at the same time, but machines must apply them in some order, and it depends on the PDF processor which happens first.
  2. In the "B then A" case, it depends on how the PDF processor reads the dictionary. There are several options, for example:
    • It may read all PDF objects in the dictionary and then validate that all keys are distinct. This is more memory intensive but works.
    • It can read two PDF objects (key /Foo + value (bar)) and add them to the internal map, then read the next pair of PDF objects (/Foo + null). As the value is a null object, it may not process this pair (to speed up the reading process).
    • various others...
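The first two reading strategies in note 2 can be sketched as follows. This is an illustrative sketch only: it assumes the tokenizer has already produced (key, value) pairs, uses None for the null object, and the function names are hypothetical.

```python
def read_then_validate(pairs):
    """Read every pair first, then check that all keys are distinct."""
    keys = [key for key, _ in pairs]
    if len(keys) != len(set(keys)):
        raise ValueError("duplicate key in dictionary")
    # Only after validation are null-valued entries dropped.
    return {key: value for key, value in pairs if value is not None}


def read_incrementally(pairs):
    """Insert pairs into a map one at a time, skipping null values early."""
    result = {}
    for key, value in pairs:
        if value is None:
            continue  # the pair is never processed, so its key is never checked
        if key in result:
            raise ValueError("duplicate key in dictionary")
        result[key] = value
    return result


pairs = [("Foo", "bar"), ("Foo", None)]  # << /Foo (bar) /Foo null >>

print(read_incrementally(pairs))  # {'Foo': 'bar'}
try:
    read_then_validate(pairs)
except ValueError as err:
    print("rejected:", err)  # rejected: duplicate key in dictionary
```

Both functions are plausible implementations, yet they disagree on the same input, which is why the order of application matters to a machine even if it does not to a human reader.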

I have worked on several low-level PDF object processors and the implementations differ. Some tokenize the file content first and then process it. Some process byte by byte. Some processors have syntax checks at the beginning, some do not. And so on... What I have learned in almost 20 years in PDF is that PDF specifications do not want to dictate how PDF processors process files. It's similar to state/federal laws: laws are general rules we should follow, but someone needs to explain the rules. And the PDF Association is trying to clear up the misconceptions that may arise.

My opinion: In all cases, these files are not ISO-32000 compliant and it's up to PDF processors if/how they deal with these situations. Knowing programmers, they use some type of map and do not take too much care about what happens in these situations. The map somehow deals with it (may not add, raise an exception, overwrite, ...). 😄
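As an illustration of the three map behaviours mentioned above, here is a small sketch using Python's built-in dict. It is not a claim about how any particular PDF processor is written; the strict helper is hypothetical.

```python
pairs = [("Foo", "bar"), ("Foo", "foo")]  # << /Foo (bar) /Foo (foo) >>

# "Overwrite": building a dict directly keeps the last value for a repeated key.
last_wins = dict(pairs)

# "May not add": setdefault only stores a key the first time it is seen.
first_wins = {}
for key, value in pairs:
    first_wins.setdefault(key, value)


# "Raise an exception": an explicit check before inserting.
def strict(pairs):
    result = {}
    for key, value in pairs:
        if key in result:
            raise ValueError(f"duplicate key /{key}")
        result[key] = value
    return result


print(last_wins)   # {'Foo': 'foo'}
print(first_wins)  # {'Foo': 'bar'}
```

Which of these a processor ends up with is often just a side effect of the container it uses, not a deliberate decision.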

Sorry about the long comment.

petervwyatt commented 1 year ago

@dudasl Your logic and understanding are totally aligned with the ISO committee. We are trying to develop some guidance for all readers of PDF ISO standards so that everyone arrives at this common understanding of what to expect and not expect from PDF ISO standard wording.

pesco commented 1 year ago

> @pesco argues << /Foo null /Foo null >> does not violate PDF specification

I do not. I posed it as a question, specifically to avoid taking any particular stance. I wanted to point out that the text of the specification can reasonably be read in one or the other sense and therefore needs clarification.

petervwyatt commented 4 months ago

Trying to progress... and also noting that the PDF Association recently published my article on how to interpret ISO-ese:

Suggest to simply augment the existing requirement "Multiple entries in the same dictionary shall not have the same key." to state: "Multiple entries in the same dictionary shall not have the same key, including when the value is null."

This then avoids all order-of-operation considerations about "treated the same as if the entry does not exist".

petervwyatt commented 4 months ago

PDF TWG would like the following direction "PDF Processor shall treat it as if..." for the processor and maybe a NOTE. Wordsmith later for review.

pesco commented 4 months ago

i.e. "no duplicate keys" as a file format requirement and "null = not there" as a processor requirement? sounds good to me.

Many thanks for your article and explanations. FWIW, I think some of the confusion indeed comes down to terminology. I say "semantics", you say "processor requirement". I say "syntax" or (above) "as written", you say "file format requirement". May not be an exact match, but you get the gist. I (personally) also tend to think denotationally wrt. semantics ("What value is a syntactic structure mapped to, abstractly?"), whereas "processor requirements" are (usually?) explained in terms of operations ("What should the software do?").