pdf-association / arlington-pdf-model

A vendor- and implementation-independent specification-derived, machine-readable model of PDF.
Apache License 2.0

Metadata and AF entries permitted in any PDF object #65

Closed bdoubrov closed 1 year ago

bdoubrov commented 1 year ago

PDF 2.0 permits Metadata and AF entries in any PDF object. Currently, the model specifies these two keys only for the dictionaries where they are explicitly listed. But ISO 32000-2 (14.3 - Metadata, 14.13 - Associated files) allows these entries even where they are not explicitly mentioned.

Should this be addressed in the Arlington model as well, either through a special clause saying that these two keys are permitted everywhere, or even by explicitly adding them to all objects?

petervwyatt commented 1 year ago

You are correct with your comment: these keys are in the data model only where they are explicitly mentioned in ISO 32K.

In the TestGrammar (C++) PoC, I explicitly coded a check across all dictionaries that reports an INFO message when these keys are encountered but are not in the Arlington model (i.e. are not explicitly stated in 32K). They are not errors, but in my mind they have a different "level" of official-ness than private / undocumented entries.

A similar discussion could be had about the dictionaries that 32K defines as explicitly allowing arbitrary key names (which I encoded as * keys in such dicts), since any dictionary can have any key simply because PDF is extensible by design.

Do you have a preference?
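For illustration, the reporting behaviour described above could be sketched roughly as follows. This is not the TestGrammar code (which is C++); it is a minimal Python sketch, and the names `Level`, `UNIVERSAL_KEYS` and `classify_key` are hypothetical. It assumes the keys declared for a dictionary are available as a plain set, with `*` standing for "arbitrary key names allowed".

```python
from enum import Enum


class Level(Enum):
    OK = "ok"                      # key is declared in the model for this dictionary
    INFO = "info"                  # key is permitted everywhere by PDF 2.0 but not declared here
    UNDOCUMENTED = "undocumented"  # private or otherwise unknown key


# Keys that ISO 32000-2 (14.3 and 14.13) permits in any PDF object.
UNIVERSAL_KEYS = {"Metadata", "AF"}


def classify_key(key, model_keys):
    """Classify one dictionary key against the keys declared for that dictionary.

    `model_keys` is a hypothetical set of declared key names; a "*" member
    means the dictionary explicitly allows arbitrary key names.
    """
    if key in model_keys or "*" in model_keys:
        return Level.OK
    if key in UNIVERSAL_KEYS:
        return Level.INFO
    return Level.UNDOCUMENTED
```

With this sketch, `classify_key("AF", {"Type"})` yields `Level.INFO`, whereas an unknown key such as `XX_Private` falls through to `Level.UNDOCUMENTED`, which mirrors the distinction in "official-ness" mentioned above.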

bdoubrov commented 1 year ago

I find it really important that all requirements of the Arlington model are transparent, either as part of the TSV grammar or as extra documentation clauses, for example in INTERNAL_GRAMMAR.md.

For example, it would be great if there were a list of all cases that are treated with different levels of severity. Ideally, this would be defined in a machine-readable syntax similar to the TSV files, but at the very least it should be unambiguously documented. This would leave no room for different interpretations of the Arlington model by different people / implementations.

Then having AF and Metadata entries reported as INFO messages, where they are not explicitly declared, would certainly make sense.
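To make the idea concrete, a machine-readable severity list might look something like the sketch below. The TSV layout, the column names (`Condition`, `Severity`, `Note`) and the rows are purely hypothetical and are not part of the Arlington model; only the INFO level for undeclared AF / Metadata entries comes from this discussion.

```python
import csv
import io

# Hypothetical TSV content: column names and rows are illustrative only
# and do not exist in the Arlington model.
SEVERITY_TSV = (
    "Condition\tSeverity\tNote\n"
    "UniversalKeyNotInModel\tINFO\tMetadata or AF present but not declared for this object\n"
    "UnknownKey\tWARNING\tKey is neither declared nor universally permitted\n"
)


def load_severity_rules(text):
    """Parse the hypothetical severity TSV into a {condition: severity} mapping."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row["Condition"]: row["Severity"] for row in reader}


rules = load_severity_rules(SEVERITY_TSV)
```

An implementation could then look up `rules["UniversalKeyNotInModel"]` instead of hard-coding the level, so that two implementations reading the same file would report the same severity.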

petervwyatt commented 1 year ago

I'll make a new MD file called MODEL_NOTES (it's not really the internal grammar per se). This can document such things, as well as known model limitations, etc.