pdf-association / pdf-issues

Industry-based resolutions for issues and errata reported against any PDF-related specification
https://pdf-issues.pdfa.org/

File identifiers (14.4): Observations and possible improvements #328

Closed stechio closed 5 months ago

stechio commented 1 year ago

SUB-ISSUE 1: File identifiers definition

Subclause 14.4 (File identifiers) states:

The value of this entry shall be an array of two byte strings. The first byte string shall be a permanent identifier based on the contents of the PDF file at the time it was originally created [...]

To me, the requirement of being based on the contents of the PDF file is arbitrary and misleading; the fact that, since PDF 2.0, even the suggested MD5-based hashing algorithm has dropped its main content-related input ("The values of all entries in the file's document information dictionary [...]" (see PDF 1.7)) speaks volumes — now it retains merely "The size of the PDF file in bytes" as a barely content-related input!

As later stated, the sole relevant attribute of a computed identifier is uniqueness (which implies the statistical robustness of the generative algorithm against collisions):

PDF writers should attempt to ensure the uniqueness of file identifiers.

Even the NOTE to subclause 14.4 in PDF 1.7 stressed that "all that matters is that the identifier is likely to be unique". What about standard generative algorithms such as UUID as alternatives to the suggested MD5-based hashing algorithm? I do not advocate dismissing the suggested hashing algorithm, only rewording subclause 14.4 to make clearer the distinction between specification (identifier uniqueness) and implementation (algorithms suitable to meet the specification).

Therefore, IMO, the sentence in subclause 14.4 should be reformulated like so:

The value of this entry shall be an array of two unique byte strings, each at least 16 bytes long. The first byte string shall be a permanent identifier, not to change when the PDF file is updated. The second byte string shall be a changing identifier, computed when the PDF file is updated (see 7.5.6, "Incremental updates").

The last paragraph ("PDF writers should attempt to ensure the uniqueness [...]") would in turn be replaced by:

Identifier uniqueness should be attempted by employing a suitable generative algorithm, such as UUID (described in Internet RFC 4122); this may also be achieved by means of a message digest algorithm such as MD5 (described in Internet RFC 1321), using the following information:

- the current time;
- a string representation of the PDF file's location;
- the size of the PDF file in bytes.
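To make the specification/implementation split concrete, here is a minimal Python sketch of the two approaches named in the proposed wording. The function names are hypothetical, and the way the three inputs are serialised before hashing is an arbitrary choice of this sketch (the quoted text does not define one), so it should be read as an illustration rather than as the algorithm any particular PDF writer uses.

```python
import hashlib
import time
import uuid
from typing import Optional


def uuid_file_id() -> bytes:
    """Return a 16-byte identifier taken from a random (version 4) UUID."""
    return uuid.uuid4().bytes


def md5_file_id(location: str, size_in_bytes: int,
                now: Optional[float] = None) -> bytes:
    """Return a 16-byte MD5 digest over the current time, the file's
    location and its size. The serialisation is an assumption of this
    sketch, not something the quoted text prescribes."""
    if now is None:
        now = time.time()
    digest = hashlib.md5()
    for part in (repr(now), location, str(size_in_bytes)):
        digest.update(part.encode("utf-8"))
        digest.update(b"\x00")  # separator so adjacent fields cannot run together
    return digest.digest()


def format_id_entry(permanent: bytes, changing: bytes) -> str:
    """Render the trailer /ID entry as two hexadecimal strings."""
    return f"/ID [<{permanent.hex()}><{changing.hex()}>]"
```

Either generator yields a 16-byte string, which is all the proposed "each at least 16 bytes long" wording asks for; on update, a writer would keep the first string and regenerate only the second.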


SUB-ISSUE 2: File identifiers mapping to XMP metadata

Apparently, the core PDF specification does not document the relation between PDF file identifiers and document identifiers in XMP metadata. Table 349 in subclause 14.3.3 (Document information dictionary) suggests a precise mapping between information entries and document-level XMP metadata, but there is no indication of how to reconcile the file identifier's (permanent and changing) byte strings with the semantically corresponding document identifiers (DocumentID and InstanceID) in the Media Management namespace. Although higher-level PDF specs in the ISO stack (such as PDF/A) are designed to add constraints atop the core PDF spec, the latter could nonetheless provide a general mapping suggestion the same way it already does for the document information dictionary entries (after all, even the mapping of those entries is mentioned at both the core PDF and PDF/A levels!).

In XMP metadata, document identifiers are typed as GUID, which the XMP spec (part 1 (ISO 16684-1:2011), Annex A) describes eloquently (emphasis is mine):

This document defines three GUIDs that are intended to help manage copies of a resource [(xmpMM:DocumentID, xmpMM:InstanceID, xmpMM:OriginalDocumentID)], to identify a specific state when desired, and to associate related copies of the same conceptual resource. [...] The use of robust GUIDs is encouraged; having globally unique values is important. In practical terms, this means that the probability of a collision is so remote as to be effectively impossible. Typically, 128-bit or 144-bit numbers are used, encoded as hexadecimal strings. This document does not require any particular methodology for creating a GUID, nor does it require any specific means of formatting the GUID as a simple XMP value. The only valid operations on XMP IDs are to create them, to assign one to another, and to compare two of them for equality. Comparisons use the Unicode string value as-is, using a direct byte-for-byte check for equality. IETF RFC 4122 (http://www.ietf.org/rfc/rfc4122.txt) describes ways to create and format GUID strings. For privacy, the use of a MAC address is not recommended. See section 4.1.6 of RFC 4122 for details and alternatives.
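One practical consequence of the "byte-for-byte" comparison rule is that two hexadecimal renderings of the same bytes are different XMP IDs if their letter case differs. The short Python sketch below illustrates this; `same_xmp_id` is a hypothetical helper name, and the sample values are made up.

```python
def same_xmp_id(a: str, b: str) -> bool:
    """Compare two XMP IDs as-is: no case folding, trimming or other
    normalisation, only an equality check on the encoded string values."""
    return a.encode("utf-8") == b.encode("utf-8")


lower = bytes.fromhex("33aa").hex()   # "33aa"
upper = lower.upper()                 # "33AA": same bytes, different string

print(same_xmp_id(lower, lower))
print(same_xmp_id(lower, upper))
```

A writer that derives an XMP ID from a byte string would therefore need to pick one rendering (for example, lowercase hex) and use it consistently.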

According to the description above, PDF file identifier byte strings seem compatible with document identifiers in the XMP Media Management namespace; furthermore, AFAIK, PDF/A allows freedom of identification scheme (identifiers may be externally based (e.g., ISBN) or internally based (e.g., UUID)). Could it be acceptable in general cases (i.e., without specific constraints) to assign the permanent file identifier byte string of a given PDF file to the corresponding xmpMM:DocumentID property, and its changing file identifier byte string to the corresponding xmpMM:InstanceID property?

This way, it would be possible to have, by default, a single pair of identifiers, without unnecessary redundancies, like so (byte strings here below are obviously fictitious):

trailer
<<
  /Size 5275
  /Root 118 0 R
  /Info 2751 0 R
  /ID [<33333333333333333333333333333333><44444444444444444444444444444444>]
  /Prev 259675
>>
. . .
2152 0 obj<< /Subtype /XML /Length 4751 /Type /Metadata >>
stream
<?xpacket begin="" id="XXXXXXXXXXXXXXXXXXXXXXXX"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description xmlns:xmpMM='http://ns.adobe.com/xap/1.0/mm/'
  xmpMM:DocumentID='33333333333333333333333333333333'
  xmpMM:InstanceID='44444444444444444444444444444444'/>
. . .
</rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>
endstream
endobj
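The mapping asked about above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names are hypothetical, the trailer's /ID array is assumed to use hexadecimal strings with an even number of digits (literal strings and odd-length hex are not handled), and rendering the bytes as lowercase hex without any prefix is a choice of this sketch that neither spec mandates.

```python
import re

# Matches an /ID array made of two hexadecimal strings (assumption of this
# sketch; PDF also permits literal strings, which are not handled here).
ID_PATTERN = re.compile(
    rb"/ID\s*\[\s*<([0-9A-Fa-f\s]*)>\s*<([0-9A-Fa-f\s]*)>\s*\]"
)


def read_trailer_ids(trailer: bytes) -> tuple[bytes, bytes]:
    """Return the (permanent, changing) identifier byte strings."""
    match = ID_PATTERN.search(trailer)
    if match is None:
        raise ValueError("no hexadecimal /ID array found")
    permanent = bytes.fromhex(match.group(1).decode("ascii"))
    changing = bytes.fromhex(match.group(2).decode("ascii"))
    return permanent, changing


def xmp_id_attributes(permanent: bytes, changing: bytes) -> str:
    """Render the two identifiers as xmpMM attributes (lowercase hex)."""
    return (
        f"xmpMM:DocumentID='{permanent.hex()}'\n"
        f"xmpMM:InstanceID='{changing.hex()}'"
    )


trailer = (
    b"<< /Size 5275 /ID [<33333333333333333333333333333333>"
    b"<44444444444444444444444444444444>] /Prev 259675 >>"
)
print(xmp_id_attributes(*read_trailer_ids(trailer)))
```

Whether such a default mapping is acceptable is exactly the open question of this sub-issue; the sketch only shows that it is mechanically straightforward.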

petervwyatt commented 5 months ago

Sub-Issue 1

I think the phrases that mention "the contents" are what is causing confusion - this is not supposed to be referring to the human-interpretable content in a PDF document, but the actual bunch of bytes that make up the file. I think they're unnecessary and simply restating without that term addresses the issue:

"The first byte string shall be a permanent identifier based on the contents of the PDF file at the time it was originally created and shall not change when the PDF file is updated. The second byte string shall be a changing identifier based on the PDF file's contents at the time it was last updated (see 7.5.6, "Incremental updates")."

Regarding the mention of MD5 in the last para of 14.4: a simple edit to add "... or UUID (described in Internet RFC 4122) ..." fixes that. RFC 4122 would then also need to be added to the normative references.

petervwyatt commented 5 months ago

Sub-issue 2

I don't think anything should be changed in ISO 32000-2 since "general PDF" only needs to be concerned with PDF file identifiers in the trailer (Sub-Issue #1) as XMP metadata is only treated as a general blob (stream). All 4 instances of XMP InstanceID and DocumentID are in Annex H and they have since been removed via Errata #402.

However, I do also acknowledge a lack of clarity about XMP, PDF file identifiers, and general XMP "best practice" for the subsets such as PDF/A. I will see if anyone (PDF/A TWG??) might like to take up this topic...

lrosenthol commented 5 months ago

RFC 4122 was superseded by RFC 9562 (https://datatracker.ietf.org/doc/html/rfc9562).

petervwyatt commented 5 months ago

PDF TWG agreed to strike the "content" phrasing but not to add a UUID reference.