stechio closed this 5 months ago
I think the phrases that mention "the contents" are what is causing confusion - this is not supposed to be referring to the human-interpretable content in a PDF document, but the actual bunch of bytes that make up the file. I think they're unnecessary and simply restating without that term addresses the issue:
"The first byte string shall be a permanent identifier based on ~~the contents of~~ the PDF file at the time it was originally created and shall not change when the PDF file is updated. The second byte string shall be a changing identifier based on the PDF file’s ~~contents~~ at the time it was last updated (see 7.5.6, "Incremental updates")."
Regarding the mention of MD5 in the last paragraph of 14.4: a simple edit to add "... or UUID (described in Internet RFC 4122) ..." fixes that. RFC 4122 then also needs to be added to the NormRefs.
I don't think anything should be changed in ISO 32000-2 since "general PDF" only needs to be concerned with PDF file identifiers in the trailer (Sub-Issue #1) as XMP metadata is only treated as a general blob (stream). All 4 instances of XMP InstanceID and DocumentID are in Annex H and they have since been removed via Errata #402.
However, I do also acknowledge a lack of clarity about XMP, PDF file identifiers, and general XMP "best practice" for the subsets such as PDF/A. I will see if anyone (PDF/A TWG??) might like to take up this topic...
RFC 4122 was superseded by RFC 9562 (https://datatracker.ietf.org/doc/html/rfc9562).
PDF TWG agrees to strike the "content" phrasing but not to add the UUID reference.
SUB-ISSUE 1: File identifiers definition
Subclause 14.4 (File identifiers) states:
To me, the requirement of being based on the contents of the PDF file is arbitrary and misleading; the fact that, since PDF 2.0, even the suggested MD5-based hashing algorithm has dropped its main content-related input ("The values of all entries in the file's document information dictionary [...]" (see PDF 1.7)) speaks volumes — now it retains merely "The size of the PDF file in bytes" as a barely content-related input!
As later stated, the sole relevant attribute of a computed identifier is uniqueness (which implies the statistical robustness of the generative algorithm against collisions):
Even the NOTE to subclause 14.4 in PDF 1.7 stressed that "all that matters is that the identifier is likely to be unique"... What about standard generative algorithms like, say, UUID as alternatives to the suggested MD5-based hashing algorithm? I do not advocate dismissing the suggested hashing algorithm, just rewording subclause 14.4 to make clearer the distinction between specification (identifier uniqueness) and implementation (algorithms suitable to meet the specification).
Therefore, IMO, the sentence in subclause 14.4 should be reformulated, like so:
The value of this entry shall be an array of two unique byte strings, each at least 16 bytes long. The first byte string shall be a permanent identifier, not to change when the PDF file is updated. The second byte string shall be a changing identifier, computed when the PDF file is updated (see 7.5.6, "Incremental updates").
While the last paragraph ("PDF writers should attempt to ensure the uniqueness [...]") would be replaced by:
Identifier uniqueness should be attempted by employing a suitable generative algorithm, such as UUID (described in Internet RFC 4122); this may also be achieved by means of a message digest algorithm such as MD5 (described in Internet RFC 1321), using the following information:
- the current time;
- a string representation of the PDF file's location;
- the size of the PDF file in bytes.
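To make the specification/implementation distinction concrete, here is a minimal Python sketch of both approaches named in the proposed wording. It is only an illustration: the function names are hypothetical, and the way the three inputs are serialized before hashing is my own assumption, since the spec text does not prescribe one.

```python
import hashlib
import time
import uuid
from typing import Optional, Tuple


def uuid_file_id() -> bytes:
    """Return a 16-byte identifier taken from a random (version 4) UUID."""
    return uuid.uuid4().bytes


def md5_file_id(location: str, size_in_bytes: int,
                now: Optional[float] = None) -> bytes:
    """Return a 16-byte MD5 digest over the three inputs listed above.

    The serialization of the inputs (repr of the time, UTF-8 location,
    decimal size) is an assumption made for this sketch.
    """
    if now is None:
        now = time.time()
    digest = hashlib.md5()
    digest.update(repr(now).encode("ascii"))           # the current time
    digest.update(location.encode("utf-8"))            # the file's location
    digest.update(str(size_in_bytes).encode("ascii"))  # the file size in bytes
    return digest.digest()


def ids_after_update(permanent_id: bytes) -> Tuple[bytes, bytes]:
    """Keep the permanent identifier and compute a fresh changing one."""
    return (permanent_id, uuid_file_id())
```

Either function yields a byte string of exactly 16 bytes, which meets the "at least 16 bytes long" requirement; which one a writer uses is an implementation choice, as long as the result is likely to be unique.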
SUB-ISSUE 2: File identifiers mapping to XMP metadata
Apparently, there is a lack of documentation at core PDF level regarding the relation between PDF file identifiers and document identifiers in XMP metadata: while Table 349 in subclause 14.3.3 (Document information dictionary) suggests a precise mapping between information entries and document-level XMP metadata, there is no indication for the reconciliation between file identifier's (permanent and changing) byte strings and semantically-corresponding document identifiers (DocumentID and InstanceID) in Media Management namespace. Although higher-level PDF specs in ISO stack (such as PDF/A) are designed to add constraints atop the core PDF spec, the latter could nonetheless provide a general mapping suggestion the same way it already does for the document information dictionary entries (after all, even the mapping of those entries is mentioned both at core PDF and PDF/A levels!).
In XMP metadata, document identifiers are typed as GUID, which the XMP spec (part 1 (ISO 16684-1:2011), Annex A) describes eloquently (emphasis is mine):
According to the description here above, PDF file identifier byte strings seem compatible with document identifiers in XMP Media Management namespace; furthermore, AFAIK, PDF/A allows freedom of identification scheme (identifiers may be externally based (eg, ISBN) or internally based (eg, UUID)). Could it be acceptable in general cases (ie, without specific constraints) to assign the permanent file identifier byte string of a given PDF file to the corresponding xmpMM:DocumentID property, and its changing file identifier byte string to the corresponding xmpMM:InstanceID property?
This way, it would be possible to have, by default, a single pair of identifiers, without unnecessary redundancies, like so (byte strings here below are obviously fictitious):