pdf-association / pdf-issues

Industry-based resolutions for issues and errata reported against any PDF-related specification
https://pdf-issues.pdfa.org/

PDF/A-4 (ISO 19005-4): handling of embedded, associated files which are not PDF themselves #385

Open u-fischer opened 2 months ago

u-fischer commented 2 months ago

We are producing tagged PDF 2.0 files which attach MathML and TeX files as associated files (AF) to Formula structure elements. When we tried to validate these files against PDF/A-4 as well, we got failures, and we are unsure about the correct handling according to the spec.

In our files we have AFs with the registered media type application/mathml+xml and the unregistered (but widely used, see e.g. Wikipedia) media type application/x-tex. Both types are plain text files.

Some of the AFs are currently listed in the EmbeddedFiles name tree, but we can (and also want to) produce files where none of the AFs are listed.

An example document is mathml-AF-ex1

Remark: the following quotes from ISO 19005-4 are from a draft and should be verified against the official version.

PDF/A-4 requirements

Question 1

6.9 Embedded files states:

All embedded files, as part of a file specification dictionary, shall conform with ISO 19005-1, ISO 19005-2 or this international standard.

Question 2

6.9 Embedded files continues with:

Each embedded file’s file specification dictionary shall contain [...] A Subtype key whose value is a valid IANA Media Type.

Table 43 — Entries in a file specification dictionary in ISO 32000-2:2020 does not list a Subtype in the file specification dictionary. The Subtype key is instead listed in Table 44 — Additional entries in an embedded file stream dictionary. This looks like an error in the spec.
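To make the mismatch concrete, here is a minimal sketch in Python that models the two dictionaries as plain dicts. The key names follow Tables 43 and 44 of ISO 32000-2; the file name, description, AFRelationship value, and the helper functions `pdf_name` and `find_subtype` are hypothetical and only for illustration. PDF name objects are represented as ordinary strings here.

```python
# Illustrative model only: plain dicts stand in for PDF objects.
# Key names follow ISO 32000-2 Tables 43 and 44; all values are made up.

def pdf_name(text: str) -> str:
    """Write a string as a PDF name, escaping delimiters, '#' and
    bytes outside 0x21-0x7E as #xx (so '/' becomes '#2F')."""
    must_escape = b"()<>[]{}/%#"
    out = []
    for byte in text.encode("utf-8"):
        if 0x21 <= byte <= 0x7E and byte not in must_escape:
            out.append(chr(byte))
        else:
            out.append(f"#{byte:02X}")
    return "/" + "".join(out)

# Embedded file stream dictionary (Table 44): this is where Subtype is defined.
embedded_file_stream = {
    "Type": "EmbeddedFile",
    "Subtype": pdf_name("application/mathml+xml"),
}

# File specification dictionary (Table 43): no Subtype entry is defined here.
file_spec = {
    "Type": "Filespec",
    "F": "formula-1.xml",            # hypothetical file name
    "UF": "formula-1.xml",
    "AFRelationship": "Supplement",  # hypothetical choice
    "Desc": "MathML representation of formula 1",
    "EF": {"F": embedded_file_stream},
}

def find_subtype(spec: dict):
    """Report where a Subtype entry is found: first in the file specification
    dictionary (the literal wording quoted from 6.9), then in the embedded
    file stream dictionary (Table 44)."""
    if "Subtype" in spec:
        return ("file specification dictionary", spec["Subtype"])
    stream = spec.get("EF", {}).get("F", {})
    if "Subtype" in stream:
        return ("embedded file stream dictionary", stream["Subtype"])
    return (None, None)

location, subtype = find_subtype(file_spec)
print(location, subtype)
```

A checker that takes the quoted wording of 6.9 literally would only look at the first location and report the Subtype as missing, even though the file follows ISO 32000-2.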

Question 3

Each embedded file’s file specification dictionary should contain the Desc key.

This relates to question 1: Does this apply to every embedded file, even to the ones not listed in the EmbeddedFiles name tree?
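The distinction behind this question can be sketched as follows. This is a simplified Python model, not a real PDF parser: dicts stand in for PDF objects, object identity stands in for indirect references, and the helper names `name_tree_values` and `split_associated_files` as well as the sample file names are assumptions made for illustration. It separates the file specifications referenced from a structure element's AF array into those that are also reachable through the EmbeddedFiles name tree and those that are not.

```python
# Simplified model: dicts stand in for PDF objects, and Python object
# identity stands in for "the same indirect object".

def name_tree_values(node: dict) -> list:
    """Collect all values of a name tree. A node may carry a Names array
    (key1, value1, key2, value2, ...) and/or a Kids array of child nodes."""
    values = list(node.get("Names", [])[1::2])  # every second item is a value
    for kid in node.get("Kids", []):
        values.extend(name_tree_values(kid))
    return values

def split_associated_files(catalog: dict, struct_elem: dict):
    """Split the element's AF entries into those listed in the catalog's
    EmbeddedFiles name tree and those that are not."""
    tree = catalog.get("Names", {}).get("EmbeddedFiles", {})
    listed_ids = {id(value) for value in name_tree_values(tree)}
    af = struct_elem.get("AF", [])
    in_tree = [spec for spec in af if id(spec) in listed_ids]
    not_in_tree = [spec for spec in af if id(spec) not in listed_ids]
    return in_tree, not_in_tree

# Hypothetical example: two AFs on a Formula element, only one in the name tree.
mathml_spec = {"Type": "Filespec", "F": "formula-1.xml"}
tex_spec = {"Type": "Filespec", "F": "formula-1.tex"}
catalog = {"Names": {"EmbeddedFiles": {"Names": ["formula-1.xml", mathml_spec]}}}
formula_elem = {"S": "Formula", "AF": [mathml_spec, tex_spec]}

in_tree, not_in_tree = split_associated_files(catalog, formula_elem)
print([s["F"] for s in in_tree], [s["F"] for s in not_in_tree])
```

Whether the requirements of 6.9 apply to the second group, the files reachable only via AF, is exactly what the question asks.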

PDF/A-4f

Because of this failure we tried validating against PDF/A-4f instead, and the document passed. But it is not clear whether this is actually the correct way to handle such files. The spec says:

A PDF/A-4f conforming file shall contain an EmbeddedFiles key in the name dictionary of the document catalog dictionary.

All file specification dictionaries present in the value of the EmbeddedFiles key shall conform with the requirements of 6.9, except that the embedded files may be of any type.

The exception "may be of any type" is rather vague. Does it refer only to the requirement for a registered media type mentioned in question 2 above, or does it also lift the requirement that the files shall conform with ISO 19005-1, ISO 19005-2 or this international standard?

Although embedded files that do not comply with any part of this document should not be rendered by a conforming PDF/A-4f processor, a conforming interactive PDF/A-4f processor should enable the extraction of any embedded file. The conforming interactive PDF/A-4f processor should also require an explicit user action to initiate the process.

What does that mean for AFs intended for accessibility support, like our MathML files? Would a reader have to ask the user before passing such a MathML file to AT software?

petervwyatt commented 2 months ago

See also PDF/A TWG Issue #40 - only visible to PDF Association Members who are members of the PDF/A TWG.

@u-fischer - I suggest you join the PDF/A TWG for this discussion...

petervwyatt commented 2 months ago

Note also that although this issue only mentions PDF/A-4, the same feature is in PDF/A-3, so some consistency would be expected between PDF 1.7 and PDF 2.0 PDF/A files. Parking this issue so it can be handled by the PDF/A TWG.