veraPDF / veraPDF-library

Industry supported, open source PDF/A validation library
http://verapdf.org/software
GNU General Public License v3.0

PDF/UA-2 test issues #1413

Closed: faceless2 closed this issue 3 months ago

faceless2 commented 6 months ago

Some issues we've found which hopefully won't need much explanation, so I've lumped them into one.

And some specific issues

PDF_UA-2/8.2 Logical structure/8.2.2 Real content/8.2.2-t01-fail-a.pdf Object 12.0, the StructElem for the Document, doesn't have a "P" pointer back to StructTreeRoot (required in table 355)

PDF_UA-2/8.4 Text representation for content/8.4.3 Replacements and alternatives for text/8.4.3-t03-fail-a.pdf This has an invalid Alt tag on an MCID inside a Pattern. But MCIDs inside a pattern can never be made visible in the StructureTree, because of the requirement that every MCID has one parent - patterns may be reused everywhere. The same statement would apply to MCIDs in Type3 fonts and mask XObjects. Section 8.4 applies to "Text representation for content" - this isn't content.

PDF_UA-2/8.5 Real content without textual semantics/8.5.1 General/8.5.1-t01-fail-a.pdf This one has me puzzled. "line art content is not marked by a Figure" - the only vector operations I can see in there are the setting of the clip rectangle - definitely not line art, it doesn't mark the page - or the highlight in the highlight annotation. Is it the highlight? That's already tagged with /Annot, which is semantically appropriate. We even have note 5 under 8.2.2: ''Unlike PDF/UA-1, this document clearly specifies that the use of images or vector-based drawings does not always require a Figure structure element''.

PDF_UA-2/8.9 Annotations/8.9.2 Semantics and content/8.9.2.1 General/8.9.2.1-t01-fail-a.pdf Annotation object 3.0 has StructParent 3, but item 3 in the StructTreeRoot.ParentTree is not an OBJR referencing that annotation.
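For the last point, here is a rough sketch of the consistency check I have in mind. It is only an illustration: plain Python dicts stand in for PDF objects, the function name `annotation_is_linked` is made up, and none of this is veraPDF's API. It assumes the ParentTree entry for a StructParent key is the structure element whose kids contain the OBJR.

```python
# Hypothetical in-memory model: plain dicts stand in for PDF objects.
# Not veraPDF's API, only an illustration of the StructParent check.

def annotation_is_linked(annot, parent_tree):
    """Return True if the ParentTree entry for the annotation's StructParent
    is a structure element with an OBJR kid that points at this annotation."""
    key = annot.get("StructParent")
    if key is None:
        return False
    struct_elem = parent_tree.get(key)
    if struct_elem is None:
        return False
    for kid in struct_elem.get("K", []):
        # Integer kids are MCIDs; only dict kids can be object references.
        if isinstance(kid, dict) and kid.get("Type") == "OBJR" and kid.get("Obj") is annot:
            return True
    return False


annot = {"Subtype": "Highlight", "StructParent": 3}
good_elem = {"S": "Annot", "K": [{"Type": "OBJR", "Obj": annot}]}
bad_elem = {"S": "P", "K": [0]}

print(annotation_is_linked(annot, {3: good_elem}))  # True
print(annotation_is_linked(annot, {3: bad_elem}))   # False
```

The second call mirrors the situation in the test file: the key exists in the ParentTree, but nothing under it references the annotation.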

faceless2 commented 6 months ago

I have to follow myself up. Although the ISO14289-predis I have here still says this in 6.2

A file shall not contain any feature that is deprecated in ISO 32000-2

I recall that we agreed this wasn't the intention and that Info is allowed. And to quote an email exchange with Duff on this just yesterday.

The /Info IS “allowed”… “Deprecated” does not mean “not allowed”… see clause 3.15 in 32k-2.

So for the ModDate and CreationDate point above, it's our validator that needs to change, not your files.

DuffJohnson commented 6 months ago

I have to follow myself up. Although the ISO14289-predis I have here still says this in 6.2

A file shall not contain any feature that is deprecated in ISO 32000-2

I recall that we agreed this wasn't the intention and that Info is allowed.

Correct. We changed the standard to read:

— a file should not contain any feature that is deprecated in ISO 32000-2;

(emphasis added)

And to quote an email exchange with Duff on this just yesterday.

The /Info IS “allowed”… “Deprecated” does not mean “not allowed”… see clause 3.15 in 32k-2.

So for the ModDate and CreationDate point above, it's our validator that needs to change, not your files.

bdoubrov commented 6 months ago

Thanks, @faceless2 ! Most of these were fixed and already merged. Two remaining questions / comments:

_PDF_UA-2/8.4 Text representation for content/8.4.3 Replacements and alternatives for text/8.4.3-t03-fail-a.pdf This has an invalid Alt tag on an MCID inside a Pattern. But MCIDs inside a pattern can never be made visible in the StructureTree, because of the requirement that every MCID has one parent - patterns may be reused everywhere. The same statement would apply to MCIDs in Type3 fonts and mask XObjects. Section 8.4 applies to "Text representation for content" - this isn't content._

I don't see any Pattern objects in this test. Alt tag occurs in the /Form MCID inside annotation appearance. I'd say, PUA would not be allowed here.

_PDF_UA-2/8.5 Real content without textual semantics/8.5.1 General/8.5.1-t01-fail-a.pdf This one has me puzzled. "line art content is not marked by a Figure" - the only vector operations I can see in there are the setting of the clip rectangle - definitely not line art, it doesn't mark the page - or the highlight in the highlight annotation. Is it the highlight? That's already tagged with /Annot, which is semantically appropriate. We even have note 5 under 8.2.2: ''Unlike PDF/UA-1, this document clearly specifies that the use of images or vector-based drawings does not always require a Figure structure_

Indeed, we have taken 8.5.1 as a machine requirement: Any non-textual content shall be marked as a Figure or a Formula. But I see the discussion at PDF/UA TWG mailing list which tends to agree that this is an author's choice => human test. I'll wait till the next PDF/UA TWG call to reconfirm this.

faceless2 commented 6 months ago

(woops, I realise I should have filed this issue on veraPDF-corpus)

I don't see any Pattern objects in this test. Alt tag occurs in the /Form MCID inside annotation appearance. I'd say, PUA would not be allowed here.

Sorry, my error - yes it's an annotation. But actually the situation is almost the same.

That MCID is never referenced from the StructureTree - it's within an annotation, and annotations are effectively "black boxes" to the StructureTree - their content does not add nodes to the tree. This applies to any item with "StructParent" rather than "StructParents", like that annotation, because with StructParent it IS a content item - it doesn't CONTAIN content items.

Quoting part of table 359

An object may be either a content item in its entirety or a container for marked-content sequences that are content items, but not both

So as that MCID is not in a "container for marked content sequences" that is referenced from the Structure Tree, it doesn't count as content. This argument also applies to any XObject with a StructParent, rather than StructParents - it has to be this way, because such an XObject could be in the tree multiple times.
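To make the distinction concrete, here is a minimal sketch of how I read that rule. Dicts stand in for PDF objects again, and the names `content_role` and `mcids_count_as_content` are invented for illustration; this is my interpretation, not text from the spec or code from any validator.

```python
# Hypothetical model: a dict stands in for an annotation, XObject or page.

def content_role(obj):
    """Classify an object by which structural parent key it carries."""
    has_single = "StructParent" in obj    # the object itself is a content item
    has_multi = "StructParents" in obj    # the object contains content items
    if has_single and has_multi:
        return "invalid"                  # "but not both"
    if has_single:
        return "content-item"
    if has_multi:
        return "container"
    return "untagged"


def mcids_count_as_content(obj):
    """MCIDs inside an object are content items only if it is a container."""
    return content_role(obj) == "container"


annotation = {"Subtype": "Highlight", "StructParent": 3}
page = {"Type": "Page", "StructParents": 0}

print(content_role(annotation), mcids_count_as_content(annotation))
print(content_role(page), mcids_count_as_content(page))
```

Under this reading the annotation is a "content-item", so any MCID in its appearance stream is invisible to the structure tree, while MCIDs on the page are eligible.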

bdoubrov commented 6 months ago

@faceless2 yes, indeed I see additionally in PDF/UA-2, 8.9.2.1

ISO 32000-2 enables substructure within annotation appearance streams via marked content references. Files in conformity with this document shall not use marked content references to substructure annotation appearance streams (see ISO 32000-2:2020, Table 357). NOTE 4 The effect of the above clause is to require that annotations are included as whole objects in a single structure element.

However, I believe that if ActualText is specified on any marked content sequence, whether or not that sequence is included in the structure tree, it shall not use PUA as per 8.4.3. I'm less certain about the Alt entry in a similar case. So we'll modify the test to have PUA present in ActualText.

faceless2 commented 6 months ago

Well, I have to say I still disagree :-) Again, the spec text, this time from 8.4.3:

In all cases, where real content maps to Unicode PUA values, an ActualText or Alt entry shall be present.

These requirements only apply to "real content": they wouldn't apply to an Artifact, they also wouldn't apply to a pattern, the internals of a Type 3 font or an annotation. We can be certain they're not "real content" because if they were they would have to be reachable from the Structure Tree, and they're not.
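As a sketch of the condition as I read it: the three ranges below are the Unicode Private Use Areas, while the function names and the `is_real_content` flag are hypothetical and simply encode my interpretation of 8.4.3, not how any validator implements it.

```python
# Unicode Private Use Area ranges (BMP PUA plus the two supplementary planes).
PUA_RANGES = (
    (0xE000, 0xF8FF),
    (0xF0000, 0xFFFFD),
    (0x100000, 0x10FFFD),
)


def contains_pua(text):
    """Return True if any code point in text falls in a Private Use Area."""
    return any(lo <= ord(ch) <= hi for ch in text for lo, hi in PUA_RANGES)


def needs_replacement_text(unicode_mapping, is_real_content):
    """Interpretation argued here: only real content that maps to PUA
    values needs an ActualText or Alt entry."""
    return is_real_content and contains_pua(unicode_mapping)


print(needs_replacement_text("\ue000", True))   # True
print(needs_replacement_text("\ue000", False))  # False
print(needs_replacement_text("abc", True))      # False
```

The second call is the case in dispute: PUA text that is not real content, for example inside an annotation appearance, would not trigger the requirement under this reading.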

If I still haven't convinced you then I think we'll need to bounce this to the PDF/UA TWG

bdoubrov commented 6 months ago

I'll post a message to the PDF/UA TWG

bdoubrov commented 3 months ago

As per discussion at PDF/UA TWG, processing of Alt and ActualText properties of marked content sequences within annotation appearances, patterns and Type3 font glyphs doesn't make any sense and thus is disabled in veraPDF.

The corresponding test files are removed from the corpus to avoid confusion.

MaximPlusov commented 3 months ago

Included in release 1.26