veraPDF / veraPDF-library

Industry supported, open source PDF/A validation library
http://verapdf.org/software
GNU General Public License v3.0

Conformance to PDF/UA and PDF/A should not be mutually exclusive #1456

Closed moritzfl closed 2 months ago

moritzfl commented 2 months ago

The PDF Association states that PDF/A and PDF/UA are not mutually exclusive. https://pdfa.org/pdf-standards-are-not-mutually-exclusive/

And while it may seem weird to include a schema for PDF/UA conformance within a PDF/A document, PDF/A does allow external schemata as long as they are described in the XMP metadata, so this mechanism can be used to achieve both PDF/A and PDF/UA conformance.

This is relevant in practice as the European Accessibility Act dictates that every country in the EU must soon (next year) have solutions in place so that PDF/UA documents are provided for all customers that ask for accessible documents.

For archiving, it would be desirable to keep files that conform to PDF/UA while also ensuring PDF/A conformity.

An example of such a file, which also passes Adobe Acrobat's preflight checks, is the following: https://pdfa.org/wp-content/uploads/2018/03/Flyer-PDF-Association-PDFUA1-PDFA2u.pdf

It would be great if VeraPDF could cover this combination in the included checks accordingly. At the moment, such a file fails the VeraPDF check for PDF/A conformance.

bdoubrov commented 2 months ago

@moritzfl This seems to be a duplicate of #1414. Incidentally, we are currently testing a prototype implementation of this feature, and here is the report it generates for your PDF file. Is this what you would expect to have?

htmlReport.zip

moritzfl commented 2 months ago

This is somewhat similar to the issue that I opened here but not exactly the same.

#1414 covers the case where conformance to two specifications is checked because a file claims to be a PDF/A and a PDF/UA file at the same time.

This issue goes more towards the PDF/A and PDF/UA specifications themselves. In PDF/UA, XMP metadata does not have any real restriction. So the file I provided would pass a PDF/UA check (as also shown in the report that you posted).

However, the file does not pass a PDF/A conformance check as it includes metadata for PDF/UA that is not expected for PDF/A files. And while this generally seems to make sense, the file does include the schema description for PDF/UA metadata in the XMP and thus should conform to PDF/A.

In other words - it should be possible to have a file that conforms to both PDF/A and PDF/UA at the same time and passes the checks in VeraPDF accordingly.

bdoubrov commented 2 months ago

There is no conflict between PDF/A and PDF/UA. However, in order to comply with both, the document must have all custom XMP properties defined by an appropriate extension schema, as required by PDF/A. In your PDF example there are 4 properties (see the report) which are not part of the predefined schemas supported by PDF/A-2:

xmpTPg:HasVisibleTransparency
xmpTPg:HasVisibleOverprint
xmpTPg:SwatchGroups
illustrator:Type
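
To make the rule above concrete, here is a minimal sketch (not veraPDF's actual implementation) of the kind of check involved: every top-level XMP property must either belong to a predefined schema or be declared in a `pdfaExtension:schemas` block. The function names, the tiny `PREDEFINED` set and the sample packet are illustrative assumptions. The real predefined list for PDF/A-2 is much longer, and a real extension schema entry also needs fields such as prefix, value type, category and description, which are omitted here.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EXT = "http://www.aiim.org/pdfa/ns/extension/"
SCHEMA = "http://www.aiim.org/pdfa/ns/schema#"
PROP = "http://www.aiim.org/pdfa/ns/property#"
XMP_TPG = "http://ns.adobe.com/xap/1.0/t/pg/"

# Illustrative placeholder only, NOT the full PDF/A-2 predefined list.
PREDEFINED = {(XMP_TPG, "NPages")}

def declared_extension_properties(root):
    """Collect (namespaceURI, name) pairs declared in pdfaExtension:schemas."""
    declared = set()
    for schemas in root.iter(f"{{{EXT}}}schemas"):
        for item in schemas.iter(f"{{{RDF}}}li"):
            ns = item.findtext(f"{{{SCHEMA}}}namespaceURI")
            if ns is None:
                continue  # nested rdf:li describing a single property
            for name in item.iter(f"{{{PROP}}}name"):
                declared.add((ns.strip(), (name.text or "").strip()))
    return declared

def undeclared_properties(xmp_xml, predefined):
    """Return top-level properties that are neither predefined nor declared."""
    root = ET.fromstring(xmp_xml)
    declared = declared_extension_properties(root)
    problems = []
    for rdf in root.iter(f"{{{RDF}}}RDF"):
        for desc in rdf.findall(f"{{{RDF}}}Description"):
            # Properties may be written as child elements or as attributes.
            for tag in [child.tag for child in desc] + list(desc.attrib):
                if not tag.startswith("{"):
                    continue
                ns, name = tag[1:].split("}", 1)
                if ns in (RDF, EXT, SCHEMA, PROP):
                    continue  # rdf:about and the extension container itself
                if (ns, name) not in predefined and (ns, name) not in declared:
                    problems.append((ns, name))
    return problems

SAMPLE = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about=""
      xmlns:pdfuaid="http://www.aiim.org/pdfua/ns/id/"
      xmlns:xmpTPg="http://ns.adobe.com/xap/1.0/t/pg/"
      xmlns:pdfaExtension="http://www.aiim.org/pdfa/ns/extension/"
      xmlns:pdfaSchema="http://www.aiim.org/pdfa/ns/schema#"
      xmlns:pdfaProperty="http://www.aiim.org/pdfa/ns/property#">
   <pdfuaid:part>1</pdfuaid:part>
   <xmpTPg:NPages>2</xmpTPg:NPages>
   <xmpTPg:HasVisibleTransparency>True</xmpTPg:HasVisibleTransparency>
   <pdfaExtension:schemas>
    <rdf:Bag>
     <rdf:li rdf:parseType="Resource">
      <pdfaSchema:namespaceURI>http://www.aiim.org/pdfua/ns/id/</pdfaSchema:namespaceURI>
      <pdfaSchema:property>
       <rdf:Seq>
        <rdf:li rdf:parseType="Resource">
         <pdfaProperty:name>part</pdfaProperty:name>
        </rdf:li>
       </rdf:Seq>
      </pdfaSchema:property>
     </rdf:li>
    </rdf:Bag>
   </pdfaExtension:schemas>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>"""

if __name__ == "__main__":
    for ns, name in undeclared_properties(SAMPLE, PREDEFINED):
        print(f"not declared: {name} ({ns})")
```

In the sample, `pdfuaid:part` is accepted because its schema is declared, while `xmpTPg:HasVisibleTransparency` is reported because it is neither predefined nor declared. That matches the situation described for the flyer.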

moritzfl commented 2 months ago

Ok - my bad.

Fixed the issues in the file and it now passes the PDF/A and PDF/UA conformance checks.

It's unfortunate that the files provided through the PDF Association to showcase mutual conformity to PDF/A and PDF/UA have such issues ...

Here is the fixed file for reference: Flyer-PDF-Association-PDFUA1-PDFA2u_fixed.pdf

DuffJohnson commented 2 months ago

Thanks for highlighting (and fixing!) the problem with our file!

This isn't the latest version of the "PDF Association Flyer" - it's an older one. The current version is here. I hope that it doesn't also have this problem... but it might.

Where did you find this old file linked on pdfa.org? We'd like to ensure that no-one downloads it again!

moritzfl commented 2 months ago

I believe I downloaded the file from a link in this article (the file behind the "this document" link): https://pdfa.org/pdf-standards-are-not-mutually-exclusive/

I'll check the current version out on Monday and get back to you with the results.

DuffJohnson commented 2 months ago

Thank you! At the risk of pissing off the Time Lords I've now corrected the link on that page to point to the latest version of the flyer. I look forward to your assessment!

moritzfl commented 2 months ago

The current files still fail validation through VeraPDF.

I included a decoded version of the file for easier analysis (pdf-association-flyer-decoded.pdf, I still like using Ctrl+F in a text editor sometimes). All object references mentioned in the following text refer to the decoded file. However, only the object numbers are different; the structure is pretty much the same as in the original file.

Object 1031 references the metadata object 1273 which contains the issues mentioned in the report (the report is also included in the zip-file).

Object 1031 is referenced as part of the Properties-Dictionary (as MC0 entry) from object 994 which is the Resources dictionary of the first page in the document.

Navigating through the structure from the first page of the PDF-Association-flyer.pdf document in a similar manner, you will find the same issue there (in PDFBox Debugger, the path to the metadata object with the issue would be Root/Pages/Kids/[0]/Resources/Properties/MC0/Metadata).
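
For readers who want to follow that path programmatically, here is a minimal sketch of the traversal over a toy in-memory model of the object graph. It is not a PDF parser: indirect references are modelled as `("ref", n)` tuples, object numbers 994, 1031 and 1273 come from the decoded file described above, and the numbers 1, 2 and 3, the inline `Properties` dictionary and the `resolve` helper are assumptions made for illustration.

```python
def resolve(objects, trailer, path):
    """Follow a slash-separated path, dereferencing indirect references."""
    def deref(value):
        # An indirect reference is modelled as ("ref", object_number).
        while isinstance(value, tuple) and value[:1] == ("ref",):
            value = objects[value[1]]
        return value

    current = deref(trailer)
    for step in path.split("/"):
        if step.startswith("[") and step.endswith("]"):
            current = current[int(step[1:-1])]  # array index such as [0]
        else:
            current = current[step]             # dictionary key
        current = deref(current)
    return current

# Toy model: only 994, 1031 and 1273 are object numbers from the thread.
objects = {
    1: {"Type": "Catalog", "Pages": ("ref", 2)},
    2: {"Type": "Pages", "Kids": [("ref", 3)]},
    3: {"Type": "Page", "Resources": ("ref", 994)},
    994: {"Properties": {"MC0": ("ref", 1031)}},
    1031: {"Metadata": ("ref", 1273)},
    1273: {"Type": "Metadata", "Subtype": "XML"},
}
trailer = {"Root": ("ref", 1)}

if __name__ == "__main__":
    path = "Root/Pages/Kids/[0]/Resources/Properties/MC0/Metadata"
    print(resolve(objects, trailer, path))
```

The point of the sketch is that the offending XMP is not the document-level metadata stream but one attached to a property list under the page resources, so a check that only looks at `Root/Metadata` would miss it.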

PDF-Association-flyer-current-version.zip

Btw. the fixed version of the old flyer that I uploaded previously passed validation in VeraPDF but is not a clean example. I basically just threw an automated conversion at it (UPDF -> Save as PDF/A). This conversion seems to have removed the entire Properties dictionary and thus references from within the content stream to /MC0, /MC1 and /MC2 no longer work.
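
The dangling references mentioned above can be spotted with a rough check like the following. It is a sketch under simplifying assumptions: the content stream is already decoded, the marked-content tag `/OC` in the example is invented, and a regular expression stands in for a real content stream tokenizer (so it could be fooled by similar text inside string literals). `dangling_property_refs` is a hypothetical helper name.

```python
import re

# Matches "/Tag /Name BDC" or "/Tag /Name DP": marked-content operators whose
# second operand is a name to be looked up in the page's Resources/Properties.
# The inline-dictionary form ("/Tag << ... >> BDC") is deliberately not matched.
_PROPERTY_REF = re.compile(
    rb"/[^\s/\[\]<>()]+\s+/([^\s/\[\]<>()]+)\s+(?:BDC|DP)\b"
)

def dangling_property_refs(content_stream, properties):
    """Return property-list names used in the stream but absent from Properties."""
    used = {
        match.group(1).decode("latin-1")
        for match in _PROPERTY_REF.finditer(content_stream)
    }
    return sorted(used - set(properties))

if __name__ == "__main__":
    stream = (
        b"/OC /MC0 BDC\nq 1 0 0 1 0 0 cm Q\nEMC\n"
        b"/OC /MC1 BDC\nEMC\n"
        b"/Span <</MCID 0>> BDC\nEMC\n"
    )
    # With the Properties dictionary removed, every named reference dangles.
    print(dangling_property_refs(stream, {}))
    # With only MC0 present, MC1 is still unresolved.
    print(dangling_property_refs(stream, {"MC0": object()}))
```

A cleaner fix for the flyer would keep the `Properties` dictionary and either remove the `Metadata` entry from the property list or declare the offending properties in an extension schema.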