Closed by ronaldtse 1 month ago
Hi, I saw the Upwork job post for this task and am posting here just to get some clarification.
I have read through the docs a bit for the veraPDF Docker image. The profiles available (with the latest Docker image) that seem to match your interests are PDF/A-3U, PDF/A-3A, and PDF/A-3B. I figure PDF/A-3A most closely matches what you want, but I'm just making sure.
I am a little confused about the process, though. Am I uploading (via a Docker volume mount) any random PDF (that I own or find on the internet) straight into veraPDF and expecting to get the XML as the test result? Also, do you want the XML just logged to the console, or exported as a GitHub artifact that can be viewed separately from the GitHub Actions run? I can also send it to a REST endpoint, an S3 bucket, etc.
Hello
I'm assuming you want to validate some PDFs produced by mn2pdfTests.java.
I've added a skeleton here (see the very bottom of the file) https://github.com/alex-sc/mn2pdf/blob/main/.github/workflows/test.yml
# Generate test PDFs
- run: mvn test
- run: |
docker run -d -p 8080:8080 -v ./target:/home/folder verapdf/rest:latest
sleep 5
curl -F "url=file:///home/folder/G.191.pdf" localhost:8080/api/validate/url/A-3A
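A fixed `sleep 5` can be too short on a slow runner. As a minimal sketch (not part of the linked workflow), the wait could instead poll the REST service until it answers. The function name `wait_for_http` and the default timeouts are hypothetical, and any HTTP response, even an error status, is treated as "the server is up":

```python
import http.client
import time
import urllib.error
import urllib.request


def wait_for_http(url, timeout=30.0, interval=0.5, request_timeout=2.0):
    """Poll `url` until the server answers or `timeout` seconds pass.

    Returns True as soon as any HTTP response arrives (including 4xx/5xx),
    and False if nothing answered before the deadline.
    """
    # Ignore proxy settings from the environment: the service is local.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    deadline = time.monotonic() + timeout
    while True:
        try:
            with opener.open(url, timeout=request_timeout):
                return True
        except urllib.error.HTTPError:
            # The server replied with an error status, so it is running.
            return True
        except (OSError, http.client.HTTPException):
            # Connection refused, reset, or timed out: not ready yet.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

In a workflow step this might be called as `wait_for_http("http://localhost:8080/")` before the `curl` request, failing the step if it returns `False`.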
It looks like there's another solution: integrate the PDF validation directly into the Java tests by importing the veraPDF library into the project and doing the validation there. I didn't explore this approach further.
Test PDF sample: test_attachments.tc1.pdf
curl -F "file=@test_attachments.tc1.pdf" localhost:8080/api/validate/3a -H "Accept:application/xml" > res.xml
Report:
<?xml version='1.0' encoding='utf-8'?>
<report>
<buildInformation>
<releaseDetails id="core" version="1.26.1" buildDate="2024-05-16T16:30:00Z"/>
<releaseDetails id="verapdf-rest" version="1.26.1" buildDate="2024-05-24T15:12:00Z"/>
<releaseDetails id="validation-model" version="1.26.1" buildDate="2024-05-16T18:12:00Z"/>
</buildInformation>
<jobs>
<job>
<item size="73771">
<name>test_attachments.tc1.pdf</name>
</item>
<validationReport jobEndStatus="normal" profileName="PDF/A-3A validation profile" statement="PDF file is not compliant with Validation Profile requirements." isCompliant="false">
<details passedRules="147" failedRules="7" passedChecks="20339" failedChecks="103">
<rule specification="ISO 19005-3:2012" clause="6.6.2.3.1" testNumber="1" status="failed" failedChecks="1">
<description>All properties specified in XMP form shall use either the predefined schemas defined in the XMP Specification, ISO 19005-1 or this part of ISO 19005, or any extension schemas that comply with 6.6.2.3.2</description>
<object>XMPProperty</object>
<test>isPredefinedInXMP2005 == true || isDefinedInMainPackage == true || isDefinedInCurrentPackage == true</test>
<check status="failed">
<context>root/document[0]/metadata[0](8 0 obj PDMetadata)/XMPPackage[0]/Properties[9](http://www.aiim.org/pdfua/ns/id/ - pdfuaid:part)</context>
</check>
</rule>
<rule specification="ISO 19005-3:2012" clause="6.2.4.3" testNumber="4" status="failed" failedChecks="86">
<description>DeviceGray shall only be used if a device independent DefaultGray colour space has been set when the DeviceGray colour space is used, or if a PDF/A OutputIntent is present</description>
<object>PDDeviceGray</object>
<test>gOutputCS != null</test>
<check status="failed">
<context>root/document[0]/pages[0](26 0 obj PDPage)/contentStream[0](24 0 obj PDContentStream)/operators[22]/colorSpace[0]</context>
</check>
</rule>
<rule specification="ISO 19005-3:2012" clause="6.5.1" testNumber="1" status="failed" failedChecks="1">
<description>The Launch, Sound, Movie, ResetForm, ImportData, Hide, SetOCGState, Rendition, Trans, GoTo3DView and JavaScript actions shall not be permitted. Additionally, the deprecated set-state and no-op actions shall not be permitted</description>
<object>PDAction</object>
<test>S == "GoTo" || S == "GoToR" || S == "GoToE" || S == "Thread" || S == "URI" || S == "Named" || S == "SubmitForm"</test>
<check status="failed">
<context>root/document[0]/pages[4](63 0 obj PDPage)/annots[0](65 0 obj PDLinkAnnot)/A[0](64 0 obj PDAction)</context>
</check>
</rule>
<rule specification="ISO 19005-3:2012" clause="6.2.11.4.2" testNumber="2" status="failed" failedChecks="4">
<description>If the FontDescriptor dictionary of an embedded CID font contains a CIDSet stream, then it shall identify all CIDs which are present in the font program, regardless of whether a CID in the font is referenced or used by the PDF or not</description>
<object>PDCIDFont</object>
<test>fontFile_size == 0 || fontName.search(/[A-Z]{6}\+/) != 0 || containsCIDSet == false || cidSetListsAllGlyphs == true</test>
<check status="failed">
<context>root/document[0]/pages[0](26 0 obj PDPage)/contentStream[0](24 0 obj PDContentStream)/operators[254]/font[0](EAAAAB+Inter-Bold)/DescendantFonts[0](EAAAAB+Inter-Bold)</context>
</check>
</rule>
<rule specification="ISO 19005-3:2012" clause="6.6.2.3.1" testNumber="2" status="failed" failedChecks="6">
<description>All properties specified in XMP form shall use either the predefined schemas defined in the XMP Specification, ISO 19005-1 or this part of ISO 19005, or any extension schemas that comply with 6.6.2.3.2</description>
<object>XMPProperty</object>
<test>isValueTypeCorrect == true</test>
<check status="failed">
<context>root/document[0]/metadata[0](8 0 obj PDMetadata)/XMPPackage[0]/Properties[4](http://purl.org/dc/elements/1.1/ - dc:title)</context>
</check>
</rule>
<rule specification="ISO 19005-3:2012" clause="6.3.2" testNumber="1" status="failed" failedChecks="4">
<description>Except for annotation dictionaries whose Subtype value is Popup, all annotation dictionaries shall contain the F key</description>
<object>PDAnnot</object>
<test>Subtype == "Popup" || F != null</test>
<check status="failed">
<context>root/document[0]/pages[1](48 0 obj PDPage)/annots[0](33 0 obj PDLinkAnnot)</context>
</check>
</rule>
<rule specification="ISO 19005-3:2012" clause="6.6.4" testNumber="1" status="failed" failedChecks="1">
<description>The PDF/A version and conformance level of a file shall be specified using the PDF/A Identification extension schema</description>
<object>MainXMPPackage</object>
<test>Identification_size == 1</test>
<check status="failed">
<context>root/document[0]/metadata[0](8 0 obj PDMetadata)/XMPPackage[0]</context>
</check>
</rule>
</details>
</validationReport>
<duration start="1725217823013" finish="1725217823924">00:00:00.911</duration>
</job>
</jobs>
<batchSummary totalJobs="1" failedToParse="0" encrypted="0" outOfMemory="0" veraExceptions="0">
<validationReports compliant="0" nonCompliant="1" failedJobs="0">1</validationReports>
<featureReports failedJobs="0">0</featureReports>
<repairReports failedJobs="0">0</repairReports>
<duration start="1725217822998" finish="1725217823946">00:00:00.948</duration>
</batchSummary>
</report>
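A report like the one above can be reduced to a short list of failed rules. The following is a minimal sketch using only the Python standard library; the function name `summarize_report` and the dictionary keys are hypothetical, and the element and attribute names are taken from the report shown here, not from a published schema:

```python
import xml.etree.ElementTree as ET


def summarize_report(xml_data):
    """Return (is_compliant, failed_rules) for the first validationReport.

    `xml_data` may be str or bytes. `failed_rules` is a list of dicts,
    one per <rule status="failed"> element.
    """
    root = ET.fromstring(xml_data)
    report = root.find("./jobs/job/validationReport")
    if report is None:
        raise ValueError("no validationReport element found")

    is_compliant = report.get("isCompliant") == "true"
    failed_rules = []
    for rule in report.findall("./details/rule"):
        if rule.get("status") != "failed":
            continue
        failed_rules.append({
            "clause": rule.get("clause"),
            "test_number": int(rule.get("testNumber")),
            "failed_checks": int(rule.get("failedChecks")),
            "description": (rule.findtext("description") or "").strip(),
        })
    return is_compliant, failed_rules
```

Assuming the report was saved as `res.xml`, it could be loaded with `summarize_report(open("res.xml", "rb").read())` and the resulting list printed one clause per line in the CI log.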
VeraPDF does have a full CLI interface too - see the CLI doco if that is easier. You'll need Java.
Note that veraPDF messages can also be a bit technical... and this may be some of what FOP corrects/adds when you enable PDF/A mode (e.g. hopefully it will see the use of DeviceGray and then add an Output Intent profile for you). And the metadata issues should obviously get corrected too.
To answer @FullStackIndie's questions:
I have read through the docs a bit for the veraPDF Docker image. The profiles available (with the latest Docker image) that seem to match your interests are PDF/A-3U, PDF/A-3A, and PDF/A-3B. I figure PDF/A-3A most closely matches what you want, but I'm just making sure.
Those with disabilities or needing to use assistive technologies (screen readers, screen magnifiers, etc) require that the PDFs generated are Tagged PDF. This means that also making it PDF/A will exceed PDF/A-3B (B = "basic") so don't even bother with that setting. The choice is then PDF/A-3u ("Unicode") or PDF/A-3a ("accessible"). PDF/A-3a is by far the better choice since it preserves the document’s logical structure and content text stream in reading order which is also what PDF/UA and general accessibility require. So please strive for PDF/A-3a.
I am a little confused about the process, though. Am I uploading (via a Docker volume mount) any random PDF (that I own or find on the internet) straight into veraPDF and expecting to get the XML as the test result? Also, do you want the XML just logged to the console, or exported as a GitHub artifact that can be viewed separately from the GitHub Actions run? I can also send it to a REST endpoint, an S3 bucket, etc.
Yes, veraPDF can check any random PDF but will subsequently generate error messages about missing metadata, since all PDF subsets define their conformance via their metadata. veraPDF's default behaviour ("Auto") is to check the metadata and then check whatever conformance level it finds there (see also this veraPDF issue to support multiple conformance levels). In the case of a random PDF, there will be no conformance-level info in the XMP metadata so you'll need to manually set which PDF-flavour you want and expect errors about missing metadata - but any other failures reported are valid.
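To illustrate what "conformance via metadata" means in practice, here is a rough sketch that looks for the PDF/A identification properties (`pdfaid:part` and `pdfaid:conformance`) in a file's raw bytes. It is only a quick pre-check under a loud assumption: the XMP packet is stored uncompressed in the file. It does not parse the PDF and is no substitute for veraPDF's own detection. The function name `declared_pdfa_flavour` is hypothetical:

```python
import re

# Match both XMP serialisations: element form (<pdfaid:part>3</pdfaid:part>)
# and attribute form (pdfaid:part="3").
_PART = re.compile(rb'pdfaid:part\s*(?:=\s*["\x27]|>)\s*(\d)')
_CONF = re.compile(rb'pdfaid:conformance\s*(?:=\s*["\x27]|>)\s*([A-Za-z])')


def declared_pdfa_flavour(pdf_bytes):
    """Return a flavour id such as "3a", or None if no PDF/A part is declared.

    Assumes the XMP metadata is present as plain text in `pdf_bytes`.
    """
    part = _PART.search(pdf_bytes)
    if part is None:
        return None  # a "random" PDF: the flavour must be chosen manually
    conf = _CONF.search(pdf_bytes)
    level = conf.group(1).decode("ascii").lower() if conf else ""
    return part.group(1).decode("ascii") + level
```

When this returns `None`, the file declares no PDF/A flavour, so the validation profile has to be set explicitly (for example `3a`, as in the REST call earlier in this thread) and a failure about the missing identification schema is to be expected.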
As mentioned above, veraPDF also has a comprehensive CLI if that is easier than a Docker container. It needs Java.
@ronaldtse do we need to use the veraPDF Docker container, or would it be better to integrate veraPDF into mn2pdf?
I've tried integrating veraPDF directly into the mn2pdf application (not released yet).
I am studying how to convert the check result from:
ValidationResult [flavour=3a, totalAssertions=20438, assertions=[
TestAssertion [ruleId=RuleId [specification=ISO 19005-3:2012, clause=6.5.1, testNumber=1], status=failed, message=The Launch, Sound, Movie, ResetForm, ImportData, Hide, SetOCGState, Rendition, Trans, GoTo3DView and JavaScript actions shall not be permitted. Additionally, the deprecated set-state and no-op actions shall not be permitted, location=Location [level=CosDocument, context=root/document[0]/pages[4](64 0 obj PDPage)/annots[0](66 0 obj PDLinkAnnot)/A[0](65 0 obj PDAction)], locationContext=null, errorMessage=null],
TestAssertion [ruleId=RuleId [specification=ISO 19005-3:2012, clause=6.6.2.3.1, testNumber=2], status=failed, message=All properties specified in XMP form shall use either the predefined schemas defined in the XMP Specification, ISO 19005-1 or this part of ISO 19005, or any extension schemas that comply with 6.6.2.3.2, location=Location [level=CosDocument, context=root/document[0]/metadata[0](9 0 obj PDMetadata)/XMPPackage[0]/Properties[4](http://purl.org/dc/elements/1.1/ - dc:title)], locationContext=null, errorMessage=null],
TestAssertion [ruleId=RuleId [specification=ISO 19005-3:2012, clause=6.6.2.3.1, testNumber=2], status=failed, message=All properties specified in XMP form shall use either the predefined schemas defined in the XMP Specification, ISO 19005-1 or this part of ISO 19005, or any extension schemas that comply with 6.6.2.3.2, location=Location [level=CosDocument, context=root/document[0]/metadata[0](9 0 obj PDMetadata)/XMPPackage[0]/Properties[6](http://purl.org/dc/elements/1.1/ - dc:description)], locationContext=null, errorMessage=null],
TestAssertion [ruleId=RuleId [specification=ISO 19005-3:2012, clause=6.6.2.3.1, testNumber=2], status=failed, message=All properties specified in XMP form shall use either the predefined schemas defined in the XMP Specification, ISO 19005-1 or this part of ISO 19005, or any extension schemas that comply with 6.6.2.3.2, location=Location [level=CosDocument, context=root/document[0]/metadata[0](9 0 obj PDMetadata)/XMPPackage[0]/Properties[7](http://purl.org/dc/elements/1.1/ - dc:creator)], locationContext=null, errorMessage=null],
TestAssertion [ruleId=RuleId [specification=ISO 19005-3:2012, clause=6.6.2.3.1, testNumber=2], status=failed, message=All properties specified in XMP form shall use either the predefined schemas defined in the XMP Specification, ISO 19005-1 or this part of ISO 19005, or any extension schemas that comply with 6.6.2.3.2, location=Location [level=CosDocument, context=root/document[0]/metadata[0](9 0 obj PDMetadata)/XMPPackage[0]/Properties[11](http://www.aiim.org/pdfua/ns/id/ - pdfuaid:part)], locationContext=null, errorMessage=null],
TestAssertion [ruleId=RuleId [specification=ISO 19005-3:2012, clause=6.6.2.3.1, testNumber=1], status=failed, message=All properties specified in XMP form shall use either the predefined schemas defined in the XMP Specification, ISO 19005-1 or this part of ISO 19005, or any extension schemas that comply with 6.6.2.3.2, location=Location [level=CosDocument, context=root/document[0]/metadata[0](9 0 obj PDMetadata)/XMPPackage[0]/Properties[11](http://www.aiim.org/pdfua/ns/id/ - pdfuaid:part)], locationContext=null, errorMessage=null]],
isCompliant=false]
into a more convenient format such as HTML.
The veraPDF GUI application (https://docs.verapdf.org/gui/) has an HTML output feature (implemented as an XML-to-HTML conversion with XSLT). I'll investigate how to integrate it into mn2pdf.
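For comparison, here is a minimal sketch of turning an XML report into an HTML table. It deliberately does not use the XSLT approach mentioned for the veraPDF GUI; it walks the XML directly with the Python standard library. The function name `report_to_html` and the column choice are hypothetical, and the element and attribute names follow the XML report posted earlier in this thread:

```python
import html
import xml.etree.ElementTree as ET

_COLUMNS = ("Specification", "Clause", "Test", "Failed checks", "Description")


def report_to_html(xml_data):
    """Render every <rule> element of a veraPDF XML report as one table row."""
    root = ET.fromstring(xml_data)
    header = "<tr>" + "".join(f"<th>{c}</th>" for c in _COLUMNS) + "</tr>"
    rows = [header]
    for rule in root.iter("rule"):
        cells = (
            rule.get("specification", ""),
            rule.get("clause", ""),
            rule.get("testNumber", ""),
            rule.get("failedChecks", ""),
            (rule.findtext("description") or "").strip(),
        )
        # Escape every value so report text cannot break the markup.
        rows.append(
            "<tr>" + "".join(f"<td>{html.escape(c)}</td>" for c in cells) + "</tr>"
        )
    return "<table>\n" + "\n".join(rows) + "\n</table>"
```

The returned string is only a table fragment; a real report page would still need a surrounding HTML document and, ideally, the failing object contexts as well.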
@Intelligent2013 isn't it easiest to keep it just as a verification step using the Docker container? It doesn't need to be part of mn2pdf? Or do you prefer integrating it into the mn2pdf local test flow?
@Intelligent2013 - if you use veraPDF CLI then you can explicitly set the output format you want to be text, json, raw (i.e. xml) or html. The Docker container is relatively new for veraPDF so I will ask if there is a way to set the CLI via Docker...
@Intelligent2013 isn't it easiest to keep it just as a verification step using the Docker container?
@ronaldtse questions:
veraPDF-library contains such XSLT).
@Intelligent2013 I think we just want to ensure that the PDF outputs we have comply with PDF/A-3a. You will be the main person looking at it.
I believe a separate GHA workflow that shows individual validation failures in the GHA output would work well.
Using the docker container is preferred because we don't need a local workflow. The output can also contain HTML if it helps you.
If the verification step fails for mn2pdf, the build should be marked "failed". We should generate a set of Metanorma-sourced PDF files to test mn2pdf with. Thoughts welcome!
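One way to mark the build as failed is to derive the step's exit code from the XML report. The sketch below assumes the report layout shown earlier in this thread (`jobs/job/validationReport` with an `isCompliant` attribute); the function names are hypothetical, and a job without a `validationReport` element is counted as a failure:

```python
import xml.etree.ElementTree as ET


def count_failed_jobs(xml_data):
    """Count jobs that are non-compliant or have no validation report."""
    root = ET.fromstring(xml_data)
    failed = 0
    for job in root.findall("./jobs/job"):
        report = job.find("validationReport")
        if report is None or report.get("isCompliant") != "true":
            failed += 1
    return failed


def exit_code_for(xml_data):
    """Return 1 if any job failed validation, otherwise 0."""
    return 1 if count_failed_jobs(xml_data) > 0 else 0
```

A workflow step could read the saved report and call `sys.exit(exit_code_for(data))`, so that GitHub Actions marks the job as failed whenever a PDF is not compliant.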
Workflow for PDF checking by veraPDF added in https://github.com/metanorma/mn-native-pdf/pull/743.
How verapdfcheck.yml works:
(... fountainhead/action-wait-for-check@v1.2.0)
(... dawidd6/action-download-artifact@v6)
Example output: https://github.com/metanorma/mn-native-pdf/actions/runs/10689964432/job/29633282268?pr=743
If the verification step fails, should we stop any further actions, or just put the report next to the PDF and continue with the further actions?
I wouldn't treat veraPDF failures as a complete failure and stop, so I suggest saving the report(s) and continuing. I'm also unsure whether FOP will fail to produce a PDF, for example if it detects an issue when attempting to create PDF/A or PDF/UA. I'm also not sure how much Apache FOP will do automatically vs. needing the author to correct their AsciiDoc content. For Tagged PDF and PDF/UA, it is highly likely the author will need to do something anyway (e.g. fix alt-text, ensure tables are regular, change colors to have better contrast, etc.), but for PDF/A they still might need to do something...
I've passed several questions on to the veraPDF team and invited them to contribute to this discussion. @bdoubrov
PDF/A-3 checking using the veraPDF Docker container integrated into the repository mn-native-pdf.
From @petervwyatt at:
Will be used by:
264
We want to create a GHA workflow that uses the veraPDF container.
Quoted:
To validate your local files you need to add folder with files to the docker container. To run the veraPDF rest image with your local files run docker image with bind mount -v /local/path/of/the/folder:/home/folder. For example, to run the veraPDF rest image from DockerHub with your local files:
To obtain XML: