Different validation results with GreenField and PDFBox parser

tknall commented 2 years ago

Validating a PDF/A-2b compliant PDF document with an embedded CID TrueType font subset leads to different results, depending on the underlying parser engine:

veraPDF 1.21.159 (PDFBox): validationReport-veraPDF-1.21.159-PDFBox.xml
veraPDF 1.21.161 (GreenField): validationReport-veraPDF-1.21.161-GreenField.xml

While GreenField approves PDF/A-2b compliance (as do other validators like callas pdfaPilot / Adobe Acrobat preflight), the PDFBox instance fails validation with this error message: "A CID Font subset does not define CIDSet entry in its Descriptor dictionary"

When inspecting the demo file Hello_World_PDFA-2b.pdf we cannot reproduce the issue since the allegedly missing CIDSet entry is present:

[Image: Hello_World_PDFA-2b-structure]

P.S. Fun fact: the demo file has been created using PDFBox (2.0.25)

Which one is right? PDFBox or Greenfield?

bdoubrov commented 2 years ago

The issue stems from a difference between the internal font engines of veraPDF greenfield and PDFBox. In more detail, greenfield assumes that the glyph with GID=0 is always implicitly present in CID-based fonts, while PDFBox assumes that such a glyph does not exist.

To fix this issue we need to patch the internals of PDFBox. The behavior of veraPDF greenfield is correct.

bwegge commented 1 year ago

I have observed a similar difference between the two parsers when verifying the attached pdf file: pdf-a-unicode.pdf

With greenfield, veraPDF complains about missing glyph to unicode mappings (for the lower case \mu, which should be mapped to U+1D707 with recent newpx font packages), whereas the PDFBox version confirms compliance with pdf/a-2u. I am a bit unsure which version to trust (Who verifies the pdf verifier?), but I hope the PDFBox result is the correct one.

The source code for the attached pdf uses the newpx font which recently included unicode mappings:

\documentclass{scrartcl}
\usepackage{newpxtext,newpxmath}
\usepackage[a-2u]{pdfx}
\begin{document}
Some greek characters seem to miss unicode mappings: $\mu$  % <- verification fails; comment out to succeed
Works for others: $\Sigma$
\end{document}

bdoubrov commented 1 year ago

Hi @bwegge, thanks for reporting this issue. It is not a simple one, and this is why we have a difference between PDFBox and greenfield.

In short, there is a syntax error in the unicode mapping of the newpx font. This error is treated differently in PDFBox and greenfield parsers, which resulted in an extra validation error in the latter case.

In more detail, the unicode mapping in PDF fonts is defined via the so-called /ToUnicode entry, which defines how character codes from the PDF page description are mapped to Unicode. Here is the problematic ToUnicode map: ToUnicode.txt

In particular, this line:

<0a> <1a> <d835def9>

which says that single-byte character codes in the range from 0A to 1A in the PDF page content have to be mapped to consecutive Unicode characters, starting at UTF-16 "D835DEF9" (U+1D6F9) and incrementing the last byte for each subsequent code. This syntax is described in the PDF 1.7 spec (ISO 32000-1, clause 9.10.3). However, there is an additional format requirement that says:

When defining ranges of this type, the value of the last byte in the string shall be less than or equal to 255 − (srcCode2 − srcCode1). This ensures that the last byte of the string shall not be incremented past 255; otherwise, the result of mapping is undefined.

This requirement is clearly violated here: the last byte of the string is F9 (249), while the allowed maximum is 255 − (1A − 0A) = 239. Thus the unicode mapping becomes undefined.
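The check quoted above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not code from veraPDF or PDFBox; the function names are made up for this example, and the inputs are the hex strings from the bfrange line without angle brackets.

```python
def check_bfrange(src_lo: str, src_hi: str, dst: str):
    """Apply the ISO 32000-1 range rule to one bfrange line.

    Returns (is_valid, last_byte, limit), where limit is
    255 - (srcCode2 - srcCode1).
    """
    lo, hi = int(src_lo, 16), int(src_hi, 16)
    last_byte = int(dst[-2:], 16)  # last byte of the destination string
    limit = 255 - (hi - lo)
    return last_byte <= limit, last_byte, limit


def first_overflowing_code(src_lo: str, src_hi: str, dst: str):
    """Return the first source code whose increment pushes the last byte past 255."""
    lo, hi = int(src_lo, 16), int(src_hi, 16)
    last_byte = int(dst[-2:], 16)
    for code in range(lo, hi + 1):
        if last_byte + (code - lo) > 255:
            return code
    return None  # the whole range fits


def utf16be_hex_to_codepoint(hex_str: str) -> int:
    """Decode a UTF-16BE hex string holding one character to its code point."""
    return ord(bytes.fromhex(hex_str).decode("utf-16-be"))


valid, last_byte, limit = check_bfrange("0a", "1a", "d835def9")
print(valid, last_byte, limit)
print(hex(first_overflowing_code("0a", "1a", "d835def9")))
print(hex(utf16be_hex_to_codepoint("d835def9")))
```

For the line in question, the mapping is well defined only for codes 0A to 10; from code 11 onwards the last byte would have to pass 255. If the producer intended consecutive code points, code 18 (decimal 24) would correspond to U+1D707, the lower case \mu, which lies in the undefined part of the range.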

We have adjusted the Greenfield parser implementation so that it:

- reports this error in the ToUnicode mapping
- has the same further behavior as the PDFBox one.

As far as we can see, this is also how Adobe Acrobat handles this particular format error. So, as of the latest dev build of veraPDF, both PDFBox and Greenfield will report that there are no PDF/A issues found in your document. But Greenfield will additionally report the above error in the embedded ToUnicode mapping. This error is shown as a log message in the console and can also be optionally included in the validation report.

bwegge commented 1 year ago

Hi Boris, thanks a lot for your reply and for looking into the issue. Could the actual problem also be caused by the pdflatex compiler (or whatever tool assembles the CMap) if it (sometimes inadvertently) merges adjacent codes into ranges? Since the newpx font package in /usr/share/texlive/texmf-dist/fonts/type1/public/newpx/NewPXMI_gnu.pfb specifies the mappings individually on separate lines, it seems to be correct on their part (i.e., not causing some byte overflow):

dup 23/u1D706 put
dup 24/u1D707 put
dup 25/u1D708 put

(I am no expert and have no idea if the mappings in the produced pdf are actually taken from this file or another; it just seemed likely to pick the one in the type1 folder since I use the T1 option for inputenc.)

More specifically: In (do_)write_tounicode in https://github.com/TeX-Live/texlive-source/blob/4f771e41a6c3799e9d16e44633c7fa95dc41f1bc/texk/web2c/pdftexdir/tounicode.c#L382 as well as https://github.com/TeX-Live/texlive-source/blob/4f771e41a6c3799e9d16e44633c7fa95dc41f1bc/texk/web2c/luatexdir/font/tounicode.c#L394, it seems that ranges are identified with adjacent unicode codes, but I don't see any check for an overflow of the last unicode byte. Is it possible that the issue comes from this merging of adjacent codes without the check for the additional format requirement?
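To make the question concrete, here is a minimal Python sketch of how adjacent codes could be merged into ranges while respecting the last-byte limit. It is not the pdfTeX or LuaTeX code and does not claim to describe what those tools do; `build_bfranges` and `utf16be` are hypothetical names used only for this illustration.

```python
def utf16be(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-16BE (a surrogate pair above U+FFFF)."""
    return chr(cp).encode("utf-16-be")


def build_bfranges(mapping):
    """Merge adjacent entries of {char code: code point} into bfrange tuples.

    A code is appended to the current range only if it is adjacent in both
    code and code point AND incrementing the last byte of the range's first
    destination string stays within 255. Otherwise a new range is started.
    Returns a list of (src_lo, src_hi, dst_hex).
    """
    ranges = []  # each entry: (lo, hi, code point of lo)
    for code in sorted(mapping):
        cp = mapping[code]
        if ranges:
            lo, hi, start_cp = ranges[-1]
            adjacent = code == hi + 1 and cp == start_cp + (code - lo)
            fits = utf16be(start_cp)[-1] + (code - lo) <= 255
            if adjacent and fits:
                ranges[-1] = (lo, code, start_cp)
                continue
        ranges.append((code, code, cp))
    return [(lo, hi, utf16be(cp).hex()) for lo, hi, cp in ranges]


# Codes 0x0A..0x1A mapped to consecutive code points starting at U+1D6F9
mapping = {0x0A + i: 0x1D6F9 + i for i in range(17)}
for lo, hi, dst in build_bfranges(mapping):
    print(f"<{lo:02x}> <{hi:02x}> <{dst}>")
```

With the overflow check in place, the 17 codes are split into two ranges at the point where the last byte would wrap, instead of being emitted as the single range <0a> <1a> <d835def9>.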

bdoubrov commented 1 month ago

Support for the PDFBox version will stop after the next release, 1.28. It is strongly recommended to switch to the Greenfield version, which has continued long-term support.

veraPDF / veraPDF-library

Different validation results with GreenField and PDFBox parser #1253