openpreserve / jhove

File validation and characterisation.
http://jhove.openpreservation.org

PDF-hul: various issues with parsing PDFs #927

Open petervwyatt opened 4 months ago

petervwyatt commented 4 months ago

Some issues noted about parsing PDFs:

bitsgalore commented 1 week ago

Possibly related - this blog post describes an issue where JHOVE's parsing goes wrong for some PDFs with mixed Unicode/octal encoded text strings (post links to example file).

petervwyatt commented 1 week ago

I'd like to point out that the blog post mentioned is incorrect:

The PDF Reference defines: "For text strings encoded in Unicode, the first two bytes must be 254 followed by 255, representing the Unicode byte order marker, U+FEFF". A dual encoding of "\376\377" violates this, since the BOM character is no longer fitted in two bytes. Octal representation is intended to be used in the character encoding format defined in the PDF Reference (PDFDocEncoding). This should not be mixed with Unicode encoding. PDFDocEncoding is a superset of Latin-1, where "\376\377" resolves to "þÿ", not BOM.

\376\377 is the fully correct and valid UTF-16BE BoM for PDF strings when using octal. There is NOTHING WRONG with this. These are the first 2 bytes representing 254 and 255 as required by the PDF spec.
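For reference, the octal arithmetic can be checked with a minimal Python sketch (the decoder below is ours, not JHOVE's, and it handles only the \ddd escapes relevant here):

```python
import re

def decode_pdf_octal(raw: bytes) -> bytes:
    """Resolve \\ddd octal escapes in a PDF literal string to raw bytes.
    Minimal sketch: handles only \\ddd, not the other escape sequences."""
    return re.sub(rb"\\([0-7]{1,3})",
                  lambda m: bytes([int(m.group(1), 8) & 0xFF]),
                  raw)

# "\376\377" resolves to bytes 254 and 255 (0xFE 0xFF), the UTF-16BE BOM,
# exactly as the spec requires for Unicode text strings.
decoded = decode_pdf_octal(rb"\376\377\000P\000D\000F")
assert decoded[:2] == b"\xfe\xff"
assert decoded[2:].decode("utf-16-be") == "PDF"
```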

I would also very strongly suggest NOT referring to such an old PDF 1.4 specification! Please use both the ISO 32000 specifications, and especially checking for any clarified vendor-neutral wording in ISO 32000-2:2020.

bitsgalore commented 1 week ago

@petervwyatt Yep, I already suspected this (but it's good to have this confirmed). Which actually makes JHOVE's behavior here even worse, because its inability to parse a perfectly valid file indirectly leads to the reporting of a completely unrelated validation error.

jmlehton commented 6 days ago

We are the authors of the referenced blog post (a ghost story, published for Halloween). We will update it a little, based on this discussion, with a changelog.

First of all, we are grateful for the discussion about our blog post. We’d like to clarify our views on the matter here below.

From the point of view of digital preservation, we still wouldn't recommend stacking multiple (unnecessary) encodings, such as UTF-16BE combined with octal escaping; it may not be a sensible practice. In the long term, each encoding layer raises the risk of future problems. JHOVE is probably not the only software confused by multilayered encoding of metadata. In future migrations, these kinds of issues will need to be identified and somehow handled with whatever software support is available then, unless we handle them today when we encounter them. This is actually the main point of our blog post: should JHOVE give an info message about a string starting with "\376\377", which most likely has a multilayered encoding, instead of just skipping it?
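The info-message idea could be as simple as the following sketch (a hypothetical helper of ours, not an actual JHOVE API), flagging literal strings whose UTF-16BE BOM is written as octal escapes rather than raw bytes:

```python
def has_octal_escaped_bom(raw: bytes) -> bool:
    """True if a PDF literal string begins with the octal-escaped
    UTF-16BE byte order marker (\\376\\377) rather than raw 0xFE 0xFF.
    Hypothetical check, not JHOVE code."""
    return raw.startswith(rb"\376\377")

assert has_octal_escaped_bom(rb"\376\377\000B\000o\000o")     # octal BOM
assert not has_octal_escaped_bom(b"\xfe\xff\x00B\x00o\x00o")  # raw BOM bytes
```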

About the PDF Reference:

We’ll update the quote in the blog to the current revision (ISO 32000-2:2020, ch. 7.9.2.2.1). That paragraph still describes 254 and 255 as the first two bytes of a text string, so in our case there is not much difference from the previous wording, although we admit that it does not explicitly forbid re-encoding into a multilayered encoding.

ISO 32000-2:2020 is unclear regarding octal codes. In ch. 7.3.4.2, "\ddd" is described as "character code ddd". This may be read as meaning that a "character code" should resolve to a character when decoded (from some code page), which is confusing in combination with UTF-16BE, whose characters occupy 2-4 bytes. Instead of the term "character code", we would use e.g. "code of a character byte" (i.e. it may also be part of a character) or, more broadly, "(octal) code of a byte". On the other hand, ISO 32000-2:2020 also states in ch. 7.3.4.2: "However, any 8-bit value may appear in a string, represented either as itself or with the \ddd notation described." We feel that a user of the standard can get confused and has to interpret whether the "\ddd notation described" refers to the coding of a character or of a byte (for each \ddd).

We have certainly learned more about the PDF file format thanks to the discussion here.

bitsgalore commented 5 days ago

JHOVE probably is not the only software to get confused by multilayered encoding of metadata. In future migrations, these kinds of issues need to be identified and somehow handled using the software support available then.

Out of curiosity I did a little test using some of my favourite PDF mangling tools and libraries. First I created a modified version of the PDF in a hex editor, changing the value of the XMP Producer field to "OPF Phantom". This way we can easily see which field(s) each tool actually reports.

Below are the commands and results for all tools/libraries.

ExifTool

exiftool -X phantom_modified_xmp.pdf

Result:

<?xml version='1.0' encoding='UTF-8'?>
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>

<rdf:Description rdf:about='phantom_modified_xmp.pdf'
  xmlns:et='http://ns.exiftool.org/1.0/' et:toolkit='Image::ExifTool 12.60'
  xmlns:ExifTool='http://ns.exiftool.org/ExifTool/1.0/'
  xmlns:System='http://ns.exiftool.org/File/System/1.0/'
  xmlns:File='http://ns.exiftool.org/File/1.0/'
  xmlns:PDF='http://ns.exiftool.org/PDF/PDF/1.0/'
  xmlns:XMP-x='http://ns.exiftool.org/XMP/XMP-x/1.0/'
  xmlns:XMP-pdf='http://ns.exiftool.org/XMP/XMP-pdf/1.0/'>
 <ExifTool:ExifToolVersion>12.60</ExifTool:ExifToolVersion>
 <System:FileName>phantom_modified_xmp.pdf</System:FileName>
 <System:Directory>.</System:Directory>
 <System:FileSize>5.9 kB</System:FileSize>
 <System:FileModifyDate>2024:11:09 00:22:05+00:00</System:FileModifyDate>
 <System:FileAccessDate>2024:11:09 00:22:46+00:00</System:FileAccessDate>
 <System:FileInodeChangeDate>2024:11:09 00:22:05+00:00</System:FileInodeChangeDate>
 <System:FilePermissions>-rw-rw-r--</System:FilePermissions>
 <File:FileType>PDF</File:FileType>
 <File:FileTypeExtension>pdf</File:FileTypeExtension>
 <File:MIMEType>application/pdf</File:MIMEType>
 <PDF:PDFVersion>1.4</PDF:PDFVersion>
 <PDF:Linearized>No</PDF:Linearized>
 <PDF:PageCount>1</PDF:PageCount>
 <PDF:Title>Boo</PDF:Title>
 <PDF:CreateDate>2024:10:29 13:43:30Z</PDF:CreateDate>
 <PDF:Producer>PDF Phantom</PDF:Producer>
 <XMP-x:XMPToolkit>Image::ExifTool 12.71</XMP-x:XMPToolkit>
 <XMP-pdf:Producer>OPF Phantom</XMP-pdf:Producer>
</rdf:Description>
</rdf:RDF>

ExifTool correctly decodes the octal escape sequences (PDF:Producer), and also extracts the XMP value (XMP-pdf:Producer).

Pdfcpu

pdfcpu info phantom_modified_xmp.pdf

Result:

         PDF version: 1.4
          Page count: 1
           Page size: 595.28 x 841.89 points
............................................
               Title: Boo
              Author: 
             Subject: 
        PDF Producer: PDF Phantom
     Content creator: 
       Creation date: D:20241029134330Z00'00'
   Modification date: 
............................................
              Tagged: No
              Hybrid: No
          Linearized: No
  Using XRef streams: No
Using object streams: No
         Watermarked: No
............................................
           Encrypted: No
         Permissions: Full access

Pdfcpu correctly decodes the octal escape sequences (PDF Producer).

pdfinfo (Poppler)

pdfinfo phantom_modified_xmp.pdf

Result:

Title:          Boo
Producer:       PDF Phantom
CreationDate:   Tue Oct 29 14:43:30 2024 CET
Tagged:         no
UserProperties: no
Suspects:       no
Form:           none
JavaScript:     no
Pages:          1
Encrypted:      no
Page size:      595.276 x 841.89 pts (A4)
Page rot:       0
File size:      5906 bytes
Optimized:      no
PDF version:    1.4

Poppler correctly decodes the octal escape sequences (Producer).

VeraPDF

verapdf --off --extract phantom_modified_xmp.pdf

Result includes:

<informationDict>
  <entry key="Title">Boo</entry>
  <entry key="Producer">PDF Phantom#x000000</entry>
  <entry key="CreationDate">2024-10-29T13:43:30.000Z</entry>
</informationDict>

VeraPDF does decode the octal escape sequences, but shows a null character at the end (edit: this is actually part of the string!).

Apache Tika

java -jar ~/tika/tika-app-2.9.2.jar phantom_modified_xmp.pdf

Result:

<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="pdf:PDFVersion" content="1.4"/>
<meta name="pdf:docinfo:title" content="Boo"/>
<meta name="pdf:hasXFA" content="false"/>
<meta name="access_permission:modify_annotations" content="true"/>
<meta name="access_permission:can_print_degraded" content="true"/>
<meta name="dcterms:created" content="2024-10-29T13:43:30Z"/>
<meta name="dc:format" content="application/pdf; version=1.4"/>
<meta name="access_permission:fill_in_form" content="true"/>
<meta name="pdf:hasCollection" content="false"/>
<meta name="pdf:encrypted" content="false"/>
<meta name="dc:title" content="Boo"/>
<meta name="Content-Length" content="5906"/>
<meta name="pdf:hasMarkedContent" content="false"/>
<meta name="Content-Type" content="application/pdf"/>
<meta name="pdf:producer" content="OPF Phantom"/>
<meta name="access_permission:extract_for_accessibility" content="true"/>
<meta name="access_permission:assemble_document" content="true"/>
<meta name="xmpTPg:NPages" content="1"/>
<meta name="resourceName" content="phantom_modified_xmp.pdf"/>
<meta name="pdf:hasXMP" content="true"/>
<meta name="access_permission:extract_content" content="true"/>
<meta name="access_permission:can_print" content="true"/>
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.pdf.PDFParser"/>
<meta name="access_permission:can_modify" content="true"/>
<meta name="pdf:docinfo:producer" content="PDF Phantom�"/>
<meta name="pdf:docinfo:created" content="2024-10-29T13:43:30Z"/>
<title>Boo</title>
</head>
<body><div class="page"><p/>
</div>
</body></html>

Tika reports both strings, but like VeraPDF it shows the null character at the end of the octal-escaped string (pdf:docinfo:producer).

Qpdf

qpdf --json phantom_modified_xmp.pdf

Output contains:

"9 0 R": {
  "/CreationDate": "D:20241029134330Z00'00'",
  "/Producer": "PDF Phantom\u0000",
  "/Title": "Boo"
}

Qpdf reports the octal escape sequences, but like VeraPDF shows a null character at the end.

Pdftk

pdftk phantom_modified_xmp.pdf dump_data

Result:

InfoBegin
InfoKey: CreationDate
InfoValue: D:20241029134330Z00&apos;00&apos;
InfoBegin
InfoKey: Producer
InfoValue: PDF Phantom
InfoBegin
InfoKey: Title
InfoValue: Boo
PdfID0: 71a810587639eb130aefddee35e3c49d
PdfID1: 71a810587639eb130aefddee35e3c49d
NumberOfPages: 1
PageMediaBegin
PageMediaNumber: 1
PageMediaRotation: 0
PageMediaRect: 0 0 595.276 841.89
PageMediaDimensions: 595.276 841.89

Pdftk correctly decodes the octal escape sequences.

PyMuPDF

Using this simple test script:

import pprint
import pymupdf

myPDF = "phantom_modified_xmp.pdf"

doc = pymupdf.open(myPDF)
metadata = doc.metadata
pprint.pp(metadata)

Result:

{'format': 'PDF 1.4',
 'title': 'Boo',
 'author': '',
 'subject': '',
 'keywords': '',
 'creator': '',
 'producer': 'PDF Phantom',
 'creationDate': "D:20241029134330Z00'00'",
 'modDate': '',
 'trapped': '',
 'encryption': None}

PyMuPDF decodes the octal escape sequences correctly.

Conclusion

All the above tools and libraries are able to decode the octal escape sequences. VeraPDF, Tika and Qpdf show a null character at the end of the producer string, but this character is also part of the source. So JHOVE's behaviour really seems to be the exception here.

petervwyatt commented 5 days ago

Let's clarify a few things first:

In the PDF file in question, the conventional PDF DocInfo Producer key is formally specified as a "text string", so it may be Unicode if the correct BoM bytes are present - as they are here, via the octal pair indicating UTF-16BE. The same string could also have been written as a hex string. The technically correct string that is stored is "PDF Phantom" followed by a NUL byte, but a lot of software will swallow that explicit NUL, mostly because of the way programming languages store their strings (e.g. C/C++ NUL-terminated strings), or when the value is passed to the operating system, since many O/S output systems are UTF-8 and trim to printables only. In this case a human might judge that the final NUL byte has no value - but it may have been intentional on the part of the producing application (perhaps it indicates a version, or is some other form of proprietary data - we don't know for sure). When this data is transcoded to UTF-8 to be saved into the XMP Metadata stream, the technically correct approach is to ensure that the trailing NUL is again included (0x00 is permitted in a UTF-8 sequence) - however, whitespace and non-printable trimming, or programming-language assumptions, will again kick in.
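A small Python sketch of this point: the NUL survives a UTF-8 round trip just fine, but C-style string handling (which many of the tools above rest on) silently truncates at the first NUL:

```python
import ctypes

# A NUL (U+0000) is a valid code point and survives a UTF-8 round trip.
producer = "PDF Phantom\x00"
encoded = producer.encode("utf-8")
assert encoded.endswith(b"\x00")           # the trailing NUL is preserved
assert encoded.decode("utf-8") == producer

# But a NUL-terminated C string stops at the first 0x00, which is one way
# the trailing byte silently disappears in tool output.
assert ctypes.create_string_buffer(encoded).value == b"PDF Phantom"
```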

There is also nothing in the core PDF spec that states the conventional DocInfo dictionary and XMP Metadata stream values have to be identical, or that limits the information in any way. It makes very good sense, but it is not mandated. Time-honoured convention, based on limitations in the UI of viewers, means that very long strings, multi-line strings, or other advanced uses of Unicode that are technically permitted in PDF Unicode strings should not be used.

A more aggressive test would be to put a NUL or other non-printables mid-string in the PDF DocInfo Producer key and see what happens. Does the data get truncated at the first NUL or other non-printable? Is the output from tools mangled? A typical example is that PDF Unicode text strings can include BCP 47 2-character language escape sequences - some tools display these, some tools don't (and if you use a screen reader or other assistive technology, these might be very important for you!).

bitsgalore commented 2 hours ago

@petervwyatt Thanks for the additional clarifications, I just updated my last comment to clear up (hopefully!) the confusing terminology.