openpreserve / jhove

File validation and characterisation.
http://jhove.openpreservation.org

PDF-hul: various issues with parsing PDFs #927

Open petervwyatt opened 4 months ago

petervwyatt commented 4 months ago

Some issues noted about parsing PDFs:

bitsgalore commented 1 week ago

Possibly related - this blog post describes an issue where JHOVE's parsing goes wrong for some PDFs with mixed Unicode/octal encoded text strings (post links to example file).

petervwyatt commented 1 week ago

I'd like to point out that the blog post mentioned is incorrect:

The PDF Reference defines: "For text strings encoded in Unicode, the first two bytes must be 254 followed by 255, representing the Unicode byte order marker, U+FEFF". A dual encoding of "\376\377" violates this, since the BOM character is no longer fitted in two bytes. Octal representation is intended to be used in the character encoding format defined in the PDF Reference (PDFDocEncoding). This should not be mixed with Unicode encoding. PDFDocEncoding is a superset of Latin-1, where "\376\377" resolves to "þÿ", not BOM.

\376\377 is the fully correct and valid UTF-16BE BoM for PDF strings when using octal. There is NOTHING WRONG with this. These are the first 2 bytes representing 254 and 255 as required by the PDF spec.
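For reference, the octal arithmetic can be checked with a minimal Python sketch (the decoder below is ours, not JHOVE's, and it handles only the \ddd escapes relevant here):

```python
import re

def decode_pdf_octal(raw: bytes) -> bytes:
    """Resolve \\ddd octal escapes in a PDF literal string to raw bytes.
    Minimal sketch: handles only \\ddd, not the other escape sequences."""
    return re.sub(rb"\\([0-7]{1,3})",
                  lambda m: bytes([int(m.group(1), 8) & 0xFF]),
                  raw)

# "\376\377" resolves to bytes 254 and 255 (0xFE 0xFF), the UTF-16BE BOM,
# exactly as the spec requires for Unicode text strings.
decoded = decode_pdf_octal(rb"\376\377\000P\000D\000F")
assert decoded[:2] == b"\xfe\xff"
assert decoded[2:].decode("utf-16-be") == "PDF"
```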

I would also very strongly suggest NOT referring to such an old PDF 1.4 specification! Please use both the ISO 32000 specifications, and especially checking for any clarified vendor-neutral wording in ISO 32000-2:2020.

bitsgalore commented 1 week ago

@petervwyatt Yep, I already suspected this (but it's good to have this confirmed). Which actually makes JHOVE's behavior here even worse, because its inability to parse a perfectly valid file indirectly leads to the reporting of a completely unrelated validation error.

jmlehton commented 6 days ago

We are the authors of the referenced blog post (a ghost story, published for Halloween). We will update it a little, based on this discussion, with a changelog.

First of all, we are grateful for the discussion about our blog post. We’d like to clarify our views on the matter here below.

From the point of view of digital preservation, we still wouldn't recommend stacking multiple (unnecessary) encodings, such as UTF-16BE combined with octal escaping; it may not be a sensible practice. In the long term, each encoding layer raises the risk of future problems. JHOVE is probably not the only software confused by multilayered encoding of metadata. In future migrations, these kinds of issues will need to be identified and somehow handled with whatever software support is available then, unless we handle them today when we encounter them. This is actually the main point of our blog post: should JHOVE give an info message about a string starting with "\376\377", which most likely has a multilayered encoding, instead of just skipping it?
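The info-message idea could be as simple as the following sketch (a hypothetical helper of ours, not an actual JHOVE API), flagging literal strings whose UTF-16BE BOM is written as octal escapes rather than raw bytes:

```python
def has_octal_escaped_bom(raw: bytes) -> bool:
    """True if a PDF literal string begins with the octal-escaped
    UTF-16BE byte order marker (\\376\\377) rather than raw 0xFE 0xFF.
    Hypothetical check, not JHOVE code."""
    return raw.startswith(rb"\376\377")

assert has_octal_escaped_bom(rb"\376\377\000B\000o\000o")     # octal BOM
assert not has_octal_escaped_bom(b"\xfe\xff\x00B\x00o\x00o")  # raw BOM bytes
```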

About the PDF Reference:

We’ll update the quote in the blog to the current revision (ISO 32000-2:2020, ch. 7.9.2.2.1). That paragraph still describes 254 and 255 as the first two bytes of a text string, so in our case there is not much difference from the previous wording, although we admit that it does not explicitly forbid re-encoding into a multilayered encoding.

ISO 32000-2:2020 is unclear regarding octal codes. In ch. 7.3.4.2, "\ddd" is described as "character code ddd". This may be read as meaning that a "character code" should resolve to a character when decoded (from some code page), which is confusing in combination with UTF-16BE, whose characters occupy 2-4 bytes. Instead of the term "character code", we would use e.g. "code of a character byte" (i.e. it may also be part of a character) or, more broadly, "(octal) code of a byte". On the other hand, ISO 32000-2:2020 also states in ch. 7.3.4.2: "However, any 8-bit value may appear in a string, represented either as itself or with the \ddd notation described." We feel that a user of the standard can get confused and has to interpret whether the "\ddd notation described" refers to the coding of a character or of a byte (for each \ddd).

We have certainly learned more about the PDF file format thanks to the discussion here.

bitsgalore commented 5 days ago

JHOVE probably is not the only software to get confused by multilayered encoding of metadata. In future migrations, these kinds of issues need to be identified and somehow handled using the software support available then.

Out of curiosity I did a little test using some of my favourite PDF mangling tools and libraries. First I created a modified version of the PDF in a hex editor, changing the value of the XMP Producer field to "OPF Phantom". This way we can easily see which field(s) each tool actually reports.

Below are the commands and results for all tools/libraries.

ExifTool

exiftool -X phantom_modified_xmp.pdf

Result:

<?xml version='1.0' encoding='UTF-8'?>
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>

<rdf:Description rdf:about='phantom_modified_xmp.pdf'
  xmlns:et='http://ns.exiftool.org/1.0/' et:toolkit='Image::ExifTool 12.60'
  xmlns:ExifTool='http://ns.exiftool.org/ExifTool/1.0/'
  xmlns:System='http://ns.exiftool.org/File/System/1.0/'
  xmlns:File='http://ns.exiftool.org/File/1.0/'
  xmlns:PDF='http://ns.exiftool.org/PDF/PDF/1.0/'
  xmlns:XMP-x='http://ns.exiftool.org/XMP/XMP-x/1.0/'
  xmlns:XMP-pdf='http://ns.exiftool.org/XMP/XMP-pdf/1.0/'>
 <ExifTool:ExifToolVersion>12.60</ExifTool:ExifToolVersion>
 <System:FileName>phantom_modified_xmp.pdf</System:FileName>
 <System:Directory>.</System:Directory>
 <System:FileSize>5.9 kB</System:FileSize>
 <System:FileModifyDate>2024:11:09 00:22:05+00:00</System:FileModifyDate>
 <System:FileAccessDate>2024:11:09 00:22:46+00:00</System:FileAccessDate>
 <System:FileInodeChangeDate>2024:11:09 00:22:05+00:00</System:FileInodeChangeDate>
 <System:FilePermissions>-rw-rw-r--</System:FilePermissions>
 <File:FileType>PDF</File:FileType>
 <File:FileTypeExtension>pdf</File:FileTypeExtension>
 <File:MIMEType>application/pdf</File:MIMEType>
 <PDF:PDFVersion>1.4</PDF:PDFVersion>
 <PDF:Linearized>No</PDF:Linearized>
 <PDF:PageCount>1</PDF:PageCount>
 <PDF:Title>Boo</PDF:Title>
 <PDF:CreateDate>2024:10:29 13:43:30Z</PDF:CreateDate>
 <PDF:Producer>PDF Phantom</PDF:Producer>
 <XMP-x:XMPToolkit>Image::ExifTool 12.71</XMP-x:XMPToolkit>
 <XMP-pdf:Producer>OPF Phantom</XMP-pdf:Producer>
</rdf:Description>
</rdf:RDF>

ExifTool correctly decodes the octal escape sequences (PDF:Producer), and also extracts the XMP value (XMP-pdf:Producer).

Pdfcpu

pdfcpu info phantom_modified_xmp.pdf

Result:

         PDF version: 1.4
          Page count: 1
           Page size: 595.28 x 841.89 points
............................................
               Title: Boo
              Author: 
             Subject: 
        PDF Producer: PDF Phantom
     Content creator: 
       Creation date: D:20241029134330Z00'00'
   Modification date: 
............................................
              Tagged: No
              Hybrid: No
          Linearized: No
  Using XRef streams: No
Using object streams: No
         Watermarked: No
............................................
           Encrypted: No
         Permissions: Full access

Pdfcpu correctly decodes the octal escape sequences (PDF Producer).

pdfinfo (Poppler)

pdfinfo phantom_modified_xmp.pdf

Result:

Title:          Boo
Producer:       PDF Phantom
CreationDate:   Tue Oct 29 14:43:30 2024 CET
Tagged:         no
UserProperties: no
Suspects:       no
Form:           none
JavaScript:     no
Pages:          1
Encrypted:      no
Page size:      595.276 x 841.89 pts (A4)
Page rot:       0
File size:      5906 bytes
Optimized:      no
PDF version:    1.4

Poppler correctly decodes the octal escape sequences (Producer).

VeraPDF

verapdf --off --extract phantom_modified_xmp.pdf

Result includes:

<informationDict>
  <entry key="Title">Boo</entry>
  <entry key="Producer">PDF Phantom#x000000</entry>
  <entry key="CreationDate">2024-10-29T13:43:30.000Z</entry>
</informationDict>

VeraPDF does decode the octal escape sequences, but shows a null character at the end (edit: this is actually part of the string!).

Apache Tika

java -jar ~/tika/tika-app-2.9.2.jar phantom_modified_xmp.pdf

Result:

<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="pdf:PDFVersion" content="1.4"/>
<meta name="pdf:docinfo:title" content="Boo"/>
<meta name="pdf:hasXFA" content="false"/>
<meta name="access_permission:modify_annotations" content="true"/>
<meta name="access_permission:can_print_degraded" content="true"/>
<meta name="dcterms:created" content="2024-10-29T13:43:30Z"/>
<meta name="dc:format" content="application/pdf; version=1.4"/>
<meta name="access_permission:fill_in_form" content="true"/>
<meta name="pdf:hasCollection" content="false"/>
<meta name="pdf:encrypted" content="false"/>
<meta name="dc:title" content="Boo"/>
<meta name="Content-Length" content="5906"/>
<meta name="pdf:hasMarkedContent" content="false"/>
<meta name="Content-Type" content="application/pdf"/>
<meta name="pdf:producer" content="OPF Phantom"/>
<meta name="access_permission:extract_for_accessibility" content="true"/>
<meta name="access_permission:assemble_document" content="true"/>
<meta name="xmpTPg:NPages" content="1"/>
<meta name="resourceName" content="phantom_modified_xmp.pdf"/>
<meta name="pdf:hasXMP" content="true"/>
<meta name="access_permission:extract_content" content="true"/>
<meta name="access_permission:can_print" content="true"/>
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.pdf.PDFParser"/>
<meta name="access_permission:can_modify" content="true"/>
<meta name="pdf:docinfo:producer" content="PDF Phantom�"/>
<meta name="pdf:docinfo:created" content="2024-10-29T13:43:30Z"/>
<title>Boo</title>
</head>
<body><div class="page"><p/>
</div>
</body></html>

Tika reports both strings, but like VeraPDF it shows the null character at the end of the octal-escaped string (pdf:docinfo:producer).

Qpdf

qpdf --json phantom_modified_xmp.pdf

Output contains:

"9 0 R": {
  "/CreationDate": "D:20241029134330Z00'00'",
  "/Producer": "PDF Phantom\u0000",
  "/Title": "Boo"
}

Qpdf reports the octal escape sequences, but like VeraPDF shows a null character at the end.

Pdftk

pdftk phantom_modified_xmp.pdf dump_data

Result:

InfoBegin
InfoKey: CreationDate
InfoValue: D:20241029134330Z00&apos;00&apos;
InfoBegin
InfoKey: Producer
InfoValue: PDF Phantom
InfoBegin
InfoKey: Title
InfoValue: Boo
PdfID0: 71a810587639eb130aefddee35e3c49d
PdfID1: 71a810587639eb130aefddee35e3c49d
NumberOfPages: 1
PageMediaBegin
PageMediaNumber: 1
PageMediaRotation: 0
PageMediaRect: 0 0 595.276 841.89
PageMediaDimensions: 595.276 841.89

Pdftk correctly decodes the octal escape sequences.

PyMuPDF

Using this simple test script:

import pprint
import pymupdf

myPDF = "phantom_modified_xmp.pdf"

doc = pymupdf.open(myPDF)
metadata = doc.metadata
pprint.pp(metadata)

Result:

{'format': 'PDF 1.4',
 'title': 'Boo',
 'author': '',
 'subject': '',
 'keywords': '',
 'creator': '',
 'producer': 'PDF Phantom',
 'creationDate': "D:20241029134330Z00'00'",
 'modDate': '',
 'trapped': '',
 'encryption': None}

PyMuPDF decodes the octal escape sequences correctly.

Conclusion

All the above tools and libraries are able to decode the octal escape sequences. VeraPDF, Tika and Qpdf show a null character at the end of the producer string, but this character is also part of the source. So JHOVE's behaviour really seems to be the exception here.

petervwyatt commented 5 days ago

Let's clarify a few things first:

In the PDF file in question, the conventional PDF DocInfo Producer key is formally specified as a "text string", so it may be Unicode if the correct BoM bytes are present - as they are here, via the octal pair indicating UTF-16BE. The same string could also have been written as a hex string. The technically correct string that is stored is "PDF Phantom" followed by a NUL byte, but a lot of software will swallow that explicit NUL, mostly because of the way programming languages store their strings (e.g. C/C++ NUL-terminated strings), or when the value is passed to the operating system, since many O/S output systems are UTF-8 and trim to printables only. In this case a human might judge that the final NUL byte has no value - but it may have been intentional on the part of the producing application (perhaps it indicates a version, or is some other form of proprietary data - we don't know for sure). When this data is transcoded to UTF-8 to be saved into the XMP Metadata stream, the technically correct approach is to ensure that the trailing NUL is again included (0x00 is permitted in a UTF-8 sequence) - however, whitespace and non-printable trimming, or programming-language assumptions, will again kick in.
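A small Python sketch of this point: the NUL survives a UTF-8 round trip just fine, but C-style string handling (which many of the tools above rest on) silently truncates at the first NUL:

```python
import ctypes

# A NUL (U+0000) is a valid code point and survives a UTF-8 round trip.
producer = "PDF Phantom\x00"
encoded = producer.encode("utf-8")
assert encoded.endswith(b"\x00")           # the trailing NUL is preserved
assert encoded.decode("utf-8") == producer

# But a NUL-terminated C string stops at the first 0x00, which is one way
# the trailing byte silently disappears in tool output.
assert ctypes.create_string_buffer(encoded).value == b"PDF Phantom"
```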

There is also nothing in the core PDF spec that states the conventional DocInfo dictionary and XMP Metadata stream values have to be identical, or that limits the information in any way. It makes very good sense, but it is not mandated. Time-honoured convention, based on limitations in the UI of viewers, means that very long strings, multi-line strings, or other advanced uses of Unicode that are technically permitted in PDF Unicode strings should not be used.

A more aggressive test would be to put a NUL or other non-printables mid-string in the PDF DocInfo Producer key and see what happens. Does the data get truncated at the first NUL or other non-printable? Is the output from tools mangled? A typical example is that PDF Unicode text strings can include BCP 47 2-character language escape sequences - some tools display these, some tools don't (and if you use a screen reader or other assistive technology, these might be very important for you!).

bitsgalore commented 2 hours ago

@petervwyatt Thanks for the additional clarifications, I just updated my last comment to clear up (hopefully!) the confusing terminology.