Open petervwyatt opened 4 months ago
Possibly related - this blog post describes an issue where JHOVE's parsing goes wrong for some PDFs with mixed Unicode/octal encoded text strings (post links to example file).
I'd like to point out that the blog post mentioned is incorrect:
The PDF Reference defines: "For text strings encoded in Unicode, the first two bytes must be 254 followed by 255, representing the Unicode byte order marker, U+FEFF". A dual encoding of "\376\377" violates this, since the BOM character is no longer fitted in two bytes. Octal representation is intended to be used in the character encoding format defined in the PDF Reference (PDFDocEncoding). This should not be mixed with Unicode encoding. PDFDocEncoding is a superset of Latin-1, where "\376\377" resolves to "þÿ", not BOM.
\376\377
is the fully correct and valid UTF-16BE BoM for PDF strings when using octal. There is NOTHING WRONG with this. These are the first 2 bytes representing 254 and 255 as required by the PDF spec.
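To make the point concrete, here is a minimal Python sketch of how the escapes in a PDF literal string resolve to raw bytes before any text decoding happens. The function name `decode_literal_string` is made up for this illustration (it is not JHOVE code), and it skips some details such as normalising unescaped end-of-line markers.

```python
def decode_literal_string(body: bytes) -> bytes:
    """Resolve backslash escapes in a PDF literal string body (outer parentheses removed)."""
    simple = {ord("n"): 0x0A, ord("r"): 0x0D, ord("t"): 0x09, ord("b"): 0x08,
              ord("f"): 0x0C, ord("("): 0x28, ord(")"): 0x29, ord("\\"): 0x5C}
    out = bytearray()
    i = 0
    while i < len(body):
        c = body[i]
        if c != 0x5C:  # ordinary byte, copied as-is
            out.append(c)
            i += 1
            continue
        i += 1  # skip the backslash
        if i >= len(body):
            break
        c = body[i]
        if 0x30 <= c <= 0x37:  # one to three octal digits give ONE byte
            value, n = 0, 0
            while n < 3 and i < len(body) and 0x30 <= body[i] <= 0x37:
                value = value * 8 + (body[i] - 0x30)
                i += 1
                n += 1
            out.append(value & 0xFF)  # high-order overflow is ignored
        elif c in simple:
            out.append(simple[c])
            i += 1
        elif c in (0x0A, 0x0D):  # backslash + EOL is a line continuation
            i += 1
            if c == 0x0D and i < len(body) and body[i] == 0x0A:
                i += 1
        else:  # unknown escape: the backslash is dropped
            out.append(c)
            i += 1
    return bytes(out)


raw = decode_literal_string(rb"\376\377\000P\000D\000F")
print(raw[:2] == b"\xfe\xff")       # the two BOM bytes, 254 and 255
print(raw[2:].decode("utf-16-be"))  # the remaining bytes decoded as UTF-16BE
```

Only after this lexical step does the question of text encoding come up, and at that point the first two bytes are exactly 254 and 255.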
I would also very strongly suggest NOT referring to such an old PDF 1.4 specification! Please use both the ISO 32000 specifications, and especially checking for any clarified vendor-neutral wording in ISO 32000-2:2020.
@petervwyatt Yep, I already suspected this (but it's good to have this confirmed). Which actually makes JHOVE's behavior here even worse, because its inability to parse a perfectly valid file indirectly leads to the reporting of a completely unrelated validation error.
We are the authors of the blog post referred to above (a ghost story for Halloween). We will update it slightly based on this discussion, and add a changelog.
First of all, we are grateful for the discussion about our blog post. We’d like to clarify our views on the matter here below.
From the point of view of digital preservation, we still wouldn’t recommend stacking multiple (unnecessary) encodings, such as UTF-16BE combined with octal escapes; it might not be a sensible practice. In the long term, each encoding layer raises the risk of problems. JHOVE probably is not the only software to get confused by multilayered encoding of metadata. In future migrations, these kinds of issues need to be identified and somehow handled using the software support available then, unless we handle them today when we encounter them. This is actually the main point of our blog post. Should JHOVE give an info message about a string starting with "\376\377", which most likely has a multilayered encoding, instead of just skipping it?
About the PDF Reference:
We’ll update the quote in the blog to the current revision (ISO 32000-2:2020, ch. 7.9.2.2.1). The paragraph still describes 254 and 255 as the first two bytes of a text string, so in our case there is not much difference from the previous wording, although we admit that it does not explicitly forbid re-encoding into a multilayered encoding.
ISO 32000-2:2020 is unclear regarding octal codes: in ch. 7.3.4.2, "\ddd" is described as "character code ddd". This could be read to mean that a "character code" should resolve to a character when decoded (from some code page), which is confusing in combination with UTF-16BE, where characters take 2 or 4 bytes. Instead of the term "character code", we would use e.g. "code of a character byte" (i.e. it may also be part of a character) or, more broadly, "(octal) code of a byte". On the other hand, ISO 32000-2:2020 also states in ch. 7.3.4.2: "However, any 8-bit value may appear in a string, represented either as itself or with the \ddd notation described." We feel that users of the standard can get confused and have to interpret whether "\ddd notation described" refers to the coding of a character or of a byte (for each \ddd).
We have certainly learned more about the PDF file format thanks to the discussion here.
JHOVE probably is not the only software to get confused by multilayered encoding of metadata. In future migrations, these kinds of issues need to be identified and somehow handled using the software support available then.
Out of curiosity I did a little test using some of my favourite PDF mangling tools and libraries. First I created a modified version of the PDF in a Hex editor, where I changed the value of the XMP Producer field to "OPF Phantom". This way we can easily see what field(s) each tool actually reports.
Below are the commands and results for all tools/libraries.
exiftool -X phantom_modified_xmp.pdf
Result:
<?xml version='1.0' encoding='UTF-8'?>
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>
<rdf:Description rdf:about='phantom_modified_xmp.pdf'
xmlns:et='http://ns.exiftool.org/1.0/' et:toolkit='Image::ExifTool 12.60'
xmlns:ExifTool='http://ns.exiftool.org/ExifTool/1.0/'
xmlns:System='http://ns.exiftool.org/File/System/1.0/'
xmlns:File='http://ns.exiftool.org/File/1.0/'
xmlns:PDF='http://ns.exiftool.org/PDF/PDF/1.0/'
xmlns:XMP-x='http://ns.exiftool.org/XMP/XMP-x/1.0/'
xmlns:XMP-pdf='http://ns.exiftool.org/XMP/XMP-pdf/1.0/'>
<ExifTool:ExifToolVersion>12.60</ExifTool:ExifToolVersion>
<System:FileName>phantom_modified_xmp.pdf</System:FileName>
<System:Directory>.</System:Directory>
<System:FileSize>5.9 kB</System:FileSize>
<System:FileModifyDate>2024:11:09 00:22:05+00:00</System:FileModifyDate>
<System:FileAccessDate>2024:11:09 00:22:46+00:00</System:FileAccessDate>
<System:FileInodeChangeDate>2024:11:09 00:22:05+00:00</System:FileInodeChangeDate>
<System:FilePermissions>-rw-rw-r--</System:FilePermissions>
<File:FileType>PDF</File:FileType>
<File:FileTypeExtension>pdf</File:FileTypeExtension>
<File:MIMEType>application/pdf</File:MIMEType>
<PDF:PDFVersion>1.4</PDF:PDFVersion>
<PDF:Linearized>No</PDF:Linearized>
<PDF:PageCount>1</PDF:PageCount>
<PDF:Title>Boo</PDF:Title>
<PDF:CreateDate>2024:10:29 13:43:30Z</PDF:CreateDate>
<PDF:Producer>PDF Phantom</PDF:Producer>
<XMP-x:XMPToolkit>Image::ExifTool 12.71</XMP-x:XMPToolkit>
<XMP-pdf:Producer>OPF Phantom</XMP-pdf:Producer>
</rdf:Description>
</rdf:RDF>
ExifTool correctly decodes the octal escape sequences (PDF:Producer), and also extracts the XMP value (XMP-pdf:Producer).
pdfcpu info phantom_modified_xmp.pdf
Result:
PDF version: 1.4
Page count: 1
Page size: 595.28 x 841.89 points
............................................
Title: Boo
Author:
Subject:
PDF Producer: PDF Phantom
Content creator:
Creation date: D:20241029134330Z00'00'
Modification date:
............................................
Tagged: No
Hybrid: No
Linearized: No
Using XRef streams: No
Using object streams: No
Watermarked: No
............................................
Encrypted: No
Permissions: Full access
Pdfcpu correctly decodes the octal escape sequences (PDF Producer).
pdfinfo phantom_modified_xmp.pdf
Result:
Title: Boo
Producer: PDF Phantom
CreationDate: Tue Oct 29 14:43:30 2024 CET
Tagged: no
UserProperties: no
Suspects: no
Form: none
JavaScript: no
Pages: 1
Encrypted: no
Page size: 595.276 x 841.89 pts (A4)
Page rot: 0
File size: 5906 bytes
Optimized: no
PDF version: 1.4
Poppler correctly decodes the octal escape sequences (Producer).
verapdf --off --extract phantom_modified_xmp.pdf
Result includes:
<informationDict>
<entry key="Title">Boo</entry>
<entry key="Producer">PDF Phantom#x000000</entry>
<entry key="CreationDate">2024-10-29T13:43:30.000Z</entry>
</informationDict>
VeraPDF does decode the octal escape sequences, but shows a null character at the end (edit: this is actually part of the string!).
java -jar ~/tika/tika-app-2.9.2.jar phantom_modified_xmp.pdf
Result:
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="pdf:PDFVersion" content="1.4"/>
<meta name="pdf:docinfo:title" content="Boo"/>
<meta name="pdf:hasXFA" content="false"/>
<meta name="access_permission:modify_annotations" content="true"/>
<meta name="access_permission:can_print_degraded" content="true"/>
<meta name="dcterms:created" content="2024-10-29T13:43:30Z"/>
<meta name="dc:format" content="application/pdf; version=1.4"/>
<meta name="access_permission:fill_in_form" content="true"/>
<meta name="pdf:hasCollection" content="false"/>
<meta name="pdf:encrypted" content="false"/>
<meta name="dc:title" content="Boo"/>
<meta name="Content-Length" content="5906"/>
<meta name="pdf:hasMarkedContent" content="false"/>
<meta name="Content-Type" content="application/pdf"/>
<meta name="pdf:producer" content="OPF Phantom"/>
<meta name="access_permission:extract_for_accessibility" content="true"/>
<meta name="access_permission:assemble_document" content="true"/>
<meta name="xmpTPg:NPages" content="1"/>
<meta name="resourceName" content="phantom_modified_xmp.pdf"/>
<meta name="pdf:hasXMP" content="true"/>
<meta name="access_permission:extract_content" content="true"/>
<meta name="access_permission:can_print" content="true"/>
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.pdf.PDFParser"/>
<meta name="access_permission:can_modify" content="true"/>
<meta name="pdf:docinfo:producer" content="PDF Phantom�"/>
<meta name="pdf:docinfo:created" content="2024-10-29T13:43:30Z"/>
<title>Boo</title>
</head>
<body><div class="page"><p/>
</div>
</body></html>
Tika reports both strings, but like VeraPDF shows the null character at the end of the octal escape sequences (pdf:docinfo:producer).
qpdf --json phantom_modified_xmp.pdf
Output contains:
"9 0 R": {
"/CreationDate": "D:20241029134330Z00'00'",
"/Producer": "PDF Phantom\u0000",
"/Title": "Boo"
}
Qpdf reports the octal escape sequences, but like VeraPDF shows a null character at the end.
pdftk phantom_modified_xmp.pdf dump_data
Result:
InfoBegin
InfoKey: CreationDate
InfoValue: D:20241029134330Z00'00'
InfoBegin
InfoKey: Producer
InfoValue: PDF Phantom
InfoBegin
InfoKey: Title
InfoValue: Boo
PdfID0: 71a810587639eb130aefddee35e3c49d
PdfID1: 71a810587639eb130aefddee35e3c49d
NumberOfPages: 1
PageMediaBegin
PageMediaNumber: 1
PageMediaRotation: 0
PageMediaRect: 0 0 595.276 841.89
PageMediaDimensions: 595.276 841.89
Pdftk correctly decodes the octal escape sequences.
Using this simple test script:
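import pprint
import pymupdf

# Modified test file (XMP Producer value changed to "OPF Phantom")
myPDF = "phantom_modified_xmp.pdf"
doc = pymupdf.open(myPDF)
# Dictionary of metadata values as reported by PyMuPDF
metadata = doc.metadata
pprint.pp(metadata)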
Result:
{'format': 'PDF 1.4',
'title': 'Boo',
'author': '',
'subject': '',
'keywords': '',
'creator': '',
'producer': 'PDF Phantom',
'creationDate': "D:20241029134330Z00'00'",
'modDate': '',
'trapped': '',
'encryption': None}
PyMuPDF does decode the octal escape sequences correctly.
All the above tools and libraries are able to decode the octal escape sequences. VeraPDF, Tika and Qpdf show a null character at the end of the producer string, but this character is also part of the source. So JHOVE's behaviour really seems to be the exception here.
Let's clarify a few things first:
firstly, octal escape sequences are NOT an "encoding" in PDF - they are a fundamental part of the lexical definition of PDF literal string objects. See clause 7.3.4. PDF processing software must support them or there will be a huge amount of issues! It is equivalent to not parsing numerics correctly.
data encoding of the bytes in PDF string objects (of which there are 2 lexical forms: literal strings and hex strings) is entirely independent of the lexical form used to store the string in a PDF file. Any string can be a literal or hex string unless stated otherwise in the PDF spec (PS. there are a few places where that is done to mandate the hex form)
the data encoding of the bytes in PDF string objects is defined by clause 7.9.2. and the Figure 7 hierarchy. Every key value or array entry that is a PDF string uses very precise language as to which "branch"(es) in Figure 7 is/are being referenced - only strings specified as "string" or "text string" can contain Unicode-encoded data (as indicated by the leading BoM bytes if present). If something is a "byte string" then the byte data has no inherently defined encoding built into the string object: it is entirely context-dependent - just because it starts with what looks like a Unicode BoM does not mean it is Unicode. And all this is independent of the lexical form of the string (literal or hex).
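The separation between lexical form and data encoding can be sketched in Python as two independent steps. Both function names are hypothetical, and the Latin-1 fallback is only a stand-in: real PDFDocEncoding differs from Latin-1 in a number of code points, so a proper implementation would need its own mapping table.

```python
PDF_WHITESPACE = b"\x00\t\n\x0c\r "


def decode_hex_string(body: bytes) -> bytes:
    """Lexical step: decode a PDF hex string body (without < >), skipping whitespace and EOLs."""
    digits = bytes(b for b in body if b not in PDF_WHITESPACE)
    if len(digits) % 2:
        digits += b"0"  # an odd final digit is treated as if followed by 0
    return bytes.fromhex(digits.decode("ascii"))


def decode_text_string(raw: bytes) -> str:
    """Encoding step: interpret the bytes of a PDF *text string* by looking at the leading BOM."""
    if raw.startswith(b"\xfe\xff"):      # UTF-16BE
        return raw[2:].decode("utf-16-be")
    if raw.startswith(b"\xef\xbb\xbf"):  # UTF-8, introduced with PDF 2.0
        return raw[3:].decode("utf-8")
    return raw.decode("latin-1")         # placeholder for PDFDocEncoding (assumption)


# A hex string split over two lines carries the same bytes as the literal
# string (\376\377\000P\000D\000F), and so yields the same text.
raw = decode_hex_string(b"FEFF0050\n0044 0046")
print(decode_text_string(raw))
```

Note that `decode_text_string` should only be applied where the spec says "text string" (or "string"); for a "byte string" the same leading bytes carry no such meaning.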
In the PDF file in question, the conventional PDF DocInfo Producer key is formally specified as a "text string" so it might be Unicode if the correct BoM bytes are present - like they are as the octal pair to indicate UTF-16BE. This same string could have been a hex string too. The technically correct string that is stored is "PDF Phantom\0x00
is permitted in a UTF-8 sequence) - however whitespace and unprintables trimming or programming language assumptions will again kick in.
There is also nothing in the core PDF spec that states the conventional DocInfo dictionary and XMP Metadata stream values have to be identical or limit the information in any way. It makes very good sense but it is not mandated. Time-honoured convention based on limitations in the UI of viewers means that very long, multi-line, or other advanced uses of Unicode that is technically permitted with PDF Unicode strings should not be used.
A more aggressive test would be to put \
@petervwyatt Thanks for the additional clarifications, I just updated my last comment to clear up (hopefully!) the confusing terminology.
Some issues noted about parsing PDFs:
{ and } are not PDF delimiter tokens except within Type 4 PostScript functions (i.e. they are PS delimiters only) so using them elsewhere is incorrect. This was a long-standing error in PDF specifications.
PDF-hul header check is for %PDF-1 but spec says it is %PDF- followed by any digit (0-9), . and another digit, so PDF 2.0 files should report as a PDF file, but with an unsupported PDF version until such time as you support PDF 2.0. JHOVE currently reports PDF 2.0 files as a bytestream, which is incorrect.
PDF-hul crashes if a PDF hex-string contains EOL characters - this is permitted by the PDF spec as whitespace can occur in hex-strings and the EOLs are considered whitespace. (For what it is worth, hex-strings and literal strings are the only 2 types of PDF tokens or keywords that can span multiple lines).
there seem to be assumptions with PDF-hul-xx error codes that a key with an explicit null value is invalid whereas the PDF spec states that such keys should be ignored (same as not present). An easy test is to set /Annots null on any page and compare behaviour to not having an /Annots entry present.
Java exception gets thrown if cross-reference sub-section marker lines (of 2 integers) start with a negative number (i.e. for the object number).
FileSpecification.java does not account for the UF entry added with PDF 1.7. This was noticed from a code review.
there is something strange going on when encountering empty names (i.e. just a '/' followed by nothing, which is a valid PDF name). PDump correctly lists as a Name object with empty string "", but if 2 empty names are appended to a trailer dictionary (i.e. a valid key/value dictionary entry) then JHOVE doesn't work properly...
please consider adding support for UTF-8 text strings introduced with PDF 2.0. This was noted from a code review. Also note that UTF-8 strings do occur in some pre-PDF 2.0 files...
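As an illustration of the header point above, a version-agnostic check could look roughly like the following Python sketch. This is not JHOVE's (Java) code; `classify_header` and `SUPPORTED_MAJOR` are invented names, and the set of supported major versions is an assumption made for the example.

```python
import re

# "%PDF-" followed by a digit, a period and another digit
HEADER_RE = re.compile(rb"%PDF-(\d)\.(\d)")

# Assumption for this sketch: only PDF 1.x is fully supported
SUPPORTED_MAJOR = {1}


def classify_header(first_bytes: bytes) -> str:
    """Classify the start of a file by its PDF header, keeping 'is a PDF' and 'is supported' apart."""
    m = HEADER_RE.match(first_bytes)
    if m is None:
        return "not a PDF header"
    major = int(m.group(1))
    version = f"{major}.{m.group(2).decode('ascii')}"
    if major in SUPPORTED_MAJOR:
        return f"PDF {version}"
    return f"PDF {version} (unsupported version)"


print(classify_header(b"%PDF-1.4\n"))
print(classify_header(b"%PDF-2.0\n"))
print(classify_header(b"GIF89a"))
```

The design point is simply that identifying the format and deciding whether the version is supported are two separate decisions, so a PDF 2.0 file is still reported as PDF and not as a bytestream.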