openpreserve / jhove

File validation and characterisation.
http://jhove.openpreservation.org
Other
162 stars 78 forks source link

PDF: Problem parsing Unicode text strings #277

Open david-russo opened 6 years ago

david-russo commented 6 years ago

Dev Effort

1D

Description

The Title field of this document's information dictionary contains a byte sequence which causes the PDF parser to read beyond the end of the text string and interpret the succeeding bytes as additional Unicode characters, eventually leading to a NullPointerException. The problematic byte sequence starts at offset 146AA4, and may be part of a backslash escape sequence.

Problem found in JHOVE 1.16.7, PDF-hul 1.9.

samalloing commented 4 years ago

Hi @carlwilson ,

Can I take this issue next?

Sam

samalloing commented 4 years ago

Hi Carl,

I want to report back on this issue. I found a solution, but I'm not happy with it yet, because I wonder if there is a more generic solution. I didn't do a pull request yet, but left the commit on my github: https://github.com/samalloing/jhove/commit/e12937d97f7827f597d6433f673c402603dac4ed. The problem is that after the backslash at the position indicated in the issue description there are two characters that are single byte instead of double byte. This mess up the processing of the string after that. The first character is an escaped enter. I translated it to 0 and continue the state to LITERAL_UTF16_2. This makes it work. So in code this means in the readbackslash method I add an extra case for the 0xd character and return 0 as it is an error. On the processLiteral method where the UTF-16 is processed I set the state for invalid cases to _state = State.LITERAL_UTF16_2. So I'm not sure if this fix is good to add as a solution for the issue.

I also corrected two other problems. One is that the PDF_HUL_149 error wasn't showing up in the error messages. What happens is the error message is added to PDF_HUL_122. So the ID from PDF_HUL_149 is removed, but the message is added to PDF_HUL_122. I don't understand why this is happening. It is also not clear what the purpose is of this. Because you don't need PDF_HUL_149 then and this is the only occurrence. The solution could be to remove PDF_HUL_149 or just to add PDF_HUL_149 as an ID. I choose to add PDF_HUL_149 as an ID. This leaves the problem that because an invalid pdf exception is thrown that we also come in to a catch part of addDestination method and the message (not the ID) is added to PDF_HUL_122 as well. So I would suggest removing the invalidPdfException and leave only the PDF_HUL_149 error. I also added RepInfo to resolveIndirectDest for this to work.

The last fix is in the output of jhove in Outlines - Item - Destination the java object is printed instead of the contents of the Destination. I added the getStringValue to dest.getIndirectDest() https://github.com/openpreserve/jhove/blob/de1f5fd8427a0e8365b44ec095e34bf4d2cde8ad/jhove-modules/pdf-hul/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L4041. This is just a bug. So this is save to add. I'm not sure about the other fixes.

Thanks for any feedback on these fixes.

Sam

david-russo commented 4 years ago

The relevant part of the PDF (1.7) spec. seems to be section 7.3.4.2 "Literal Strings", which includes:

A conforming writer may split a literal string across multiple lines. The REVERSE SOLIDUS (5Ch) (backslash character) at the end of a line shall be used to indicate that the string continues on the following line. A conforming reader shall disregard the REVERSE SOLIDUS and the end-of-line marker following it when reading the string; the resulting string value shall be identical to that which would be read if the string were not split.

EXAMPLE 2  (These \
           two strings \
           are the same.)
           (These two strings are the same.)

Which seems to indicate that if we come across an escaped newline (5C0A or 5C0D) in a string, we should disregard it and continue on as if it hadn't been there.

The problematic newline escape sequence (5C0D) appears to have been inserted right in the middle of what should be a UTF-16-encoded "a" (0061) and part of the word "at" (0061 0074):

... 005C 0D61 0074 ...

So if we have the parser discard any escaped newlines, that should allow the recombination of the a's first byte (00) with its last byte (61) to form a valid UTF-16 character.

The key thing to recognize here is that the leading 00s don't belong to the 5C ("\") because the "\" isn't a (double-byte) UTF-16-encoded character in the string, but instead part of a PDF-specific escape syntax apparently encoded in (single-byte) ASCII. If the escape sequence were encoded in UTF-16, it would have been 005C 000D.

carlwilson commented 4 years ago

Am just completing a sweep up the notifications and it's too late to get my head into this so leaving this one for a less sleepy head tomorrow.

MartinSpeller commented 4 years ago

PDF: Problem parsing Unicode text strings #277 - Assigned to samalloing

samalloing commented 4 years ago

Thanks @david-russo , this helps a lot. I'll look into this!

@carlwilson, no problem. Take your time

@MartinSpeller , do you expect me to do something with this trello task? Because I don't have access to the Trello board.

Sam

jackdos commented 4 years ago

@david-russo I wouldn't expect that an escaped new-line character to be written in the middle of another character, which is what you seem to be suggesting here? That sounds like the writer treating the UTF-16 string as UTF-8 when re-formatting something? Is that potentially an error in formatting that the parser should be recognising? UTF-16 parsing should find 00 5c as a single backslash character, could we say that that character followed by an immediate single-byte LF or CR is actually an error?

david-russo commented 4 years ago

At first I thought it might've been an error in the writer as well, but Adobe's own Acrobat Distiller 8 & 9 seem to write titles like this (based on Producer metadata), and Adobe Reader has no trouble correctly displaying any of those I've come across, so I suspect this may not be an invalid way of encoding escape sequences in UTF-16 string literals.

Any characters may appear in a string except unbalanced parentheses and the backslash (REVERSE SOLIDUS (5Ch)), which shall be treated specially as described in this subclause...

Within a literal string, the REVERSE SOLIDUS is used as an escape character. The character immediately following the REVERSE SOLIDUS determines its precise interpretation as shown in "Table 3 — Escape sequences in literal strings". If the character following the REVERSE SOLIDUS is not one of those shown in "Table 3 ...", the REVERSE SOLIDUS shall be ignored.

Table 3 shows the escape sequences and their meanings, eg: "\n = LINE FEED (0Ah) (LF)". From that part of the spec. and its associated table, it seems like any appearance of a 5C byte should start an escape sequence of some kind, or be ignored.

Interpreting the spec. with the assumption that these encodings were valid led me to my earlier reasoning that the escape sequence characters (and the characters they encode) are always written as 8-bits each, and should be removed from the character sequence entirely where appropriate (new lines), or replaced with their intended 8-bit character (such as an escaped left parenthesis (28h)), before rendering for display.

Of course replacing two 8-bit characters with one 8-bit character (or none) doesn't make an entire 16-bit UTF-16 character. But because all (?) valid escape sequences seem to result in characters that fit in a single byte, I'm guessing the PDF writer may automatically write the first UTF-16 byte as 00h, which would make it compatible with anything that follows in the ASCII range. This should allow the reader to just throw out all 5Cs (and any following newlines) as it comes across them, leaving the remaining 8-bit character that was escaped to complete the last byte of the UTF-16 character.

It does seem a little convoluted, but it might be because string literals can be in a number of different encodings (e.g. UTF, PDFDoc), and these escape sequences appear to apply to all types of encoding, so it might be expected that all string literals are subject to a two-part process of stripping and replacing escape sequences from what is essentially a binary stream before trying to decode that stream as characters.

Argh — too. many. words.