openpreserve / jhove

File validation and characterisation.
http://jhove.openpreservation.org

Problem uncompressing PDF object ending with a "CR" character #870

Open SebastienDegand opened 1 year ago

SebastienDegand commented 1 year ago

I have some PDF files that fail validation with this error:

"errorMessages": [ "Malformed cross reference stream (null)" ]

Unfortunately I can't share the PDF files, but after some debugging I found the problematic object. It's this one:

image

Here is the base64 of the object, in case it helps: 931_b64.txt

I don't really understand what this object is, but the compressed stream ends with the bytes CR and LF. In JHOVE's Tokenizer code I found the following, and execution does go through this branch:

edu.harvard.hul.ois.jhove.module.pdf.Tokenizer

else if (_state == (State.ENDSTREAM)) {
    if (isDelimiter (_ch) || isWhitespace (_ch)) {
        _state = State.WHITESPACE;
        // The line break, if any, before endstream
        // is not counted in the length.
        if (prelastch == CR && lastch == LF) {
            length -= 2;
        }

So it removes the two bytes CR and LF, but the CR belongs to the compressed data, so zlib decompression fails because the stream is incomplete. I ran a test: zlib can decompress the data when the CR is included, but not without it. Also, the object header declares "R/Length 2069", whereas with the CR removed the length is 2068.
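The effect can be reproduced outside JHOVE. The following is a minimal sketch, not JHOVE code: the class and method names are hypothetical, and the payload is made up. It relies on the zlib format ending with a big-endian Adler-32 checksum, so a padding byte is chosen to make the last compressed byte equal CR (0x0D). Dropping that one byte is enough for `java.util.zip.Inflater` to stop before the end of the stream.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical demo class; nothing here is part of JHOVE.
class TruncatedStreamDemo {

    /** Compresses data into a zlib stream (the format used by PDF's FlateDecode). */
    static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Returns true only if the whole zlib stream, checksum included, can be read. */
    static boolean inflatesCompletely(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        byte[] buffer = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    return false; // input exhausted before the end of the stream
                }
            }
            return true;
        } catch (DataFormatException e) {
            return false;
        } finally {
            inflater.end();
        }
    }

    /**
     * Builds a zlib stream whose final byte is CR (0x0D).
     * The last byte of a zlib stream is the low byte of the Adler-32 sum
     * (1 + sum of all input bytes, for short inputs), so one padding byte
     * is appended to steer that value to 0x0D.
     */
    static byte[] sampleStream() {
        byte[] payload = "made-up payload, not the object from the report"
                .getBytes(StandardCharsets.US_ASCII);
        int sum = 0;
        for (byte b : payload) {
            sum += b & 0xFF;
        }
        byte[] input = Arrays.copyOf(payload, payload.length + 1);
        input[payload.length] = (byte) ((0x0C - sum) & 0xFF);
        return deflate(input);
    }

    public static void main(String[] args) {
        byte[] full = sampleStream();
        byte[] withoutCr = Arrays.copyOf(full, full.length - 1);
        System.out.printf("last byte: 0x%02X%n", full[full.length - 1] & 0xFF);
        System.out.println("full stream inflates: " + inflatesCompletely(full));
        System.out.println("stream minus CR inflates: " + inflatesCompletely(withoutCr));
    }
}
```

In the file described above the same thing happens one level up: the tokenizer sees CR followed by LF, treats both as the line break before endstream, and hands the decompressor 2068 bytes instead of 2069.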

Is this a limitation of JHOVE, a bug, or something that should not happen in a PDF?

Thanks

carlwilson commented 10 months ago

Hi @SebastienDegand, thanks for reporting this. Given that the object is compressed and a stream should allow all encodings/compression, I think it's likely that this is a JHOVE bug. The sequence should be legal, and it's JHOVE's token parsing logic that's been found wanting.
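One possible direction, sketched here only as an illustration and not as JHOVE's actual fix, is to let the declared /Length decide how many trailing end-of-line bytes to drop, and to fall back to the CR/LF heuristic only when the declared value is missing or inconsistent. The class and method names are hypothetical, and the sketch assumes the declared length is already known when endstream is reached, which may not hold if /Length is an indirect reference.

```java
// Hypothetical helper; names and structure are assumptions, not JHOVE code.
class EndstreamTrim {
    static final int CR = 0x0D;
    static final int LF = 0x0A;

    /** True if the bytes from index `from` to the end are empty, CR, LF, or CR LF. */
    static boolean isOptionalEol(byte[] scanned, int from) {
        int len = scanned.length - from;
        if (len == 0) return true;
        if (len == 1) return scanned[from] == CR || scanned[from] == LF;
        return len == 2 && scanned[from] == CR && scanned[from + 1] == LF;
    }

    /**
     * Decides how many of the bytes scanned between the "stream" line and
     * "endstream" are stream data.
     *
     * @param scanned        every byte seen before the endstream keyword
     * @param declaredLength value of /Length, or -1 if it is not available
     */
    static int dataLength(byte[] scanned, int declaredLength) {
        int n = scanned.length;
        // Prefer the declared length when only an optional EOL follows it.
        if (declaredLength >= 0 && declaredLength <= n
                && isOptionalEol(scanned, declaredLength)) {
            return declaredLength;
        }
        // Fallback heuristic: strip a trailing CR LF, or a single CR or LF.
        if (n >= 2 && scanned[n - 2] == CR && scanned[n - 1] == LF) return n - 2;
        if (n >= 1 && (scanned[n - 1] == CR || scanned[n - 1] == LF)) return n - 1;
        return n;
    }
}
```

For the case in this report (2069 data bytes ending in CR, followed by LF), the declared length of 2069 leaves a lone LF as the end-of-line marker, so the CR stays in the data. The fallback alone would strip both bytes and return 2068.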