Open SebastienDegand opened 1 year ago
Hi @SebastienDegand, thanks for reporting this. Given that the object is compressed, and that a stream should allow any encoding or compression, I think this is likely a JHOVE bug. The byte sequence should be legal, and it's JHOVE's token parsing logic that has been found wanting.
I have some PDF files that fail validation with this error:
"errorMessages": [ "Malformed cross reference stream (null)" ]
Unfortunately I can't share the PDF files, but after some debugging I found the problematic object. It's this one:
Here is the base64 of the object, in case it helps: 931_b64.txt
I don't really understand what this object is, but the compressed stream ends with the bytes CR and LF. However, in JHOVE's Tokenizer code I found this (and execution does go through it):
edu.harvard.hul.ois.jhove.module.pdf.Tokenizer
So it removes the two characters CR and LF, but the CR character belongs to the compressed data, so zlib decompression fails because the stream is incomplete. I ran a test: zlib can decompress the data when the CR character is included, but not without it. Also, the object header declares "R/Length 2069", but with the CR character removed the length is 2068.
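To make the failure easier to reproduce without the original file, here is a minimal sketch in Python. It is not JHOVE's code: `strip_eol_then_cut` is a hypothetical stand-in for the behaviour described above, `cut_by_length` is a hypothetical reader that trusts the declared length, and the payload is synthetic data padded so that its zlib output happens to end in CR.

```python
import zlib

CR = 0x0D

def payload_with_cr_tail(base: bytes) -> bytes:
    """Append one padding byte so that the zlib output of the result ends in CR.

    The last byte of a zlib stream is the low byte of the Adler-32 checksum,
    which shifts by one for each increment of the padding byte.
    """
    for pad in range(256):
        candidate = base + bytes([pad])
        if zlib.compress(candidate)[-1] == CR:
            return candidate
    raise ValueError("no padding byte produced a trailing CR")

def strip_eol_then_cut(raw: bytes) -> bytes:
    """Hypothetical tokenizer: cut at 'endstream', then drop a preceding EOL."""
    data = raw[: raw.rfind(b"endstream")]
    if data.endswith(b"\r\n"):
        return data[:-2]  # also removes a CR that was really stream data
    if data.endswith((b"\r", b"\n")):
        return data[:-1]
    return data

def cut_by_length(raw: bytes, length: int) -> bytes:
    """Hypothetical reader that takes exactly the declared number of bytes."""
    return raw[:length]

payload = payload_with_cr_tail(b"cross-reference stream data ")
compressed = zlib.compress(payload)

# Stream data ending in CR, followed by a single LF end-of-line marker.
raw = compressed + b"\nendstream"

by_length = cut_by_length(raw, len(compressed))
by_eol = strip_eol_then_cut(raw)

print("length-based read round-trips:", zlib.decompress(by_length) == payload)
try:
    zlib.decompress(by_eol)
except zlib.error as exc:
    print("EOL-stripped read fails:", exc)
```

In this sketch the EOL-stripped read comes out one byte shorter than the declared length, which mirrors the 2069 versus 2068 mismatch in my file.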
Is this a limitation of JHOVE? A bug? Or is this something that should not happen in a PDF?
Thanks