openpreserve / jhove

File validation and characterisation.
http://jhove.openpreservation.org

PDF-hul: NegativeArraySizeException in PDF trailer due to CR as newline #935

Open matthias-fratz-bsz opened 1 month ago

matthias-fratz-bsz commented 1 month ago

So, another weird problem with PDF trailers that I cannot find an existing issue for... We have several files that trigger a NegativeArraySizeException like so:

java.lang.NegativeArraySizeException: -1
    at edu.harvard.hul.ois.jhove.module.pdf.Stream.initRead(Stream.java:111)
    at edu.harvard.hul.ois.jhove.module.pdf.CrossRefStream.initRead(CrossRefStream.java:161)
    at edu.harvard.hul.ois.jhove.module.PdfModule.readXRefStreams(PdfModule.java:1452)
    at edu.harvard.hul.ois.jhove.module.PdfModule.readXRefInfo(PdfModule.java:1429)
    at edu.harvard.hul.ois.jhove.module.PdfModule.parse(PdfModule.java:823)
    at edu.harvard.hul.ois.jhove.JhoveBase.processFile(JhoveBase.java:782)
    at edu.harvard.hul.ois.jhove.JhoveBase.process(JhoveBase.java:567)
    at edu.harvard.hul.ois.jhove.JhoveBase.dispatch(JhoveBase.java:439)
    at Jhove.main(Jhove.java:295)

jhove-cr-trailer.zip contains a synthetic example that's based on one of the offending PDFs, but with all the copyrighted stuff (well, most of the file actually) removed. exception.pdf causes the aforementioned exception, while just-invalid.pdf reports a missing document catalog. Well, I did remove most of the file, so that is correct behavior. The original PDFs are valid and can be viewed in PDF readers, but trigger the same exception in JHove.

The difference between those two files is a single byte: the newline used after the stream keyword that introduces the XRef stream. Having just a CR there causes the exception; LF doesn't, and CRLF also doesn't. Not sure whether the PDF spec says that just a CR is valid or not, but it probably shouldn't cause an exception.
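For reference, ISO 32000-1 (section 7.3.8.1) requires the stream keyword to be followed by CRLF or a single LF, and explicitly not by a CR alone, so the offending files are technically non-compliant at that byte. A parser can still handle the case without throwing. The following is a minimal sketch of such tolerant end-of-line handling, not JHove's actual code; the class StreamEol and the method skipEol are hypothetical names made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;

/** Hypothetical helper (not part of JHove): skips the end-of-line marker after "stream". */
public class StreamEol {

    /**
     * Consumes CRLF, LF, or a bare CR and returns the number of bytes consumed.
     * Returns 0 and consumes nothing if the next byte is not an end-of-line character.
     */
    public static int skipEol(PushbackInputStream in) throws IOException {
        int b = in.read();
        if (b == '\n') {
            return 1;                 // LF
        }
        if (b == '\r') {
            int next = in.read();
            if (next == '\n') {
                return 2;             // CRLF
            }
            if (next != -1) {
                in.unread(next);      // bare CR: the following byte is stream data
            }
            return 1;
        }
        if (b != -1) {
            in.unread(b);             // no end-of-line marker at all
        }
        return 0;
    }
}
```

One caveat with accepting a bare CR: if the stream data itself happens to start with an LF byte, CR plus data is indistinguishable from CRLF, so a validator would probably want to report the bare CR as an error in addition to recovering from it.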

Tested against JHove 1.12 + PDF-hul 1.11, and also against commit c45fd1cc, which seems to be the latest as of today.

matthias-fratz-bsz commented 1 month ago

I seem to keep finding more weird behavior when PDF trailers aren't specification-compliant. There's also an ArrayIndexOutOfBoundsException when /Size is too small (but only sometimes), and I managed to get JHove into what looks like an infinite loop by accident.

Should I keep reporting those? (Probably in this bug because they are somewhat related.) Is the infinite loop considered a security issue? It's only a DoS, but extremely hard to avoid when using the Java API.

carlwilson commented 3 weeks ago

Hi @matthias-fratz-bsz, do keep reporting them. I suspect that they are related, but more info makes my work easier. I am planning a pass at this in the next couple of months.

matthias-fratz-bsz commented 3 weeks ago

So here we go. aioobe.pdf triggers an ArrayIndexOutOfBoundsException:

java.lang.ArrayIndexOutOfBoundsException: Index 2 out of bounds for length 2
    at edu.harvard.hul.ois.jhove.module.pdf.CrossRefStream.readNextObject(CrossRefStream.java:244)
    at edu.harvard.hul.ois.jhove.module.PdfModule.readXRefStreams(PdfModule.java:1466)
    at edu.harvard.hul.ois.jhove.module.PdfModule.readXRefInfo(PdfModule.java:1429)
    at edu.harvard.hul.ois.jhove.module.PdfModule.parse(PdfModule.java:823)
    at edu.harvard.hul.ois.jhove.JhoveBase.processFile(JhoveBase.java:782)
    at edu.harvard.hul.ois.jhove.JhoveBase.process(JhoveBase.java:567)
    at edu.harvard.hul.ois.jhove.JhoveBase.dispatch(JhoveBase.java:439)
    at Jhove.main(Jhove.java:295)

As far as I can tell, it isn't related to /Size as I originally thought. It is more likely caused by the stream being longer than what /Index would suggest. Weirdly, that happens regardless of /Length: I can set /Length to 0 and it still causes the same exception.
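To illustrate the kind of consistency check that would turn this into a validation error instead of an exception: the /Index array holds (first object number, count) pairs, and /W holds the byte width of each field in an entry, so the decoded stream should contain exactly (sum of counts) times (sum of widths) bytes. The sketch below is an assumption about what such a check could look like, written without having inspected what CrossRefStream.java actually does at line 244; XRefIndexCheck and its methods are hypothetical names.

```java
/** Hypothetical consistency check (not part of JHove) for cross-reference streams. */
public class XRefIndexCheck {

    /** Sums the entry counts in an /Index array of (firstObject, count) pairs. */
    public static long expectedEntries(int[] index) {
        if (index.length % 2 != 0) {
            throw new IllegalArgumentException("/Index must contain pairs");
        }
        long total = 0;
        for (int i = 1; i < index.length; i += 2) {
            if (index[i] < 0) {
                throw new IllegalArgumentException("Negative count in /Index");
            }
            total += index[i];
        }
        return total;
    }

    /** Returns true if the decoded data has exactly the size that /Index and /W imply. */
    public static boolean lengthMatches(int[] index, int[] w, long decodedLength) {
        long entrySize = 0;
        for (int width : w) {
            entrySize += width;
        }
        return decodedLength == expectedEntries(index) * entrySize;
    }
}
```

With /Index [0 3] and /W [1 2 1], for example, the decoded data is expected to be 12 bytes long; anything longer would mean the stream has more entries than /Index accounts for, which could be reported as malformed before reading past the end of the /Index array.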

matthias-fratz-bsz commented 3 weeks ago

The second one, loop.pdf, causes JHove to hang for longer than I have the patience to wait. The cross-reference object has a cyclic reference, so presumably JHove gets stuck in an infinite loop.

More precisely, that file's cross-reference object is incomplete, and its /Prev entry points to an offset that will end up parsing the same cross-reference object again. (Not actually the same offset as the XRef object, mostly because that's the configuration that I created by accident when I found the issue.) Interestingly, JHove neither exceeds the stack size nor does it keep allocating more memory, so the code must be pretty well optimized there... it just doesn't check whether it has already read an object from that offset.
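The missing check could be as simple as remembering every offset that has already been visited while following /Prev. This is a minimal sketch under the assumption that the chain is walked iteratively; PrevChain, walk, and the prevOf map (which stands in for actually parsing the cross-reference section at an offset) are hypothetical and not taken from JHove.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical cycle guard (not part of JHove) for chains of /Prev offsets. */
public class PrevChain {

    /**
     * Follows /Prev offsets starting at startOffset and returns them in visiting order.
     * prevOf maps the offset of a cross-reference section to its /Prev offset;
     * a missing key means that section has no /Prev entry.
     * Throws IllegalStateException if an offset is visited twice.
     */
    public static List<Long> walk(long startOffset, Map<Long, Long> prevOf) {
        Set<Long> seen = new HashSet<>();
        List<Long> order = new ArrayList<>();
        Long offset = startOffset;
        while (offset != null) {
            if (!seen.add(offset)) {
                throw new IllegalStateException("Cyclic /Prev chain at offset " + offset);
            }
            order.add(offset);
            offset = prevOf.get(offset);
        }
        return order;
    }
}
```

This would also catch the situation in loop.pdf, where /Prev does not point at the XRef object's own offset: the first jump lands on a new offset, but the second jump lands on that same offset again and is rejected.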

PDF readers should report something along the lines of "Failed to read the document catalog". The one I tried (Evince) doesn't get stuck in a loop. Unlike the other files, loop.pdf probably cannot be turned into a working PDF that can be opened and "looks normal", because it will necessarily have an incomplete cross-reference stream.