openpreserve / jhove

File validation and characterisation.
http://jhove.openpreservation.org

PDF-hul: NullPointerException with weird escaped characters in PDF trailer #876

Open matthias-fratz-bsz opened 11 months ago

matthias-fratz-bsz commented 11 months ago

We have several files that trigger the following NullPointerException:

java.lang.NullPointerException: Cannot invoke "edu.harvard.hul.ois.jhove.module.pdf.Token.isSimpleToken()" because "tok" is null
    at edu.harvard.hul.ois.jhove.module.pdf.Parser.readObject(Parser.java:287)
    at edu.harvard.hul.ois.jhove.module.pdf.Parser.readArray(Parser.java:304)
    at edu.harvard.hul.ois.jhove.module.pdf.Parser.readObject(Parser.java:275)
    at edu.harvard.hul.ois.jhove.module.pdf.Parser.readDictionary(Parser.java:340)
    at edu.harvard.hul.ois.jhove.module.PdfModule.parseTrailer(PdfModule.java:1322)
    at edu.harvard.hul.ois.jhove.module.PdfModule.parse(PdfModule.java:820)
    at edu.harvard.hul.ois.jhove.JhoveBase.processFile(JhoveBase.java:782)
    at edu.harvard.hul.ois.jhove.JhoveBase.process(JhoveBase.java:567)
    at edu.harvard.hul.ois.jhove.JhoveBase.dispatch(JhoveBase.java:439)
    at Jhove.main(Jhove.java:295)

I cannot share the offending files, but jhove_npe.zip is a synthetic example I made that also triggers the NPE. It seems to be related to escaped characters in the /ID entry of the file's trailer dictionary: \376\377\377\377 causes the NPE, while \377\377\377\377 reports "Valid and well-formed". Other combinations of \3xx escapes either trigger the NPE or don't; I was unable to investigate this further.
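To make the two byte sequences above concrete, here is a minimal Java sketch that decodes the octal escapes of a PDF literal string. The class `PdfLiteralEscapes` and its methods are hypothetical helpers written for illustration; they are not part of JHOVE. The sketch only shows that \376\377 decodes to the bytes 0xFE 0xFF, which happen to be the UTF-16BE byte order mark, while \377\377 does not. Whether that marker is what sends PDF-hul down the failing path is an assumption that this thread does not confirm.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper, NOT part of JHOVE: decodes the backslash escapes in the
// body of a PDF literal string (the text between the parentheses).
// Simplification: backslash line continuations are not handled.
final class PdfLiteralEscapes {

    static byte[] decode(String body) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < body.length()) {
            char c = body.charAt(i++);
            if (c != '\\' || i == body.length()) {
                out.write(c); // ordinary character; only the low 8 bits are kept
                continue;
            }
            char e = body.charAt(i);
            if (e >= '0' && e <= '7') {
                // Octal escape: one to three octal digits, overflow above 8 bits dropped.
                int value = 0;
                int digits = 0;
                while (digits < 3 && i < body.length()
                        && body.charAt(i) >= '0' && body.charAt(i) <= '7') {
                    value = value * 8 + (body.charAt(i) - '0');
                    i++;
                    digits++;
                }
                out.write(value & 0xFF);
            } else {
                i++;
                switch (e) {
                    case 'n': out.write('\n'); break;
                    case 'r': out.write('\r'); break;
                    case 't': out.write('\t'); break;
                    case 'b': out.write('\b'); break;
                    case 'f': out.write('\f'); break;
                    default:  out.write(e);    // \( \) \\ and unknown escapes
                }
            }
        }
        return out.toByteArray();
    }

    // True if the decoded bytes begin with 0xFE 0xFF (the UTF-16BE byte order mark).
    static boolean startsWithUtf16BeBom(byte[] b) {
        return b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF;
    }
}
```

With this sketch, `PdfLiteralEscapes.decode("\\376\\377\\377\\377")` yields the four bytes FE FF FF FF and `startsWithUtf16BeBom` returns true for it, while the \377\377\377\377 variant yields FF FF FF FF and returns false. That difference is one place a maintainer might start looking.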

matthias-fratz-bsz commented 11 months ago

The original example only triggers the NPE on JHOVE 1.20.0 with an old version of PDF-hul. Sorry, my fault for not testing against the latest release.

Anyway, jhove_npe_1224.zip is an updated example that also causes the NPE on JHOVE 1.28.0 with PDF-hul 1.12.4. The ID is taken from the original file. I'm not sure why it was written like that (a hex string would have been shorter), but it seems to be valid according to the PDF standard.
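The size remark can be illustrated with a small Java sketch that writes the same bytes both ways. The class `PdfStringForms` is a hypothetical helper, not JHOVE code, and the four bytes used are the ones from the first synthetic example, since the actual ID from the original file is not shown in this thread. The sketch also assumes every byte is written as an octal escape, as a writer would have to do for non-printable bytes.

```java
// Hypothetical helper, NOT part of JHOVE: renders the same bytes as a PDF
// hex string and as a PDF literal string with octal escapes.
final class PdfStringForms {

    // Hex string form: two hex digits per byte, wrapped in angle brackets.
    static String toHexString(byte[] bytes) {
        StringBuilder sb = new StringBuilder("<");
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.append('>').toString();
    }

    // Literal string form, escaping EVERY byte as a three-digit octal escape
    // (an assumption for illustration; printable bytes need no escaping).
    static String toOctalLiteral(byte[] bytes) {
        StringBuilder sb = new StringBuilder("(");
        for (byte b : bytes) {
            sb.append(String.format("\\%03o", b & 0xFF));
        }
        return sb.append(')').toString();
    }
}
```

For the bytes FE FF FF FF, `toHexString` gives `<FEFFFFFF>` (10 characters) and `toOctalLiteral` gives `(\376\377\377\377)` (18 characters). Both are legal ways to write a string object in PDF, so a parser should accept either.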

carlwilson commented 10 months ago

Thanks for the report. There are a couple of issues that are similar. The pointers and examples you've given will help us to track this down, I think.