openpreserve / jhove

File validation and characterisation.
http://jhove.openpreservation.org

PDF-hul: NullPointerException with weird escaped characters in PDF trailer #876

Open matthias-fratz-bsz opened 1 year ago

matthias-fratz-bsz commented 1 year ago

We have several files that trigger the following NullPointerException:

java.lang.NullPointerException: Cannot invoke "edu.harvard.hul.ois.jhove.module.pdf.Token.isSimpleToken()" because "tok" is null
    at edu.harvard.hul.ois.jhove.module.pdf.Parser.readObject(Parser.java:287)
    at edu.harvard.hul.ois.jhove.module.pdf.Parser.readArray(Parser.java:304)
    at edu.harvard.hul.ois.jhove.module.pdf.Parser.readObject(Parser.java:275)
    at edu.harvard.hul.ois.jhove.module.pdf.Parser.readDictionary(Parser.java:340)
    at edu.harvard.hul.ois.jhove.module.PdfModule.parseTrailer(PdfModule.java:1322)
    at edu.harvard.hul.ois.jhove.module.PdfModule.parse(PdfModule.java:820)
    at edu.harvard.hul.ois.jhove.JhoveBase.processFile(JhoveBase.java:782)
    at edu.harvard.hul.ois.jhove.JhoveBase.process(JhoveBase.java:567)
    at edu.harvard.hul.ois.jhove.JhoveBase.dispatch(JhoveBase.java:439)
    at Jhove.main(Jhove.java:295)

I cannot share the offending files, but jhove_npe.zip is a synthetic example I made that also triggers the NPE. It seems to be related to escaped characters in the /ID entry of the file's trailer dictionary: \376\377\377\377 causes the NPE, while \377\377\377\377 is reported as "Valid and well-formed". Various other combinations around \3xx either work or fail; I was unable to investigate this further.
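
For reference, the two escape sequences above differ only in their first byte. The following is a minimal sketch, not JHOVE's own tokenizer, that decodes just the \ddd octal escapes of a PDF literal string and prints the resulting bytes as hex. The class and method names (PdfOctalEscapes, decodeOctalEscapes, toHex) are hypothetical and chosen for illustration only.

```java
import java.io.ByteArrayOutputStream;

// Illustrative sketch only: decodes \ddd octal escapes as found in PDF
// literal strings. It is NOT the PDF-hul tokenizer and ignores all other
// escape forms (\n, \(, \\, line continuations, and so on).
public class PdfOctalEscapes {

    public static boolean isOctal(char c) {
        return c >= '0' && c <= '7';
    }

    /** Replaces each backslash followed by 1 to 3 octal digits with one byte. */
    public static byte[] decodeOctalEscapes(String body) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < body.length()) {
            char c = body.charAt(i);
            if (c == '\\' && i + 1 < body.length() && isOctal(body.charAt(i + 1))) {
                int value = 0;
                int digits = 0;
                i++; // skip the backslash
                while (i < body.length() && digits < 3 && isOctal(body.charAt(i))) {
                    value = value * 8 + (body.charAt(i) - '0');
                    i++;
                    digits++;
                }
                out.write(value & 0xFF); // keep only the low-order 8 bits
            } else {
                out.write(c); // any other character is copied unchanged
                i++;
            }
        }
        return out.toByteArray();
    }

    public static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The sequence reported to trigger the NPE.
        System.out.println(toHex(decodeOctalEscapes("\\376\\377\\377\\377")));
        // The sequence reported as "Valid and well-formed".
        System.out.println(toHex(decodeOctalEscapes("\\377\\377\\377\\377")));
    }
}
```

The failing sequence decodes to the bytes FE FF FF FF and the passing one to FF FF FF FF. FE FF is also the UTF-16BE byte order mark that PDF uses to flag Unicode text strings; whether that is connected to the NPE has not been established in this thread.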

matthias-fratz-bsz commented 1 year ago

The original example only triggers the NPE on JHOVE 1.20.0 with an older version of PDF-hul. Sorry, my fault for not testing against the latest release...

Anyway, jhove_npe_1224.zip is an updated version that also causes the NPE on JHOVE 1.28.0 with PDF-hul 1.12.4. The ID is taken from the original file. I'm not sure why it was written like that (a hex string would have been shorter), but it seems to be valid according to the PDF standard.
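
To illustrate the remark about length, here is a small sketch that writes the same bytes once as a literal string with octal escapes and once as a hex string. The actual ID from the attached file is not quoted in this thread, so the four bytes FE FF FF FF from the first synthetic example are used as a stand-in; the class and method names (PdfStringForms, toOctalLiteral, toHexString) are hypothetical.

```java
// Illustrative sketch only: renders a byte sequence in the two PDF string
// syntaxes. The sample bytes are a stand-in, not the ID from the real file.
public class PdfStringForms {

    /** Literal string form, every byte written as a three-digit octal escape. */
    public static String toOctalLiteral(byte[] bytes) {
        StringBuilder sb = new StringBuilder("(");
        for (byte b : bytes) {
            sb.append(String.format("\\%03o", b & 0xFF));
        }
        return sb.append(")").toString();
    }

    /** Hex string form, two hex digits per byte inside angle brackets. */
    public static String toHexString(byte[] bytes) {
        StringBuilder sb = new StringBuilder("<");
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.append(">").toString();
    }

    public static void main(String[] args) {
        byte[] sample = {(byte) 0xFE, (byte) 0xFF, (byte) 0xFF, (byte) 0xFF};
        String literal = toOctalLiteral(sample);
        String hex = toHexString(sample);
        System.out.println(literal + " : " + literal.length() + " characters");
        System.out.println(hex + " : " + hex.length() + " characters");
    }
}
```

With every byte escaped, the literal form needs four characters per byte against two for the hex form, and both spellings denote the same string object, so a parser should accept either.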

carlwilson commented 1 year ago

Thanks for the report. There are a couple of issues that are similar. The pointers and examples you've given will help us to track this down, I think.