openpreserve / jhove

File validation and characterisation.
http://jhove.openpreservation.org

PDF-hul: ArrayIndexOutOfBoundsException #248

Open RussellMcOrmond opened 7 years ago

RussellMcOrmond commented 7 years ago

Dev Effort

1D - investigation

Description

We have a large number of PDFs that cause a Java exception (java.lang.ArrayIndexOutOfBoundsException) when JHOVE attempts to validate them. An example can be downloaded from: http://gac.canadiana.ca/view/ooe.b4222507_008 (the Download PDF button is beside the image resize - + buttons).

russell@russell-desktop2:~/Downloads$ pdfinfo ooe.b4222507_008-document.pdf
Tagged:         no
UserProperties: no
Suspects:       no
Form:           none
JavaScript:     no
Pages:          16
Encrypted:      no
Page size:      635.3 x 815.05 pts
Page rot:       0
File size:      10342062 bytes
Optimized:      no
PDF version:    1.4
russell@russell-desktop2:~/Downloads$ identify ooe.b4222507_008-document.pdf
ooe.b4222507_008-document.pdf[0] PBM 635x815 635x815+0+0 16-bit Bilevel Gray 65.3KB 0.010u 0:00.009
ooe.b4222507_008-document.pdf[1] PBM 626x819 626x819+0+0 16-bit Bilevel Gray 65.3KB 0.010u 0:00.009
ooe.b4222507_008-document.pdf[2] PBM 635x815 635x815+0+0 16-bit Bilevel Gray 65.3KB 0.010u 0:00.009
ooe.b4222507_008-document.pdf[3] PBM 626x819 626x819+0+0 16-bit Bilevel Gray 65.3KB 0.010u 0:00.009
ooe.b4222507_008-document.pdf[4] PBM 635x815 635x815+0+0 16-bit Bilevel Gray 65.3KB 0.010u 0:00.009
ooe.b4222507_008-document.pdf[5] PBM 626x819 626x819+0+0 16-bit Bilevel Gray 65.3KB 0.010u 0:00.009
ooe.b4222507_008-document.pdf[6] PBM 645x815 645x815+0+0 16-bit Bilevel Gray 65.3KB 0.010u 0:00.009
ooe.b4222507_008-document.pdf[7] PBM 626x819 626x819+0+0 16-bit Bilevel Gray 65.3KB 0.000u 0:00.009
ooe.b4222507_008-document.pdf[8] PBM 633x822 633x822+0+0 16-bit Bilevel Gray 65.3KB 0.000u 0:00.009
ooe.b4222507_008-document.pdf[9] PBM 626x819 626x819+0+0 16-bit Bilevel Gray 65.3KB 0.000u 0:00.000
ooe.b4222507_008-document.pdf[10] PBM 607x813 607x813+0+0 16-bit Bilevel Gray 65.3KB 0.000u 0:00.000
ooe.b4222507_008-document.pdf[11] PBM 626x819 626x819+0+0 16-bit Bilevel Gray 65.3KB 0.000u 0:00.000
ooe.b4222507_008-document.pdf[12] PBM 607x813 607x813+0+0 16-bit Bilevel Gray 65.3KB 0.000u 0:00.000
ooe.b4222507_008-document.pdf[13] PBM 626x819 626x819+0+0 16-bit Bilevel Gray 65.3KB 0.000u 0:00.000
ooe.b4222507_008-document.pdf[14] PBM 607x815 607x815+0+0 16-bit Bilevel Gray 65.3KB 0.000u 0:00.000
ooe.b4222507_008-document.pdf[15] PBM 626x819 626x819+0+0 16-bit Bilevel Gray 65.3KB 0.000u 0:00.000
russell@russell-desktop2:~/Downloads$ /opt/jhove/jhove ooe.b4222507_008-document.pdf
java.lang.ArrayIndexOutOfBoundsException: 710
    at edu.harvard.hul.ois.jhove.module.PdfModule.getObject(PdfModule.java:2398)
    at edu.harvard.hul.ois.jhove.module.PdfModule.resolveIndirectObject(PdfModule.java:2377)
    at edu.harvard.hul.ois.jhove.module.PdfModule.readDocCatalogDict(PdfModule.java:1344)
    at edu.harvard.hul.ois.jhove.module.PdfModule.parse(PdfModule.java:521)
    at edu.harvard.hul.ois.jhove.JhoveBase.processFile(JhoveBase.java:803)
    at edu.harvard.hul.ois.jhove.JhoveBase.process(JhoveBase.java:605)
    at edu.harvard.hul.ois.jhove.JhoveBase.dispatch(JhoveBase.java:455)
    at Jhove.main(Jhove.java:292)
Jhove (Rel. 1.16.6, 2017-04-27)
 Date: 2017-05-01 11:47:40 EDT
 RepresentationInformation: ooe.b4222507_008-document.pdf
  ReportingModule: BYTESTREAM, Rel. 1.3 (2007-04-10)
  LastModified: 2017-05-01 11:40:33 EDT
  Size: 10342062
  Format: bytestream
  Status: Well-Formed and valid
  SignatureMatches:
   PDF-hul
   WARC-kb
  MIMEtype: application/octet-stream
russell@russell-desktop2:~/Downloads$ /opt/jhove/jhove -m PDF-hul ooe.b4222507_008-document.pdf
java.lang.ArrayIndexOutOfBoundsException: 710
    at edu.harvard.hul.ois.jhove.module.PdfModule.getObject(PdfModule.java:2398)
    at edu.harvard.hul.ois.jhove.module.PdfModule.resolveIndirectObject(PdfModule.java:2377)
    at edu.harvard.hul.ois.jhove.module.PdfModule.readDocCatalogDict(PdfModule.java:1344)
    at edu.harvard.hul.ois.jhove.module.PdfModule.parse(PdfModule.java:521)
    at edu.harvard.hul.ois.jhove.JhoveBase.processFile(JhoveBase.java:803)
    at edu.harvard.hul.ois.jhove.JhoveBase.process(JhoveBase.java:588)
    at edu.harvard.hul.ois.jhove.JhoveBase.dispatch(JhoveBase.java:455)
    at Jhove.main(Jhove.java:292)
Jhove (Rel. 1.16.6, 2017-04-27)
 Date: 2017-05-01 11:48:01 EDT
 RepresentationInformation: ooe.b4222507_008-document.pdf
  ReportingModule: PDF-hul, Rel. 1.8 (2017-03-14)
  LastModified: 2017-05-01 11:40:33 EDT
  Size: 10342062
  Format: PDF
  Status: Not well-formed
  SignatureMatches:
   PDF-hul
  ErrorMessage: 585
   Offset: 10339461
  ErrorMessage: No document catalog dictionary
   Offset: 0
  MIMEtype: application/pdf
russell@russell-desktop2:~/Downloads$ 

Note: pdfinfo is from poppler-utils, and identify is from ImageMagick. identify has to render every PDF page to an image in order to report on it, so its output above serves as a check that the file renders. The PDF files in question also render in all the PDF viewers we have tested.
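
For readers unfamiliar with this exception: `ArrayIndexOutOfBoundsException: 710` in the traces above means index 710 was used on an array with fewer than 711 elements, and the `getObject` / `resolveIndirectObject` frames suggest the index came from an indirect object reference in the file. The sketch below is not JHOVE's code. `XrefLookup` and `offsetOf` are hypothetical names, and the assumption that the array holds cross-reference offsets indexed by object number is a guess. It only illustrates how a range check lets a parser report a bad reference instead of throwing an unchecked exception.

```java
import java.util.OptionalLong;

/** Hypothetical illustration only; not taken from JHOVE's PdfModule. */
public class XrefLookup {
    // Byte offsets indexed by object number, as a parser might build them
    // from a cross-reference table (assumed layout).
    private final long[] offsets;

    public XrefLookup(long[] offsets) {
        this.offsets = offsets.clone();
    }

    /** Returns the offset for an object number, or empty if the number is out of range. */
    public OptionalLong offsetOf(int objectNumber) {
        if (objectNumber < 0 || objectNumber >= offsets.length) {
            return OptionalLong.empty(); // caller can report "reference to undefined object"
        }
        return OptionalLong.of(offsets[objectNumber]);
    }

    public static void main(String[] args) {
        XrefLookup xref = new XrefLookup(new long[710]); // valid indices are 0..709
        System.out.println(xref.offsetOf(709).isPresent()); // prints "true"
        System.out.println(xref.offsetOf(710).isPresent()); // prints "false"
    }
}
```

Returning an empty value rather than throwing leaves it to the caller to decide whether a dangling reference makes the file not well-formed or can be treated as a missing object.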

The issue was also discussed on the JHOVE mailing list. A couple thousand PDF files in our repository produce a similar report and may be affected by the same issue.

If the problem turns out to be with the PDF file rather than JHOVE, could someone with more knowledge of the PDF file format document how it is broken, so that a report can be sent to https://poppler.freedesktop.org/ (and possibly to other projects; I haven't checked which tools generated all the PDF files that JHOVE is flagging)?
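
As a possible starting point for that investigation, the sketch below prints the `/Size` and `startxref` values found near the end of a PDF. The PDF specification defines the trailer's `/Size` as one greater than the highest object number in the file, so comparing it with the failing index (710) might help show whether the file references an object beyond its declared range. This is an assumption about where to look, not a diagnosis. `TrailerPeek` and its methods are hypothetical names, and the code assumes a classic trailer within the last 4096 bytes; it does not handle cross-reference streams or incremental updates properly.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical diagnostic helper; not part of JHOVE or poppler. */
public class TrailerPeek {
    public static final Pattern SIZE = Pattern.compile("/Size\\s+(\\d+)");
    public static final Pattern STARTXREF = Pattern.compile("startxref\\s+(\\d+)");

    /** Returns the number captured by the last match of the pattern, if any. */
    public static OptionalLong lastNumber(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        OptionalLong result = OptionalLong.empty();
        while (m.find()) {
            result = OptionalLong.of(Long.parseLong(m.group(1)));
        }
        return result;
    }

    /** Reads up to the final maxBytes of a file as ISO-8859-1 text (one char per byte). */
    public static String readTail(String path, int maxBytes) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            long length = file.length();
            int n = (int) Math.min(length, (long) maxBytes);
            byte[] buffer = new byte[n];
            file.seek(length - n);
            file.readFully(buffer);
            return new String(buffer, StandardCharsets.ISO_8859_1);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java TrailerPeek file.pdf");
            return;
        }
        String tail = readTail(args[0], 4096);
        System.out.println("/Size:     " + lastNumber(SIZE, tail));
        System.out.println("startxref: " + lastNumber(STARTXREF, tail));
    }
}
```

Running it as `java TrailerPeek ooe.b4222507_008-document.pdf` would print both values, or `OptionalLong.empty` where no match is found in the tail.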

MartinSpeller commented 4 years ago

PDF-hul: ArrayIndexOutOfBoundsException #248 - Assigned to TBA