openpreserve / jhove

File validation and characterisation.
http://jhove.openpreservation.org
Other
168 stars 79 forks source link

Runs out of java heap space while JHOVE process a specific pdf against tag profiles #115

Open carlwilson opened 8 years ago

carlwilson commented 8 years ago

Dev Effort

1D

Description

While JHOVE processes this pdf, http://www.fcla.edu/daitss-test/files/01471-213X-12-33-S2.pdf, it runs out of all JAVA heap space. Is there an infinite loop during tag profile checking?

./jhove -c conf/jhove.conf -m pdf-hul ~/Workspace/describe/01471-213X-12-33-S2.pdf 
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.lang.AbstractStringBuilder.(AbstractStringBuilder.java:45)
at java.lang.StringBuilder.(StringBuilder.java:68)
at edu.harvard.hul.ois.jhove.module.pdf.Tokenizer.getNext(Unknown Source)
at edu.harvard.hul.ois.jhove.module.pdf.Parser.getNext(Unknown Source)
at edu.harvard.hul.ois.jhove.module.pdf.Parser.getNext(Unknown Source)
at edu.harvard.hul.ois.jhove.module.pdf.Parser.readObject(Unknown Source)
at edu.harvard.hul.ois.jhove.module.pdf.Parser.readArray(Unknown Source)
at edu.harvard.hul.ois.jhove.module.pdf.Parser.readObject(Unknown Source)
at edu.harvard.hul.ois.jhove.module.pdf.Parser.readDictionary(Unknown Source)
at edu.harvard.hul.ois.jhove.module.pdf.Parser.readObject(Unknown Source)
at edu.harvard.hul.ois.jhove.module.pdf.Parser.readObjectDef(Unknown Source)
at edu.harvard.hul.ois.jhove.module.pdf.Parser.readObjectDef(Unknown Source)
at edu.harvard.hul.ois.jhove.module.PdfModule.getObject(Unknown Source)
at edu.harvard.hul.ois.jhove.module.PdfModule.resolveIndirectObject(Unknown Source)
at edu.harvard.hul.ois.jhove.module.pdf.StructureElement.isStructElem(Unknown Source)
at edu.harvard.hul.ois.jhove.module.pdf.StructureElement.buildSubtree(Unknown Source)
at edu.harvard.hul.ois.jhove.module.pdf.StructureElement.buildSubtree(Unknown Source)

at edu.harvard.hul.ois.jhove.module.pdf.StructureElement.buildSubtree(Unknown Source)
at edu.harvard.hul.ois.jhove.module.pdf.StructureTree.getChildren(Unknown Source)
at edu.harvard.hul.ois.jhove.module.pdf.StructureTree.(Unknown Source)
at edu.harvard.hul.ois.jhove.module.pdf.TaggedProfile.satisfiesThisProfile(Unknown Source)
at edu.harvard.hul.ois.jhove.module.pdf.PdfProfile.satisfiesProfile(Unknown Source)
at edu.harvard.hul.ois.jhove.module.PdfModule.parse(Unknown Source)
at edu.harvard.hul.ois.jhove.JhoveBase.processFile(Unknown Source)
at edu.harvard.hul.ois.jhove.JhoveBase.process(Unknown Source)
at edu.harvard.hul.ois.jhove.JhoveBase.dispatch(Unknown Source)

Stefan Hein - 29-04-2013:

I have the same problem with the following pdf file: http://docserv.uni-duesseldorf.de/servlets/DerivateServlet/Derivate-25614 Are there any updates to that issue?

@gmcgath 29-04-2013:

I'm working on this issue. One user showed that with huge amounts of patience and memory, at least some PDF files that appear to be in an infinite loop are completed after several hours. The StructureTree object can take a huge amount of memory for some files, but once it's build only a couple of flags that were set during its construction are checked. This suggests that the whole tree doesn't have to be in memory at once. I hope to have a fix that takes this into account before too long. Thanks for the additional example. I'll use it in testing.

Stefan Hein 20-08-2016:

Unfortunately this reported bug is still existing in JHOVE 1.10.

RussellMcOrmond commented 7 years ago

"at least some PDF files that appear to be in an infinite loop are completed after several hours".

We may be seeing this problem as well. We are in the process of adopting JHOVE for use at Canadiana.org, and one of the first steps I'm taking is to run against our entire TDR. For most files it works as expected, but for the files for one particular depositor the validation of PDF files are taking "several hours" per file.

For instance, after 9 hours on one file JHOVE indicated:

  <reportingModule release="1.8" date="2017-03-14">PDF-hul</reportingModule>
  <lastModified>2016-09-13T15:05:52-04:00</lastModified>
  <size>226787298</size>
  <format>PDF</format>
  <version>1.4</version>
  <status>Well-Formed and valid</status>
  <sigMatch>
  <module>PDF-hul</module>
  </sigMatch>
  <messages>
   <message subMessage="Total fonts = 2085" severity="info">Too many fonts to report; some fonts omitted.</message>
  </messages>
  <mimeType>application/pdf</mimeType>
  <profiles>
   <profile>Tagged PDF</profile>
   <profile>ISO PDF/A-1, Level B</profile>
  </profiles>

This is a 376 page PDF generated by ABBYY Recognition Server.

If I use pdftk to burst into separate single-page files, it then takes 11 seconds to run JHOVE on the output directory (377 files, as it contains the doc_data.txt file).

A sample file can be downloaded from http://numeris.canadiana.ca/view/numeris.RD_1965_FA00_027/ (there is a PDF download link beside the zoom button).

RussellMcOrmond commented 7 years ago

161 may be related to this issue.

ross-spencer commented 4 years ago

Okay, so up and running with a dev environment for JHOVE, we're seeing the original PDF parsed pretty quickly now by JHOVE with no hanging.

I built the app from source. After installing it, I ran the following:

user@computer:~/jhove$ time ./jhove -m PDF-hul -h text samples/01471-213X-12-33-S2.pdf 

real    0m1.410s
user    0m3.450s
sys 0m0.175s

The story isn't as promising with the Koch PDF.

I ran:

time ./jhove -m PDF-hul -h text "samples/Koch_Notch Diss 060313 final.pdf" > output.txt

And let it run. I happened ot let it run for 2hrs, and during that time, an 80mb output file was generated and it kept going.

Inspecting the PDF file a little more, I noticed that there are 5853 images embedded in the file. Part of this is visible in the JHOVE report. The images can be extracted as follows:

pdfimages -all samples/Koch_Notch\ Diss\ 060313\ final.pdf images/i

That amounts to 34MB.

As noted. The JHOVE output for this file seems to work for the most part:

Jhove (Rel. 1.24.0, 2020-05-01)
 Date: 2020-04-30 21:45:35 EDT
 RepresentationInformation: samples/Koch_Notch Diss 060313 final.pdf
  ReportingModule: PDF-hul, Rel. 1.12.2 (2019-12-10)
  LastModified: 2020-04-30 20:19:52 EDT
  Size: 24289259
  Format: PDF
  Version: 1.4
  Status: Well-Formed and valid
  SignatureMatches:
   PDF-hul
  InfoMessage: Too many fonts to report; some fonts omitted.: Total fonts = 4352
   ID: PDF-HUL-136
  MIMEtype: application/pdf
  PDFMetadata: 
   Objects: 5636
   FreeObjects: 1
   IncrementalUpdates: 0
   DocumentCatalog: 
    ViewerPreferences: 
     HideToolbar: false
     HideMenubar: false
     HideWindowUI: false
     FitWindow: false
     CenterWindow: true
     DisplayDocTitle: false
     NonFullScreenPageMode: UseNone
     Direction: L2R
     ViewArea: CropBox
     ViewClip: CropBox
     PrintArea: CropBox
     PageClip: CropBox
    PageLayout: OneColumn
    PageMode: UseNone
   Info: 
    Title: notch
    Creator: Microsoft Office Word
    Producer: Solid PDF Creator  (7.1.879.0)
    CreationDate: Wed Mar 06 13:05:11 EST 2013
    ModDate: Wed Mar 06 13:09:45 EST 2013
   ID: 0x1d37ca3a97344e9ebc259f4b5c0e661f661f29, 0x1f129e36f63742a9a9a0aeca4377224d224d29
   Filters: 
    FilterPipeline: FlateDecode
    FilterPipeline: DCTDecode
   Images: 
    Image: 
     NisoImageMetadata: 
      FormatName: image/png
      ImageWidth: 354
      ImageHeight: 42
      ColorSpace: RGB
      BitsPerSample: 8
      BitsPerSampleUnit: integer
     Filter: FlateDecode
    Image: 
     NisoImageMetadata: 
      FormatName: image/png
      ImageWidth: 25
      ImageHeight: 1
      ColorSpace: RGB
      BitsPerSample: 8
      BitsPerSampleUnit: integer
     Filter: FlateDecode
    Image: 
     NisoImageMetadata: 
      FormatName: image/png
      ImageWidth: 354
      ImageHeight: 42
      ColorSpace: RGB
      BitsPerSample: 8
      BitsPerSampleUnit: integer
     Filter: FlateDecode

Noting that the NisoImageMetadata: section starts to repeat after a while.

In the large output file generated, if we work under the premise there will be a NisoImageMetadata: per image embedded in the PDF then I would anticipate ~5000 of those entries.

If we grep the output we can see that there lis likely a (probably) infinite loop we're running into when we start to output the image metadata:

cat err-output.txt | grep NisoImageMetadata | wc -l
405245

This will provide a good line of investigation to follow up on which I should be able to get to over the next few days.

carlwilson commented 4 years ago

Hi @ross-spencer finally got around to this, that's both a lot of images and a lot of lines. I'm going to mark your progress by removing the legacy tag on this. The issue's definitely with us in some form. Thanks :)

And I'm bumping the priority on consideration.

MartinSpeller commented 4 years ago

Runs out of java heap space while JHOVE process a specific pdf against tag profiles #115 - Assigned to ross-spencer

ross-spencer commented 4 years ago

Some Tika output:

ross-spencer commented 4 years ago

Slow progress on this, but I'm seeing some numbers add up. I don't think we're seeing an infinite loop, but I think there's a lot of data being duplicated between an outer loop and an inner loop.

Here we're entering this for loop with 214 pages of data that we iterate through: https://github.com/openpreserve/jhove/blob/8677ad043a59d93b0dbe949047ef064bc592bb08/jhove-modules/pdf-hul/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L2219-L2222

But, when we reach the inner loop to iterate over the PDF Dictionary object returned for XObject resources, we're seeing the PDF Dictionary has 5850 entries, so we end up generating 214 * 5850 entries which we're then going to ask JHove to eventually output in the report.

https://github.com/openpreserve/jhove/blob/8677ad043a59d93b0dbe949047ef064bc592bb08/jhove-modules/pdf-hul/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L2226-L2227

The number of pages in the document is indeed 214 and the number of embedded images is 5850. That equates to around 11267100 lines, which I estimate about 300 MB of data if we were to leave it running :nerd_face:

So, the next step is to trace this back to when the object is first populated, but I still need to walk through the code a bit more to get there.

RussellMcOrmond commented 4 years ago

Possibly related:

Where does JHove determine what stack/etc to use for the script that launches it?

root@447e0e9eca68:/opt/jhove# tail -1 /opt/jhove/jhove
java -Xss1024k -classpath "${CP}" Jhove -c "${CONFIG}" "${@}"
root@447e0e9eca68:/opt/jhove#

Is this a configurable setting where I missed the configuration option? I fairly regularly get "Out of memory!" errors from Java when trying to run JHove on PDF files.

jackdos commented 3 years ago

Hi @ross-spencer, I think we've hit this issue with a PDF that is generating 1.3GB of jhove xml output. As you have seen, there seems to be a "pages * entries" issue. I was just wondering whether there was any progress/idea of when this might get prioritised.

I'm not sure whether I can share the file, but I might be able to find some time to debug through some of the code with it, particularly as you seem to have narrowed down the likely scope of the problem.