Open carlwilson opened 8 years ago
"at least some PDF files that appear to be in an infinite loop are completed after several hours".
We may be seeing this problem as well. We are in the process of adopting JHOVE for use at Canadiana.org, and one of the first steps I'm taking is to run against our entire TDR. For most files it works as expected, but for the files for one particular depositor the validation of PDF files are taking "several hours" per file.
For instance, after 9 hours on one file JHOVE indicated:
<reportingModule release="1.8" date="2017-03-14">PDF-hul</reportingModule>
<lastModified>2016-09-13T15:05:52-04:00</lastModified>
<size>226787298</size>
<format>PDF</format>
<version>1.4</version>
<status>Well-Formed and valid</status>
<sigMatch>
<module>PDF-hul</module>
</sigMatch>
<messages>
<message subMessage="Total fonts = 2085" severity="info">Too many fonts to report; some fonts omitted.</message>
</messages>
<mimeType>application/pdf</mimeType>
<profiles>
<profile>Tagged PDF</profile>
<profile>ISO PDF/A-1, Level B</profile>
</profiles>
This is a 376 page PDF generated by ABBYY Recognition Server.
If I use pdftk to burst into separate single-page files, it then takes 11 seconds to run JHOVE on the output directory (377 files, as it contains the doc_data.txt file).
A sample file can be downloaded from http://numeris.canadiana.ca/view/numeris.RD_1965_FA00_027/ (there is a PDF download link beside the zoom button).
Okay, so up and running with a dev environment for JHOVE, we're seeing the original PDF parsed pretty quickly now by JHOVE with no hanging.
I built the app from source. After installing it, I ran the following:
user@computer:~/jhove$ time ./jhove -m PDF-hul -h text samples/01471-213X-12-33-S2.pdf
real 0m1.410s
user 0m3.450s
sys 0m0.175s
The story isn't as promising with the Koch PDF.
I ran:
time ./jhove -m PDF-hul -h text "samples/Koch_Notch Diss 060313 final.pdf" > output.txt
And let it run. I happened ot let it run for 2hrs, and during that time, an 80mb output file was generated and it kept going.
Inspecting the PDF file a little more, I noticed that there are 5853 images embedded in the file. Part of this is visible in the JHOVE report. The images can be extracted as follows:
pdfimages -all samples/Koch_Notch\ Diss\ 060313\ final.pdf images/i
That amounts to 34MB.
As noted. The JHOVE output for this file seems to work for the most part:
Jhove (Rel. 1.24.0, 2020-05-01)
Date: 2020-04-30 21:45:35 EDT
RepresentationInformation: samples/Koch_Notch Diss 060313 final.pdf
ReportingModule: PDF-hul, Rel. 1.12.2 (2019-12-10)
LastModified: 2020-04-30 20:19:52 EDT
Size: 24289259
Format: PDF
Version: 1.4
Status: Well-Formed and valid
SignatureMatches:
PDF-hul
InfoMessage: Too many fonts to report; some fonts omitted.: Total fonts = 4352
ID: PDF-HUL-136
MIMEtype: application/pdf
PDFMetadata:
Objects: 5636
FreeObjects: 1
IncrementalUpdates: 0
DocumentCatalog:
ViewerPreferences:
HideToolbar: false
HideMenubar: false
HideWindowUI: false
FitWindow: false
CenterWindow: true
DisplayDocTitle: false
NonFullScreenPageMode: UseNone
Direction: L2R
ViewArea: CropBox
ViewClip: CropBox
PrintArea: CropBox
PageClip: CropBox
PageLayout: OneColumn
PageMode: UseNone
Info:
Title: notch
Creator: Microsoft Office Word
Producer: Solid PDF Creator (7.1.879.0)
CreationDate: Wed Mar 06 13:05:11 EST 2013
ModDate: Wed Mar 06 13:09:45 EST 2013
ID: 0x1d37ca3a97344e9ebc259f4b5c0e661f661f29, 0x1f129e36f63742a9a9a0aeca4377224d224d29
Filters:
FilterPipeline: FlateDecode
FilterPipeline: DCTDecode
Images:
Image:
NisoImageMetadata:
FormatName: image/png
ImageWidth: 354
ImageHeight: 42
ColorSpace: RGB
BitsPerSample: 8
BitsPerSampleUnit: integer
Filter: FlateDecode
Image:
NisoImageMetadata:
FormatName: image/png
ImageWidth: 25
ImageHeight: 1
ColorSpace: RGB
BitsPerSample: 8
BitsPerSampleUnit: integer
Filter: FlateDecode
Image:
NisoImageMetadata:
FormatName: image/png
ImageWidth: 354
ImageHeight: 42
ColorSpace: RGB
BitsPerSample: 8
BitsPerSampleUnit: integer
Filter: FlateDecode
Noting that the NisoImageMetadata:
section starts to repeat after a while.
In the large output file generated, if we work under the premise there will be a NisoImageMetadata:
per image embedded in the PDF then I would anticipate ~5000 of those entries.
If we grep the output we can see that there lis likely a (probably) infinite loop we're running into when we start to output the image metadata:
cat err-output.txt | grep NisoImageMetadata | wc -l
405245
This will provide a good line of investigation to follow up on which I should be able to get to over the next few days.
Hi @ross-spencer finally got around to this, that's both a lot of images and a lot of lines. I'm going to mark your progress by removing the legacy tag on this. The issue's definitely with us in some form. Thanks :)
And I'm bumping the priority on consideration.
Some Tika output:
Slow progress on this, but I'm seeing some numbers add up. I don't think we're seeing an infinite loop, but I think there's a lot of data being duplicated between an outer loop and an inner loop.
Here we're entering this for loop with 214 pages of data that we iterate through: https://github.com/openpreserve/jhove/blob/8677ad043a59d93b0dbe949047ef064bc592bb08/jhove-modules/pdf-hul/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L2219-L2222
But, when we reach the inner loop to iterate over the PDF Dictionary object returned for XObject resources, we're seeing the PDF Dictionary has 5850 entries, so we end up generating 214 * 5850
entries which we're then going to ask JHove to eventually output in the report.
The number of pages in the document is indeed 214
and the number of embedded images is 5850
. That equates to around 11267100
lines, which I estimate about 300 MB of data if we were to leave it running :nerd_face:
So, the next step is to trace this back to when the object is first populated, but I still need to walk through the code a bit more to get there.
Possibly related:
Where does JHove determine what stack/etc to use for the script that launches it?
root@447e0e9eca68:/opt/jhove# tail -1 /opt/jhove/jhove
java -Xss1024k -classpath "${CP}" Jhove -c "${CONFIG}" "${@}"
root@447e0e9eca68:/opt/jhove#
Is this a configurable setting where I missed the configuration option? I fairly regularly get "Out of memory!" errors from Java when trying to run JHove on PDF files.
Hi @ross-spencer, I think we've hit this issue with a PDF that is generating 1.3GB of jhove xml output. As you have seen, there seems to be a "pages * entries" issue. I was just wondering whether there was any progress/idea of when this might get prioritised.
I'm not sure whether I can share the file, but I might be able to find some time to debug through some of the code with it, particularly as you seem to have narrowed down the likely scope of the problem.
Dev Effort
1D
Description
While JHOVE processes this pdf, http://www.fcla.edu/daitss-test/files/01471-213X-12-33-S2.pdf, it runs out of all JAVA heap space. Is there an infinite loop during tag profile checking?
Stefan Hein - 29-04-2013:
@gmcgath 29-04-2013:
Stefan Hein 20-08-2016: