petermr / ami3

Integration of cephis and normami code into a single base. Tests will be slimmed down
Apache License 2.0

We need a try/catch for bad pdf files #52

Closed bauhuasbadguy closed 4 years ago

bauhuasbadguy commented 4 years ago

get_papers fetched a bad PDF file, and ami dies when it hits it. The bad PDF is attached to this issue. The command is: ami -p ./results pdfbox

The exception it returns is

java.lang.RuntimeException: Cannot load PDF ./results/PMC2718502/fulltext.pdf
    at org.contentmine.cproject.files.CTree.processPDFTree(CTree.java:1753)
    at org.contentmine.ami.tools.AMIPDFTool.docProcRunPDF(AMIPDFTool.java:270)
    at org.contentmine.ami.tools.AMIPDFTool.processTree(AMIPDFTool.java:190)
    at org.contentmine.ami.tools.AbstractAMITool.processTrees(AbstractAMITool.java:588)
    at org.contentmine.ami.tools.AMIPDFTool.runSpecifics(AMIPDFTool.java:159)
    at org.contentmine.ami.tools.AbstractAMITool.runCommands(AbstractAMITool.java:193)
    at org.contentmine.ami.tools.AbstractAMITool.call(AbstractAMITool.java:173)
    at org.contentmine.ami.tools.AbstractAMITool.call(AbstractAMITool.java:41)
    at picocli.CommandLine.executeUserObject(CommandLine.java:1853)
    at picocli.CommandLine.access$1100(CommandLine.java:145)
    at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2255)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2249)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2213)
    at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2080)
    at org.contentmine.ami.tools.AMI.enhancedLoggingExecutionStrategy(AMI.java:176)
    at picocli.CommandLine.execute(CommandLine.java:1978)
    at org.contentmine.ami.tools.AMI.main(AMI.java:113)
Caused by: java.io.IOException: Error: End-of-File, expected line
    at org.apache.pdfbox.pdfparser.BaseParser.readLine(BaseParser.java:1124)
    at org.apache.pdfbox.pdfparser.COSParser.parseHeader(COSParser.java:2589)
    at org.apache.pdfbox.pdfparser.COSParser.parsePDFHeader(COSParser.java:2560)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:219)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1099)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1082)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1041)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:989)
    at org.contentmine.cproject.files.CTree.processPDFTree(CTree.java:1751)
    ... 16 more

fulltext.pdf

petermr commented 4 years ago

I have added a try/catch in the PDF reading code. It should now call LOG.error and continue to the next file. Logs are now written to <cwd>/logs/ami.log. Please pull and report.
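A minimal sketch of the skip-and-continue pattern described here, not the actual ami3 code. The class `SkipBadPdfSketch` and the `PdfLoader` interface are hypothetical stand-ins (in ami the load step is the PDFBox `PDDocument.load` call seen in the stack trace), and `java.util.logging` is used only to keep the example free of dependencies.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

class SkipBadPdfSketch {
    private static final Logger LOG = Logger.getLogger(SkipBadPdfSketch.class.getName());

    /** Hypothetical stand-in for the real PDF loading call. */
    interface PdfLoader {
        void load(File pdf) throws IOException;
    }

    /** Tries every file; logs and skips the ones that fail. Returns the files that loaded. */
    static List<File> processAll(List<File> pdfs, PdfLoader loader) {
        List<File> loaded = new ArrayList<>();
        for (File pdf : pdfs) {
            try {
                loader.load(pdf);
                loaded.add(pdf);
            } catch (IOException | RuntimeException e) {
                // One bad file must not abort the whole run: log it and move on.
                LOG.log(Level.SEVERE, "Cannot load PDF " + pdf + ", skipping", e);
            }
        }
        return loaded;
    }

    public static void main(String[] args) {
        List<File> pdfs = Arrays.asList(
                new File("good1.pdf"), new File("bad.pdf"), new File("good2.pdf"));
        // Fake loader that fails on one file, imitating a truncated PDF.
        PdfLoader fake = pdf -> {
            if (pdf.getName().startsWith("bad")) {
                throw new IOException("Error: End-of-File, expected line");
            }
        };
        System.out.println("Loaded: " + processAll(pdfs, fake));
    }
}
```

The important detail is that the try/catch sits inside the loop, so a failure affects only the file being processed.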

bauhuasbadguy commented 4 years ago

I pulled the latest version and I'm still getting the error

java.lang.RuntimeException: Cannot load PDF ./xml_results/PMC2718502/fulltext.pdf | Error: End-of-File, expected line
    at org.contentmine.cproject.files.CTree.processPDFTree(CTree.java:1753)
    at org.contentmine.ami.tools.AMIPDFTool.docProcRunPDF(AMIPDFTool.java:270)
    at org.contentmine.ami.tools.AMIPDFTool.processTree(AMIPDFTool.java:190)
    at org.contentmine.ami.tools.AbstractAMITool.processTrees(AbstractAMITool.java:586)
    at org.contentmine.ami.tools.AMIPDFTool.runSpecifics(AMIPDFTool.java:159)
    at org.contentmine.ami.tools.AbstractAMITool.runCommands(AbstractAMITool.java:191)
    at org.contentmine.ami.tools.AbstractAMITool.call(AbstractAMITool.java:171)
    at org.contentmine.ami.tools.AbstractAMITool.call(AbstractAMITool.java:41)
    at picocli.CommandLine.executeUserObject(CommandLine.java:1933)
    at picocli.CommandLine.access$1100(CommandLine.java:145)
    at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2332)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2326)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2291)
    at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2159)
    at org.contentmine.ami.tools.AMI.enhancedLoggingExecutionStrategy(AMI.java:185)
    at picocli.CommandLine.execute(CommandLine.java:2058)
    at org.contentmine.ami.tools.AMI.main(AMI.java:121)
Caused by: java.io.IOException: Error: End-of-File, expected line
    at org.apache.pdfbox.pdfparser.BaseParser.readLine(BaseParser.java:1124)
    at org.apache.pdfbox.pdfparser.COSParser.parseHeader(COSParser.java:2589)
    at org.apache.pdfbox.pdfparser.COSParser.parsePDFHeader(COSParser.java:2560)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:219)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1099)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1082)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1041)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:989)
    at org.contentmine.cproject.files.CTree.processPDFTree(CTree.java:1751)
    ... 16 more

Do I need to pull a specific branch?

bauhuasbadguy commented 4 years ago

Version info

ami 2020.07.23_10.44-NEXT-SNAPSHOT
(jar:file:/bin/ami3/appassembler/repo/ami3-2020.07.23_10.44-NEXT-SNAPSHOT.jar)
JVM: 1.8.0_252 (Oracle Corporation OpenJDK 64-Bit Server VM 25.252-b09)
OS: Linux 5.3.0-62-generic amd64

petermr commented 4 years ago

I think I pushed later than that.

bauhuasbadguy commented 4 years ago

I've definitely pulled the latest version from GitHub. Did it push right?

petermr commented 4 years ago

I have pushed again. I don't understand the version numbers, Remko. The name ami3-2020.07.23_10.44-NEXT-SNAPSHOT doesn't change between pushes. I assumed "10.44" was a time stamp.

@bauhuasbadguy Can you try again, using -vvv, and also make logs/ami.log available? Thanks.

bauhuasbadguy commented 4 years ago

I re-pulled and I've got the same error. Here's the log file

19:41:41.343 INFO  org.contentmine.ami.tools.AMI - args: [-p, ./xml_results, pdfbox]
19:41:41.372 DEBUG org.contentmine.ami.tools.AMI - Specified verbosity=0, this translates to level=WARN
19:41:41.372 DEBUG org.contentmine.ami.tools.AMI - Reconfiguring Console appender with WARN
19:41:41.373 INFO  org.contentmine.ami.tools.AMI - (The console will show WARN level messages)
19:41:41.373 INFO  org.contentmine.ami.tools.AMI - (Logs will also be printed to /logs/ami.log)
19:41:41.374 WARN  org.contentmine.ami.tools.AbstractAMITool - 
19:41:41.374 WARN  org.contentmine.ami.tools.AbstractAMITool - Generic values (AMIPDFTool)
19:41:41.374 WARN  org.contentmine.ami.tools.AbstractAMITool - ================================
19:41:41.479 INFO  org.contentmine.ami.tools.AbstractAMITool - input basename      null
19:41:41.479 INFO  org.contentmine.ami.tools.AbstractAMITool - input basename list null
19:41:41.479 INFO  org.contentmine.ami.tools.AbstractAMITool - cproject            /./xml_results
19:41:41.479 INFO  org.contentmine.ami.tools.AbstractAMITool - ctree               
19:41:41.480 INFO  org.contentmine.ami.tools.AbstractAMITool - cTreeList           148 trees [./xml_results/PMC1920248, ./xml_results/PMC224160
19:41:41.480 INFO  org.contentmine.ami.tools.AbstractAMITool - excludeBase         {}
19:41:41.480 INFO  org.contentmine.ami.tools.AbstractAMITool - excludeTrees        {}
19:41:41.480 INFO  org.contentmine.ami.tools.AbstractAMITool - forceMake           false
19:41:41.480 INFO  org.contentmine.ami.tools.AbstractAMITool - includeBase         {}
19:41:41.480 INFO  org.contentmine.ami.tools.AbstractAMITool - includeTrees        null
19:41:41.480 INFO  org.contentmine.ami.tools.AbstractAMITool - log4j               {}
19:41:41.481 INFO  org.contentmine.ami.tools.AbstractAMITool - verbose             0
19:41:41.481 WARN  org.contentmine.ami.tools.AbstractAMITool - 
19:41:41.481 WARN  org.contentmine.ami.tools.AbstractAMITool - Specific values (AMIPDFTool)
19:41:41.481 WARN  org.contentmine.ami.tools.AbstractAMITool - ================================
19:41:41.481 INFO  org.contentmine.ami.tools.AMIPDFTool - maxpages            5
19:41:41.481 INFO  org.contentmine.ami.tools.AMIPDFTool - svgDirectoryName    svg/
19:41:41.481 INFO  org.contentmine.ami.tools.AMIPDFTool - minimagesize        (10,10)
19:41:41.481 INFO  org.contentmine.ami.tools.AMIPDFTool - outputSVG           true
19:41:41.481 INFO  org.contentmine.ami.tools.AMIPDFTool - pdf2html            false
19:41:41.482 INFO  org.contentmine.ami.tools.AMIPDFTool - imgDirectoryName    pdfimages/
19:41:41.482 INFO  org.contentmine.ami.tools.AMIPDFTool - outputPDFImages     true
19:41:41.482 INFO  org.contentmine.ami.tools.AMIPDFTool - parserDebug         AMI_BRIEF
19:41:41.482 WARN  org.contentmine.ami.tools.AbstractAMITool - AMIPDFTool cTree: PMC1920248
19:41:41.482 WARN  org.contentmine.ami.tools.AMIPDFTool - cTree: PMC1920248
19:41:41.486 WARN  org.contentmine.ami.tools.AbstractAMITool - AMIPDFTool cTree: PMC2241601
19:41:41.486 WARN  org.contentmine.ami.tools.AMIPDFTool - cTree: PMC2241601
19:41:41.486 WARN  org.contentmine.ami.tools.AbstractAMITool - AMIPDFTool cTree: PMC2718502
19:41:41.487 WARN  org.contentmine.ami.tools.AMIPDFTool - cTree: PMC2718502

The version is listed as:

ami 2020.07.23_10.44-NEXT-SNAPSHOT
(jar:file:/bin/ami3/appassembler/repo/ami3-2020.07.23_10.44-NEXT-SNAPSHOT.jar)
JVM: 1.8.0_252 (Oracle Corporation OpenJDK 64-Bit Server VM 25.252-b09)
OS: Linux 5.3.0-62-generic amd64

Does the version update automatically when you push or do you need to edit a version file?

bauhuasbadguy commented 4 years ago

It didn't recognize the -vvv flag

petermr commented 4 years ago

The syntax is: ami -vvv -p myproject pdfbox ...
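For readers unfamiliar with stacked -v flags, here is a rough sketch of how a verbosity count could map to a log level. Only two points come from the logs in this thread (verbosity=0 gives WARN, verbosity=3 gives TRACE); the levels for 1 and 2, the class name and both method names are assumptions made for illustration, not ami's actual option handling.

```java
class VerbositySketch {
    /** Counts the v's in arguments such as -v, -vv or -vvv. */
    static int countVerbosity(String[] args) {
        int count = 0;
        for (String arg : args) {
            if (arg.matches("-v+")) {
                count += arg.length() - 1; // everything after the leading '-'
            }
        }
        return count;
    }

    /** Maps a verbosity count to a level name. */
    static String levelFor(int verbosity) {
        switch (verbosity) {
            case 0:  return "WARN";  // seen in the thread's log
            case 1:  return "INFO";  // assumption
            case 2:  return "DEBUG"; // assumption
            default: return "TRACE"; // 3 is seen in the thread's log; higher counts are an assumption
        }
    }

    public static void main(String[] args) {
        String[] example = {"-vvv", "-p", "myproject", "pdfbox"};
        System.out.println(levelFor(countVerbosity(example)));
    }
}
```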

bauhuasbadguy commented 4 years ago

OK new logs

23:19:15.027 INFO  org.contentmine.ami.tools.AMI - args: [-vvv, -p, ./xml_results, pdfbox]
23:19:15.059 DEBUG org.contentmine.ami.tools.AMI - Specified verbosity=3, this translates to level=TRACE
23:19:15.060 DEBUG org.contentmine.ami.tools.AMI - Reconfiguring Console appender with TRACE
23:19:15.060 INFO  org.contentmine.ami.tools.AMI - (The console will show TRACE level messages)
23:19:15.061 INFO  org.contentmine.ami.tools.AMI - (Logs will also be printed to /logs/ami.log)
23:19:15.062 WARN  org.contentmine.ami.tools.AbstractAMITool - 
23:19:15.062 WARN  org.contentmine.ami.tools.AbstractAMITool - Generic values (AMIPDFTool)
23:19:15.062 WARN  org.contentmine.ami.tools.AbstractAMITool - ================================
23:19:15.195 INFO  org.contentmine.ami.tools.AbstractAMITool - input basename      null
23:19:15.195 INFO  org.contentmine.ami.tools.AbstractAMITool - input basename list null
23:19:15.195 INFO  org.contentmine.ami.tools.AbstractAMITool - cproject            /./xml_results
23:19:15.195 INFO  org.contentmine.ami.tools.AbstractAMITool - ctree               
23:19:15.196 INFO  org.contentmine.ami.tools.AbstractAMITool - cTreeList           148 trees [./xml_results/PMC1920248, ./xml_results/PMC224160
23:19:15.196 INFO  org.contentmine.ami.tools.AbstractAMITool - excludeBase         {}
23:19:15.196 INFO  org.contentmine.ami.tools.AbstractAMITool - excludeTrees        {}
23:19:15.196 INFO  org.contentmine.ami.tools.AbstractAMITool - forceMake           false
23:19:15.196 INFO  org.contentmine.ami.tools.AbstractAMITool - includeBase         {}
23:19:15.197 INFO  org.contentmine.ami.tools.AbstractAMITool - includeTrees        null
23:19:15.197 INFO  org.contentmine.ami.tools.AbstractAMITool - log4j               {}
23:19:15.197 INFO  org.contentmine.ami.tools.AbstractAMITool - verbose             3
23:19:15.197 WARN  org.contentmine.ami.tools.AbstractAMITool - 
23:19:15.197 WARN  org.contentmine.ami.tools.AbstractAMITool - Specific values (AMIPDFTool)
23:19:15.197 WARN  org.contentmine.ami.tools.AbstractAMITool - ================================
23:19:15.198 INFO  org.contentmine.ami.tools.AMIPDFTool - maxpages            5
23:19:15.198 INFO  org.contentmine.ami.tools.AMIPDFTool - svgDirectoryName    svg/
23:19:15.198 INFO  org.contentmine.ami.tools.AMIPDFTool - minimagesize        (10,10)
23:19:15.198 INFO  org.contentmine.ami.tools.AMIPDFTool - outputSVG           true
23:19:15.198 INFO  org.contentmine.ami.tools.AMIPDFTool - pdf2html            false
23:19:15.198 INFO  org.contentmine.ami.tools.AMIPDFTool - imgDirectoryName    pdfimages/
23:19:15.198 INFO  org.contentmine.ami.tools.AMIPDFTool - outputPDFImages     true
23:19:15.199 INFO  org.contentmine.ami.tools.AMIPDFTool - parserDebug         AMI_BRIEF
23:19:15.199 WARN  org.contentmine.ami.tools.AbstractAMITool - AMIPDFTool cTree: PMC1920248
23:19:15.199 WARN  org.contentmine.ami.tools.AMIPDFTool - cTree: PMC1920248
23:19:15.202 WARN  org.contentmine.ami.tools.AbstractAMITool - AMIPDFTool cTree: PMC2241601
23:19:15.203 WARN  org.contentmine.ami.tools.AMIPDFTool - cTree: PMC2241601
23:19:15.203 WARN  org.contentmine.ami.tools.AbstractAMITool - AMIPDFTool cTree: PMC2718502
23:19:15.203 WARN  org.contentmine.ami.tools.AMIPDFTool - cTree: PMC2718502

bauhuasbadguy commented 4 years ago

I'm using the ami Dockerfile, if that helps.

remkop commented 4 years ago

@bauhuasbadguy I was able to reproduce the issue with latest master. I see the stacktrace on the console but not in the log. Investigating...

remkop commented 4 years ago

I think I found it. Two potential causes:

I made a change to split the logging: print a one-line error message to the console, and print the full stack trace to the log. That gives the following output (without -v)

$ ami  -p ./xml_result pdfbox

Generic values (AMIPDFTool)
================================

Specific values (AMIPDFTool)
================================
AMIPDFTool cTree: PMC2718502
cTree: PMC2718502
Ignoring error that occurred while process tree: PMC2718502: java.lang.RuntimeException: Cannot load PDF ./xml_result/PMC2718502/fulltext.pdf | Error: End-of-File, expected line

With -vv, details (including stack trace) are also printed to the console.
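The split described above can be sketched roughly as follows. This is an illustration under assumptions, not the change that was committed: it writes to two plain `PrintWriter` targets instead of configuring a logging framework, and the class name, method name and `verbosity >= 2` threshold (taken from the "-vv" remark) are hypothetical.

```java
import java.io.PrintWriter;

class SplitErrorReportSketch {
    /**
     * Writes a one-line summary to the console and the full stack trace to the log.
     * At verbosity 2 or higher the stack trace is also written to the console.
     */
    static void report(String treeName, Throwable t,
                       PrintWriter console, PrintWriter log, int verbosity) {
        String summary = "Ignoring error that occurred while processing tree "
                + treeName + ": " + t;
        console.println(summary);
        log.println(summary);
        t.printStackTrace(log);          // full details always go to the log
        if (verbosity >= 2) {
            t.printStackTrace(console);  // and to the console only when asked for
        }
        console.flush();
        log.flush();
    }

    public static void main(String[] args) {
        PrintWriter console = new PrintWriter(System.out);
        PrintWriter log = new PrintWriter(System.err); // stand-in for logs/ami.log
        report("PMC2718502", new RuntimeException("Cannot load PDF"), console, log, 0);
    }
}
```

Keeping the console to one line per failure makes a long batch run readable, while the log still has what is needed to diagnose each file.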

remkop commented 4 years ago

@bauhuasbadguy Please try again with latest master and close if you are happy with this.

bauhuasbadguy commented 4 years ago

OK, it's skipped the bad file and is merrily processing the rest of the pdfs. I'll close the issue

remkop commented 4 years ago

Glad to hear that. Thanks for the confirmation!

bauhuasbadguy commented 4 years ago

It's just finished, and it listed the error at the end.

bauhuasbadguy commented 4 years ago

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.awt.image.DataBufferInt.<init>(DataBufferInt.java:75)
    at java.awt.image.Raster.createPackedRaster(Raster.java:467)
    at java.awt.image.DirectColorModel.createCompatibleWritableRaster(DirectColorModel.java:1032)
    at java.awt.image.BufferedImage.<init>(BufferedImage.java:333)
    at sun.java2d.loops.GraphicsPrimitive.convertFrom(GraphicsPrimitive.java:557)
    at sun.java2d.loops.GraphicsPrimitive.convertFrom(GraphicsPrimitive.java:541)
    at sun.java2d.loops.MaskBlit$General.MaskBlit(MaskBlit.java:189)
    at sun.java2d.loops.Blit$GeneralMaskBlit.Blit(Blit.java:204)
    at sun.java2d.pipe.DrawImage.blitSurfaceData(DrawImage.java:959)
    at sun.java2d.pipe.DrawImage.renderImageCopy(DrawImage.java:577)
    at sun.java2d.pipe.DrawImage.copyImage(DrawImage.java:67)
    at sun.java2d.pipe.DrawImage.copyImage(DrawImage.java:1014)
    at sun.java2d.pipe.ValidatePipe.copyImage(ValidatePipe.java:186)
    at sun.java2d.SunGraphics2D.drawImage(SunGraphics2D.java:3320)
    at sun.java2d.SunGraphics2D.drawImage(SunGraphics2D.java:3298)
    at java.awt.image.ColorConvertOp.ICCBIFilter(ColorConvertOp.java:339)
    at java.awt.image.ColorConvertOp.filter(ColorConvertOp.java:282)
    at org.apache.pdfbox.pdmodel.graphics.color.PDColorSpace.toRGBImageAWT(PDColorSpace.java:314)
    at org.apache.pdfbox.pdmodel.graphics.color.PDICCBased.toRGBImage(PDICCBased.java:375)
    at org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.from8bit(SampledImageReader.java:411)
    at org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.getRGBImage(SampledImageReader.java:226)
    at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.getImage(PDImageXObject.java:479)
    at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.getImage(PDImageXObject.java:460)
    at org.apache.pdfbox.rendering.PageDrawer.drawImage(PageDrawer.java:1130)
    at org.contentmine.pdf2svg2.AbstractPageParser.drawImage(AbstractPageParser.java:419)
    at org.apache.pdfbox.contentstream.operator.graphics.DrawObject.process(DrawObject.java:67)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:875)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:509)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:483)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:156)
    at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:270)

bauhuasbadguy commented 4 years ago

The error is also in the log file

18:01:43.137 WARN  org.contentmine.ami.tools.AMIPDFTool - cTree: PMC2718502
18:01:43.153 ERROR org.contentmine.ami.tools.AbstractAMITool - Ignoring error that occurred while process tree: PMC2718502: java.lang.RuntimeException: Cannot load PDF ./xml_results/PMC2718502/fulltext.pdf | Error: End-of-File, expected line
18:01:43.153 DEBUG org.contentmine.ami.tools.AbstractAMITool - Details of the problem processing tree PMC2718502
java.lang.RuntimeException: Cannot load PDF ./xml_results/PMC2718502/fulltext.pdf | Error: End-of-File, expected line
    at org.contentmine.cproject.files.CTree.processPDFTree(CTree.java:1753) ~[ami3-2020.07.24_07.23-NEXT-SNAPSHOT.jar:2020.07.24_07.23-NEXT-SNAPSHOT]
    at org.contentmine.ami.tools.AMIPDFTool.docProcRunPDF(AMIPDFTool.java:270) ~[ami3-2020.07.24_07.23-NEXT-SNAPSHOT.jar:2020.07.24_07.23-NEXT-SNAPSHOT]
    at org.contentmine.ami.tools.AMIPDFTool.processTree(AMIPDFTool.java:190) ~[ami3-2020.07.24_07.23-NEXT-SNAPSHOT.jar:2020.07.24_07.23-NEXT-SNAPSHOT]
    at org.contentmine.ami.tools.AbstractAMITool.processTrees(AbstractAMITool.java:589) ~[ami3-2020.07.24_07.23-NEXT-SNAPSHOT.jar:2020.07.24_07.23-NEXT-SNAPSHOT]
    at org.contentmine.ami.tools.AMIPDFTool.runSpecifics(AMIPDFTool.java:159) ~[ami3-2020.07.24_07.23-NEXT-SNAPSHOT.jar:2020.07.24_07.23-NEXT-SNAPSHOT]
    at org.contentmine.ami.tools.AbstractAMITool.runCommands(AbstractAMITool.java:193) ~[ami3-2020.07.24_07.23-NEXT-SNAPSHOT.jar:2020.07.24_07.23-NEXT-SNAPSHOT]
    at org.contentmine.ami.tools.AbstractAMITool.call(AbstractAMITool.java:173) ~[ami3-2020.07.24_07.23-NEXT-SNAPSHOT.jar:2020.07.24_07.23-NEXT-SNAPSHOT]
    at org.contentmine.ami.tools.AbstractAMITool.call(AbstractAMITool.java:41) ~[ami3-2020.07.24_07.23-NEXT-SNAPSHOT.jar:2020.07.24_07.23-NEXT-SNAPSHOT]
    at picocli.CommandLine.executeUserObject(CommandLine.java:1933) ~[picocli-4.4.0.jar:4.4.0]
    at picocli.CommandLine.access$1100(CommandLine.java:145) ~[picocli-4.4.0.jar:4.4.0]
    at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2332) ~[picocli-4.4.0.jar:4.4.0]
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2326) ~[picocli-4.4.0.jar:4.4.0]
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2291) ~[picocli-4.4.0.jar:4.4.0]
    at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2159) ~[picocli-4.4.0.jar:4.4.0]
    at org.contentmine.ami.tools.AMI.enhancedLoggingExecutionStrategy(AMI.java:194) ~[ami3-2020.07.24_07.23-NEXT-SNAPSHOT.jar:2020.07.24_07.23-NEXT-SNAPSHOT]
    at picocli.CommandLine.execute(CommandLine.java:2058) [picocli-4.4.0.jar:4.4.0]
    at org.contentmine.ami.tools.AMI.main(AMI.java:122) [ami3-2020.07.24_07.23-NEXT-SNAPSHOT.jar:2020.07.24_07.23-NEXT-SNAPSHOT]
Caused by: java.io.IOException: Error: End-of-File, expected line
    at org.apache.pdfbox.pdfparser.BaseParser.readLine(BaseParser.java:1124) ~[pdfbox-2.0.19.jar:2.0.19]
    at org.apache.pdfbox.pdfparser.COSParser.parseHeader(COSParser.java:2589) ~[pdfbox-2.0.19.jar:2.0.19]
    at org.apache.pdfbox.pdfparser.COSParser.parsePDFHeader(COSParser.java:2560) ~[pdfbox-2.0.19.jar:2.0.19]
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:219) ~[pdfbox-2.0.19.jar:2.0.19]
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1099) ~[pdfbox-2.0.19.jar:2.0.19]
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1082) ~[pdfbox-2.0.19.jar:2.0.19]
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1041) ~[pdfbox-2.0.19.jar:2.0.19]
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:989) ~[pdfbox-2.0.19.jar:2.0.19]
    at org.contentmine.cproject.files.CTree.processPDFTree(CTree.java:1751) ~[ami3-2020.07.24_07.23-NEXT-SNAPSHOT.jar:2020.07.24_07.23-NEXT-SNAPSHOT]
    ... 16 more
18:01:43.170 WARN  org.contentmine.ami.tools.AbstractAMITool - AMIPDFTool cTree: PMC2817415
18:01:43.170 WARN  org.contentmine.ami.tools.AMIPDFTool - cTree: PMC2817415
18:01:43.173 WARN  org.contentmine.ami.tools.AbstractAMITool - AMIPDFTool cTree: PMC2837259
18:01:43.173 WARN  org.contentmine.ami.tools.AMIPDFTool - cTree: PMC2837259

remkop commented 4 years ago

Thanks for following up.

Your last comment looks like what I would expect: on the console it says "Ignoring error ..." - basically ignoring the corrupt PDF fulltext.pdf and continuing to execute, while showing the detailed stack trace in the log for troubleshooting. So far so good.

The preceding comment shows a different problem: java.lang.OutOfMemoryError: Java heap space. This is not because of a corrupt pdf file and this is not a recoverable error... The process crashed because it ran out of memory.
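To illustrate why a per-file handler does not help here: `OutOfMemoryError` is a subclass of `java.lang.Error`, not of `java.lang.Exception`, so a `catch (Exception e)` block never sees it. The sketch below simulates this by throwing the error by hand; the class and method names are hypothetical, and the outer `catch (Error ...)` exists only to make the demo observable, not as a recommended recovery strategy.

```java
class ErrorVsExceptionSketch {
    /** Runs a task behind a per-file style handler and reports what happened. */
    static String run(Runnable task) {
        try {
            try {
                task.run();
                return "ok";
            } catch (Exception e) {
                // Handles a corrupt file: the run can carry on with the next one.
                return "skipped: " + e.getMessage();
            }
        } catch (Error err) {
            // Reached only because the inner handler does not match Errors.
            return "escaped per-file handler: " + err.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(run(() -> { throw new RuntimeException("Cannot load PDF"); }));
        // Simulated only; a real OutOfMemoryError comes from the JVM itself.
        System.out.println(run(() -> { throw new OutOfMemoryError("Java heap space (simulated)"); }));
    }
}
```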

I suspect this is caused by https://github.com/petermr/ami3/issues/31 (memory leak in ami pdf somewhere). Not sure where @petermr was with investigating #31...

remkop commented 4 years ago

For me to investigate I would need some way to reproduce the problem. @bauhuasbadguy How many pdf files are you processing? Can you zip them up (perhaps zip up the CProject directory) and attach it to the https://github.com/petermr/ami3/issues/31 issue, together with the command you ran that triggered the java.lang.OutOfMemoryError: Java heap space?