Closed bauhuasbadguy closed 4 years ago
I have added a try-catch in the PDF reading code. It should now LOG.error and continue to the next file. Logs are now written to <cwd>/logs/ami.log. Please pull and report.
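For readers following along, a minimal sketch of the "log the error and continue to the next file" pattern described above. This is not the actual ami3 code: the class name, `loadPdf`, and `processAll` are hypothetical, and the loader is a stand-in that only checks for the `%PDF-` header rather than calling PDFBox.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.logging.Level;
import java.util.logging.Logger;

class SkipBadFiles {
    private static final Logger LOG = Logger.getLogger(SkipBadFiles.class.getName());

    // Hypothetical stand-in for the real PDF loader: it only checks for the "%PDF-" header.
    static void loadPdf(String name, byte[] content) {
        String head = new String(content, 0, Math.min(content.length, 5), StandardCharsets.US_ASCII);
        if (!head.equals("%PDF-")) {
            throw new RuntimeException("Cannot load PDF " + name + " | Error: End-of-File, expected line");
        }
    }

    // Tries every file; a failure is logged and recorded, and the loop moves on to the next file.
    static List<String> processAll(Map<String, byte[]> files) {
        List<String> skipped = new ArrayList<>();
        for (Map.Entry<String, byte[]> entry : files.entrySet()) {
            try {
                loadPdf(entry.getKey(), entry.getValue());
            } catch (RuntimeException e) {
                LOG.log(Level.SEVERE, e.getMessage(), e);
                skipped.add(entry.getKey());
            }
        }
        return skipped;
    }

    public static void main(String[] args) {
        Map<String, byte[]> files = new LinkedHashMap<>();
        files.put("empty.pdf", new byte[0]); // simulates a truncated download
        files.put("good.pdf", "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII));
        System.out.println("Skipped: " + processAll(files));
    }
}
```

The point of the sketch is only that the try-catch sits inside the loop, so one unreadable file does not abort the whole run.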
I pulled the latest version and I'm still getting the error
java.lang.RuntimeException: Cannot load PDF ./xml_results/PMC2718502/fulltext.pdf | Error: End-of-File, expected line
at org.contentmine.cproject.files.CTree.processPDFTree(CTree.java:1753)
at org.contentmine.ami.tools.AMIPDFTool.docProcRunPDF(AMIPDFTool.java:270)
at org.contentmine.ami.tools.AMIPDFTool.processTree(AMIPDFTool.java:190)
at org.contentmine.ami.tools.AbstractAMITool.processTrees(AbstractAMITool.java:586)
at org.contentmine.ami.tools.AMIPDFTool.runSpecifics(AMIPDFTool.java:159)
at org.contentmine.ami.tools.AbstractAMITool.runCommands(AbstractAMITool.java:191)
at org.contentmine.ami.tools.AbstractAMITool.call(AbstractAMITool.java:171)
at org.contentmine.ami.tools.AbstractAMITool.call(AbstractAMITool.java:41)
at picocli.CommandLine.executeUserObject(CommandLine.java:1933)
at picocli.CommandLine.access$1100(CommandLine.java:145)
at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2332)
at picocli.CommandLine$RunLast.handle(CommandLine.java:2326)
at picocli.CommandLine$RunLast.handle(CommandLine.java:2291)
at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2159)
at org.contentmine.ami.tools.AMI.enhancedLoggingExecutionStrategy(AMI.java:185)
at picocli.CommandLine.execute(CommandLine.java:2058)
at org.contentmine.ami.tools.AMI.main(AMI.java:121)
Caused by: java.io.IOException: Error: End-of-File, expected line
at org.apache.pdfbox.pdfparser.BaseParser.readLine(BaseParser.java:1124)
at org.apache.pdfbox.pdfparser.COSParser.parseHeader(COSParser.java:2589)
at org.apache.pdfbox.pdfparser.COSParser.parsePDFHeader(COSParser.java:2560)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:219)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1099)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1082)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1041)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:989)
at org.contentmine.cproject.files.CTree.processPDFTree(CTree.java:1751)
... 16 more
Do I need to pull a specific branch?
Version info
ami 2020.07.23_10.44-NEXT-SNAPSHOT
(jar:file:/bin/ami3/appassembler/repo/ami3-2020.07.23_10.44-NEXT-SNAPSHOT.jar)
JVM: 1.8.0_252 (Oracle Corporation OpenJDK 64-Bit Server VM 25.252-b09)
OS: Linux 5.3.0-62-generic amd64
I think I pushed later than that.
I've definitely pulled the latest version from GitHub. Did the push go through?
I have pushed again. I don't understand the version numbers, Remko. The name ami3-2020.07.23_10.44-NEXT-SNAPSHOT doesn't change between pushes. I assumed "10.44" was a time stamp.
@bauhuasbadguy Can you try again using -vvv, and also make logs/ami.log available? Thanks.
I re-pulled and I've got the same error. Here's the log file
19:41:41.343 INFO org.contentmine.ami.tools.AMI - args: [-p, ./xml_results, pdfbox]
19:41:41.372 DEBUG org.contentmine.ami.tools.AMI - Specified verbosity=0, this translates to level=WARN
19:41:41.372 DEBUG org.contentmine.ami.tools.AMI - Reconfiguring Console appender with WARN
19:41:41.373 INFO org.contentmine.ami.tools.AMI - (The console will show WARN level messages)
19:41:41.373 INFO org.contentmine.ami.tools.AMI - (Logs will also be printed to /logs/ami.log)
19:41:41.374 WARN org.contentmine.ami.tools.AbstractAMITool -
19:41:41.374 WARN org.contentmine.ami.tools.AbstractAMITool - Generic values (AMIPDFTool)
19:41:41.374 WARN org.contentmine.ami.tools.AbstractAMITool - ================================
19:41:41.479 INFO org.contentmine.ami.tools.AbstractAMITool - input basename null
19:41:41.479 INFO org.contentmine.ami.tools.AbstractAMITool - input basename list null
19:41:41.479 INFO org.contentmine.ami.tools.AbstractAMITool - cproject /./xml_results
19:41:41.479 INFO org.contentmine.ami.tools.AbstractAMITool - ctree
19:41:41.480 INFO org.contentmine.ami.tools.AbstractAMITool - cTreeList 148 trees [./xml_results/PMC1920248, ./xml_results/PMC224160
19:41:41.480 INFO org.contentmine.ami.tools.AbstractAMITool - excludeBase {}
19:41:41.480 INFO org.contentmine.ami.tools.AbstractAMITool - excludeTrees {}
19:41:41.480 INFO org.contentmine.ami.tools.AbstractAMITool - forceMake false
19:41:41.480 INFO org.contentmine.ami.tools.AbstractAMITool - includeBase {}
19:41:41.480 INFO org.contentmine.ami.tools.AbstractAMITool - includeTrees null
19:41:41.480 INFO org.contentmine.ami.tools.AbstractAMITool - log4j {}
19:41:41.481 INFO org.contentmine.ami.tools.AbstractAMITool - verbose 0
19:41:41.481 WARN org.contentmine.ami.tools.AbstractAMITool -
19:41:41.481 WARN org.contentmine.ami.tools.AbstractAMITool - Specific values (AMIPDFTool)
19:41:41.481 WARN org.contentmine.ami.tools.AbstractAMITool - ================================
19:41:41.481 INFO org.contentmine.ami.tools.AMIPDFTool - maxpages 5
19:41:41.481 INFO org.contentmine.ami.tools.AMIPDFTool - svgDirectoryName svg/
19:41:41.481 INFO org.contentmine.ami.tools.AMIPDFTool - minimagesize (10,10)
19:41:41.481 INFO org.contentmine.ami.tools.AMIPDFTool - outputSVG true
19:41:41.481 INFO org.contentmine.ami.tools.AMIPDFTool - pdf2html false
19:41:41.482 INFO org.contentmine.ami.tools.AMIPDFTool - imgDirectoryName pdfimages/
19:41:41.482 INFO org.contentmine.ami.tools.AMIPDFTool - outputPDFImages true
19:41:41.482 INFO org.contentmine.ami.tools.AMIPDFTool - parserDebug AMI_BRIEF
19:41:41.482 WARN org.contentmine.ami.tools.AbstractAMITool - AMIPDFTool cTree: PMC1920248
19:41:41.482 WARN org.contentmine.ami.tools.AMIPDFTool - cTree: PMC1920248
19:41:41.486 WARN org.contentmine.ami.tools.AbstractAMITool - AMIPDFTool cTree: PMC2241601
19:41:41.486 WARN org.contentmine.ami.tools.AMIPDFTool - cTree: PMC2241601
19:41:41.486 WARN org.contentmine.ami.tools.AbstractAMITool - AMIPDFTool cTree: PMC2718502
19:41:41.487 WARN org.contentmine.ami.tools.AMIPDFTool - cTree: PMC2718502
The version is listed as:
ami 2020.07.23_10.44-NEXT-SNAPSHOT
(jar:file:/bin/ami3/appassembler/repo/ami3-2020.07.23_10.44-NEXT-SNAPSHOT.jar)
JVM: 1.8.0_252 (Oracle Corporation OpenJDK 64-Bit Server VM 25.252-b09)
OS: Linux 5.3.0-62-generic amd64
Does the version update automatically when you push or do you need to edit a version file?
It didn't recognize the -vvv flag
the syntax is: ami -vvv -p myproject pdfbox ...
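As a rough illustration of how the repeated -v flag relates to the log level: the logs in this thread show verbosity=0 translating to WARN and verbosity=3 translating to TRACE. The sketch below is plain Java, not the picocli-based parsing that ami actually uses. The class and method names are hypothetical, and the mapping of a single -v to INFO is an assumption.

```java
class Verbosity {
    // Counts 'v' characters in short-option clusters such as -v, -vv, -vvv.
    static int count(String[] args) {
        int n = 0;
        for (String arg : args) {
            if (arg.matches("-v+")) {
                n += arg.length() - 1;
            }
        }
        return n;
    }

    // 0 -> WARN and 3 -> TRACE match the logs in this thread; 1 -> INFO is an assumption.
    static String level(int verbosity) {
        switch (verbosity) {
            case 0: return "WARN";
            case 1: return "INFO";
            case 2: return "DEBUG";
            default: return "TRACE";
        }
    }

    public static void main(String[] args) {
        String[] example = {"-vvv", "-p", "myproject", "pdfbox"};
        System.out.println("verbosity=" + count(example) + ", level=" + level(count(example)));
    }
}
```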
OK new logs
23:19:15.027 INFO org.contentmine.ami.tools.AMI - args: [-vvv, -p, ./xml_results, pdfbox]
23:19:15.059 DEBUG org.contentmine.ami.tools.AMI - Specified verbosity=3, this translates to level=TRACE
23:19:15.060 DEBUG org.contentmine.ami.tools.AMI - Reconfiguring Console appender with TRACE
23:19:15.060 INFO org.contentmine.ami.tools.AMI - (The console will show TRACE level messages)
23:19:15.061 INFO org.contentmine.ami.tools.AMI - (Logs will also be printed to /logs/ami.log)
23:19:15.062 WARN org.contentmine.ami.tools.AbstractAMITool -
23:19:15.062 WARN org.contentmine.ami.tools.AbstractAMITool - Generic values (AMIPDFTool)
23:19:15.062 WARN org.contentmine.ami.tools.AbstractAMITool - ================================
23:19:15.195 INFO org.contentmine.ami.tools.AbstractAMITool - input basename null
23:19:15.195 INFO org.contentmine.ami.tools.AbstractAMITool - input basename list null
23:19:15.195 INFO org.contentmine.ami.tools.AbstractAMITool - cproject /./xml_results
23:19:15.195 INFO org.contentmine.ami.tools.AbstractAMITool - ctree
23:19:15.196 INFO org.contentmine.ami.tools.AbstractAMITool - cTreeList 148 trees [./xml_results/PMC1920248, ./xml_results/PMC224160
23:19:15.196 INFO org.contentmine.ami.tools.AbstractAMITool - excludeBase {}
23:19:15.196 INFO org.contentmine.ami.tools.AbstractAMITool - excludeTrees {}
23:19:15.196 INFO org.contentmine.ami.tools.AbstractAMITool - forceMake false
23:19:15.196 INFO org.contentmine.ami.tools.AbstractAMITool - includeBase {}
23:19:15.197 INFO org.contentmine.ami.tools.AbstractAMITool - includeTrees null
23:19:15.197 INFO org.contentmine.ami.tools.AbstractAMITool - log4j {}
23:19:15.197 INFO org.contentmine.ami.tools.AbstractAMITool - verbose 3
23:19:15.197 WARN org.contentmine.ami.tools.AbstractAMITool -
23:19:15.197 WARN org.contentmine.ami.tools.AbstractAMITool - Specific values (AMIPDFTool)
23:19:15.197 WARN org.contentmine.ami.tools.AbstractAMITool - ================================
23:19:15.198 INFO org.contentmine.ami.tools.AMIPDFTool - maxpages 5
23:19:15.198 INFO org.contentmine.ami.tools.AMIPDFTool - svgDirectoryName svg/
23:19:15.198 INFO org.contentmine.ami.tools.AMIPDFTool - minimagesize (10,10)
23:19:15.198 INFO org.contentmine.ami.tools.AMIPDFTool - outputSVG true
23:19:15.198 INFO org.contentmine.ami.tools.AMIPDFTool - pdf2html false
23:19:15.198 INFO org.contentmine.ami.tools.AMIPDFTool - imgDirectoryName pdfimages/
23:19:15.198 INFO org.contentmine.ami.tools.AMIPDFTool - outputPDFImages true
23:19:15.199 INFO org.contentmine.ami.tools.AMIPDFTool - parserDebug AMI_BRIEF
23:19:15.199 WARN org.contentmine.ami.tools.AbstractAMITool - AMIPDFTool cTree: PMC1920248
23:19:15.199 WARN org.contentmine.ami.tools.AMIPDFTool - cTree: PMC1920248
23:19:15.202 WARN org.contentmine.ami.tools.AbstractAMITool - AMIPDFTool cTree: PMC2241601
23:19:15.203 WARN org.contentmine.ami.tools.AMIPDFTool - cTree: PMC2241601
23:19:15.203 WARN org.contentmine.ami.tools.AbstractAMITool - AMIPDFTool cTree: PMC2718502
23:19:15.203 WARN org.contentmine.ami.tools.AMIPDFTool - cTree: PMC2718502
I'm using the ami dockerfile if that helps
@bauhuasbadguy I was able to reproduce the issue with latest master. I see the stacktrace on the console but not in the log. Investigating...
I think I found it. Two potential causes:
mvn clean package to rebuild. The clean removes the old binaries. (That was why I saw the stacktrace on the console but not in the log.)
I made a change to split the logging: print a one-line error message to the console, and print the full stack trace to the log.
That gives the following output (without -v):
$ ami -p ./xml_result pdfbox
Generic values (AMIPDFTool)
================================
Specific values (AMIPDFTool)
================================
AMIPDFTool cTree: PMC2718502
cTree: PMC2718502
Ignoring error that occurred while process tree: PMC2718502: java.lang.RuntimeException: Cannot load PDF ./xml_result/PMC2718502/fulltext.pdf | Error: End-of-File, expected line
With -vv, details (including stack trace) are also printed to the console.
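A minimal sketch of the split described above, with a one-line summary for the console and the full stack trace for the log. It is not the ami3 implementation: the class and method names are hypothetical, and standard output stands in for the log file.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.io.StringWriter;

class SplitErrorReport {
    // One-line summary intended for the console.
    static String summary(String treeName, Throwable t) {
        return "Ignoring error that occurred while processing tree: " + treeName + ": " + t;
    }

    // Full stack trace, including any "Caused by" sections, intended for the log file.
    static String details(Throwable t) {
        StringWriter buffer = new StringWriter();
        t.printStackTrace(new PrintWriter(buffer, true));
        return buffer.toString();
    }

    public static void main(String[] args) {
        Throwable cause = new IOException("Error: End-of-File, expected line");
        Throwable error = new RuntimeException("Cannot load PDF fulltext.pdf | " + cause.getMessage(), cause);
        System.err.println(summary("PMC2718502", error)); // console: short message only
        System.out.println(details(error));               // stands in for the log file
    }
}
```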
@bauhuasbadguy Please try again with latest master and close if you are happy with this.
OK, it's skipped the bad file and is merrily processing the rest of the pdfs. I'll close the issue
Glad to hear that. Thanks for the confirmation!
It's just finished, and it listed this error at the end:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.awt.image.DataBufferInt.<init>(DataBufferInt.java:75)
at java.awt.image.Raster.createPackedRaster(Raster.java:467)
at java.awt.image.DirectColorModel.createCompatibleWritableRaster(DirectColorModel.java:1032)
at java.awt.image.BufferedImage.<init>(BufferedImage.java:333)
at sun.java2d.loops.GraphicsPrimitive.convertFrom(GraphicsPrimitive.java:557)
at sun.java2d.loops.GraphicsPrimitive.convertFrom(GraphicsPrimitive.java:541)
at sun.java2d.loops.MaskBlit$General.MaskBlit(MaskBlit.java:189)
at sun.java2d.loops.Blit$GeneralMaskBlit.Blit(Blit.java:204)
at sun.java2d.pipe.DrawImage.blitSurfaceData(DrawImage.java:959)
at sun.java2d.pipe.DrawImage.renderImageCopy(DrawImage.java:577)
at sun.java2d.pipe.DrawImage.copyImage(DrawImage.java:67)
at sun.java2d.pipe.DrawImage.copyImage(DrawImage.java:1014)
at sun.java2d.pipe.ValidatePipe.copyImage(ValidatePipe.java:186)
at sun.java2d.SunGraphics2D.drawImage(SunGraphics2D.java:3320)
at sun.java2d.SunGraphics2D.drawImage(SunGraphics2D.java:3298)
at java.awt.image.ColorConvertOp.ICCBIFilter(ColorConvertOp.java:339)
at java.awt.image.ColorConvertOp.filter(ColorConvertOp.java:282)
at org.apache.pdfbox.pdmodel.graphics.color.PDColorSpace.toRGBImageAWT(PDColorSpace.java:314)
at org.apache.pdfbox.pdmodel.graphics.color.PDICCBased.toRGBImage(PDICCBased.java:375)
at org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.from8bit(SampledImageReader.java:411)
at org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.getRGBImage(SampledImageReader.java:226)
at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.getImage(PDImageXObject.java:479)
at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.getImage(PDImageXObject.java:460)
at org.apache.pdfbox.rendering.PageDrawer.drawImage(PageDrawer.java:1130)
at org.contentmine.pdf2svg2.AbstractPageParser.drawImage(AbstractPageParser.java:419)
at org.apache.pdfbox.contentstream.operator.graphics.DrawObject.process(DrawObject.java:67)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:875)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:509)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:483)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:156)
at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:270)
The error is also in the log file
18:01:43.137 WARN org.contentmine.ami.tools.AMIPDFTool - cTree: PMC2718502
18:01:43.153 ERROR org.contentmine.ami.tools.AbstractAMITool - Ignoring error that occurred while process tree: PMC2718502: java.lang.RuntimeException: Cannot load PDF ./xml_results/PMC2718502/fulltext.pdf | Error: End-of-File, expected line
18:01:43.153 DEBUG org.contentmine.ami.tools.AbstractAMITool - Details of the problem processing tree PMC2718502
java.lang.RuntimeException: Cannot load PDF ./xml_results/PMC2718502/fulltext.pdf | Error: End-of-File, expected line
at org.contentmine.cproject.files.CTree.processPDFTree(CTree.java:1753) ~[ami3-2020.07.24_07.23-NEXT-SNAPSHOT.jar:2020.07.24_07.23-NEXT-SNAPSHOT]
at org.contentmine.ami.tools.AMIPDFTool.docProcRunPDF(AMIPDFTool.java:270) ~[ami3-2020.07.24_07.23-NEXT-SNAPSHOT.jar:2020.07.24_07.23-NEXT-SNAPSHOT]
at org.contentmine.ami.tools.AMIPDFTool.processTree(AMIPDFTool.java:190) ~[ami3-2020.07.24_07.23-NEXT-SNAPSHOT.jar:2020.07.24_07.23-NEXT-SNAPSHOT]
at org.contentmine.ami.tools.AbstractAMITool.processTrees(AbstractAMITool.java:589) ~[ami3-2020.07.24_07.23-NEXT-SNAPSHOT.jar:2020.07.24_07.23-NEXT-SNAPSHOT]
at org.contentmine.ami.tools.AMIPDFTool.runSpecifics(AMIPDFTool.java:159) ~[ami3-2020.07.24_07.23-NEXT-SNAPSHOT.jar:2020.07.24_07.23-NEXT-SNAPSHOT]
at org.contentmine.ami.tools.AbstractAMITool.runCommands(AbstractAMITool.java:193) ~[ami3-2020.07.24_07.23-NEXT-SNAPSHOT.jar:2020.07.24_07.23-NEXT-SNAPSHOT]
at org.contentmine.ami.tools.AbstractAMITool.call(AbstractAMITool.java:173) ~[ami3-2020.07.24_07.23-NEXT-SNAPSHOT.jar:2020.07.24_07.23-NEXT-SNAPSHOT]
at org.contentmine.ami.tools.AbstractAMITool.call(AbstractAMITool.java:41) ~[ami3-2020.07.24_07.23-NEXT-SNAPSHOT.jar:2020.07.24_07.23-NEXT-SNAPSHOT]
at picocli.CommandLine.executeUserObject(CommandLine.java:1933) ~[picocli-4.4.0.jar:4.4.0]
at picocli.CommandLine.access$1100(CommandLine.java:145) ~[picocli-4.4.0.jar:4.4.0]
at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2332) ~[picocli-4.4.0.jar:4.4.0]
at picocli.CommandLine$RunLast.handle(CommandLine.java:2326) ~[picocli-4.4.0.jar:4.4.0]
at picocli.CommandLine$RunLast.handle(CommandLine.java:2291) ~[picocli-4.4.0.jar:4.4.0]
at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2159) ~[picocli-4.4.0.jar:4.4.0]
at org.contentmine.ami.tools.AMI.enhancedLoggingExecutionStrategy(AMI.java:194) ~[ami3-2020.07.24_07.23-NEXT-SNAPSHOT.jar:2020.07.24_07.23-NEXT-SNAPSHOT]
at picocli.CommandLine.execute(CommandLine.java:2058) [picocli-4.4.0.jar:4.4.0]
at org.contentmine.ami.tools.AMI.main(AMI.java:122) [ami3-2020.07.24_07.23-NEXT-SNAPSHOT.jar:2020.07.24_07.23-NEXT-SNAPSHOT]
Caused by: java.io.IOException: Error: End-of-File, expected line
at org.apache.pdfbox.pdfparser.BaseParser.readLine(BaseParser.java:1124) ~[pdfbox-2.0.19.jar:2.0.19]
at org.apache.pdfbox.pdfparser.COSParser.parseHeader(COSParser.java:2589) ~[pdfbox-2.0.19.jar:2.0.19]
at org.apache.pdfbox.pdfparser.COSParser.parsePDFHeader(COSParser.java:2560) ~[pdfbox-2.0.19.jar:2.0.19]
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:219) ~[pdfbox-2.0.19.jar:2.0.19]
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1099) ~[pdfbox-2.0.19.jar:2.0.19]
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1082) ~[pdfbox-2.0.19.jar:2.0.19]
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1041) ~[pdfbox-2.0.19.jar:2.0.19]
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:989) ~[pdfbox-2.0.19.jar:2.0.19]
at org.contentmine.cproject.files.CTree.processPDFTree(CTree.java:1751) ~[ami3-2020.07.24_07.23-NEXT-SNAPSHOT.jar:2020.07.24_07.23-NEXT-SNAPSHOT]
... 16 more
18:01:43.170 WARN org.contentmine.ami.tools.AbstractAMITool - AMIPDFTool cTree: PMC2817415
18:01:43.170 WARN org.contentmine.ami.tools.AMIPDFTool - cTree: PMC2817415
18:01:43.173 WARN org.contentmine.ami.tools.AbstractAMITool - AMIPDFTool cTree: PMC2837259
18:01:43.173 WARN org.contentmine.ami.tools.AMIPDFTool - cTree: PMC2837259
Thanks for following up.
Your last comment looks like what I would expect: on the console it says "Ignoring error ...", basically ignoring the corrupt pdf fulltext.pdf and continuing to execute, while showing the detailed stack trace in the log for troubleshooting. So far so good.
The preceding comment shows a different problem: java.lang.OutOfMemoryError: Java heap space.
This is not because of a corrupt pdf file and this is not a recoverable error... The process crashed because it ran out of memory.
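To make the distinction concrete, here is a small sketch (class and method names are hypothetical). It shows that OutOfMemoryError is an Error rather than an Exception, so a handler written as catch (Exception e) around a single file would not intercept it, and it prints the maximum heap the JVM will use. The standard JVM option -Xmx raises that limit; how the ami launcher passes JVM options is not covered in this thread.

```java
class HeapCheck {
    // Maximum heap the JVM will attempt to use, in MiB.
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    // True only for throwables that a plain `catch (Exception e)` block would intercept.
    static boolean caughtByExceptionHandler(Throwable t) {
        return t instanceof Exception;
    }

    public static void main(String[] args) {
        System.out.println("Max heap (MiB): " + maxHeapMiB());
        System.out.println("RuntimeException caught by catch (Exception)? "
                + caughtByExceptionHandler(new RuntimeException("corrupt pdf")));
        System.out.println("OutOfMemoryError caught by catch (Exception)? "
                + caughtByExceptionHandler(new OutOfMemoryError("Java heap space")));
    }
}
```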
I suspect this is caused by https://github.com/petermr/ami3/issues/31 (memory leak in ami pdf somewhere).
Not sure where @petermr was with investigating #31...
For me to investigate I would need some way to reproduce the problem.
@bauhuasbadguy How many pdf files are you processing? Can you zip them up (perhaps zip up the CProject directory) and attach it to the https://github.com/petermr/ami3/issues/31 issue, together with the command you ran that triggered the java.lang.OutOfMemoryError: Java heap space?
get_papers got a bad pdf file and it kills ami when it hits it. The bad pdf is attached to this issue. The command is
ami -p ./results pdfbox
The exception it returns is
fulltext.pdf