gsautter / goldengate-imagine

Automatically exported from code.google.com/p/goldengate-imagine

zootaxa processing resume #496

Closed myrmoteras closed 6 years ago

myrmoteras commented 6 years ago

Hi @gsautter, I am resuming the processing of Zootaxa and other journals. For that I downloaded the latest version of GGI. This is the first email about it. I used:
java -jar -Xmx10240m GgImagineBatch.jar "DATA=E:\diglib\zootaxa\temp" CACHE=./BatchCache FM=U

The PDFs that do not process, and that are large, are here: https://drive.google.com/drive/folders/0B_yrQwn4yBySTlBxMnhWd3o1Zkk?usp=sharing i.e. zootaxa.4483.1.10, zootaxa.4483.1.11, zootaxa.4483.1.3.pdf, zootaxa.4483.1.4.pdf, zootaxa.4483.1.5.pdf

None of these files work, so I am stopping for now and hope to get a solution. Thanks for looking into this.
Donat
GgImagine.20180926-1558.out.zip

Running Image Markup Tool 'Parse Bibliography' Wrapping document Checking document Loading document processor
Error processing document 'E:\diglib\zootaxa\temp19\zootaxa.4483.1.1.pdf': de.uka.ipd.idaho.gamta.util.AnnotationPatternMatcher$AnnotationIndex.addAnnotation(Lde/uka/ipd/idaho/gamta/Annotation;Ljava/lang/String;)V
java.lang.NoSuchMethodError: de.uka.ipd.idaho.gamta.util.AnnotationPatternMatcher$AnnotationIndex.addAnnotation(Lde/uka/ipd/idaho/gamta/Annotation;Ljava/lang/String;)V
    at de.uka.ipd.idaho.plugins.bibRefs.refParse.RefParse.getBrokenURLsAndDOIs(RefParse.java:2327)
    at de.uka.ipd.idaho.plugins.bibRefs.refParse.RefParse.getBaseDetails(RefParse.java:2215)
    at de.uka.ipd.idaho.plugins.bibRefs.refParse.RefParse.parseBibRefs(RefParse.java:973)
    at de.uka.ipd.idaho.plugins.bibRefs.refParse.RefParse.parseBibRefs(RefParse.java:944)
    at de.uka.ipd.idaho.plugins.bibRefs.refParse.RefParseInteractive.processBibRefs(RefParseInteractive.java:41)
    at de.uka.ipd.idaho.plugins.bibRefs.refParse.RefParseInteractive.processBibRefs(RefParseInteractive.java:36)
    at de.uka.ipd.idaho.plugins.bibRefs.refParse.RefParseAnalyzer.process(RefParseAnalyzer.java:59)
    at de.uka.ipd.idaho.goldenGate.plugin.analyzers.AnalyzerManager$AnalyzerDocumentProcessor.process(AnalyzerManager.java:207)
    at de.uka.ipd.idaho.goldenGate.plugin.pipelines.PipelineManager$PipelineDocumentProcessor.process(PipelineManager.java:329)
    at de.uka.ipd.idaho.goldenGate.plugin.pipelines.PipelineManager$PipelineDocumentProcessor.process(PipelineManager.java:295)
    at de.uka.ipd.idaho.im.imagine.plugins.tools.ImageMarkupToolManager$DpImageMarkupTool.process(ImageMarkupToolManager.java:549)
    at de.uka.ipd.idaho.im.imagine.batch.GoldenGateImagineBatch.main(GoldenGateImagineBatch.java:703)
Processing document 'E:\diglib\zootaxa\temp19\zootaxa.4483.1.10.pdf'

Storing font data Storing page images Storing page image data Storing supplement data Storing supplements Document stored to temporary folder
Running Image Markup Tool 'Add Document Meta Data' Storing document data Storing page data Storing word data Storing region data Storing annotation data Storing font data Storing page images Storing page image data Storing supplement data Storing supplements Document stored to temporary folder
Running Image Markup Tool 'Mark Taxonomic Keys' Storing document data Storing page data Storing word data Storing region data Storing annotation data Storing font data Storing page images Storing page image data Storing supplement data Storing supplements Document stored to temporary folder
Running Image Markup Tool 'Parse Bibliography' Wrapping document Checking document Loading document processor
Error processing document 'E:\diglib\zootaxa\temp19\zootaxa.4483.1.11.pdf': Unexpected character: @ near index 27
'(' "[a-z]{2,4}\.?" @:part ')'
^
java.util.regex.PatternSyntaxException: Unexpected character: @ near index 27
'(' "[a-z]{2,4}\.?" @:part ')'
^
    at de.uka.ipd.idaho.gamta.util.AnnotationPatternMatcher.getPattern(AnnotationPatternMatcher.java:902)
    at de.uka.ipd.idaho.gamta.util.AnnotationPatternMatcher.getMatchTrees(AnnotationPatternMatcher.java:704)
    at de.uka.ipd.idaho.gamta.util.AnnotationPatternMatcher.getMatches(AnnotationPatternMatcher.java:685)
    at de.uka.ipd.idaho.gamta.util.AnnotationPatternMatcher.getMatches(AnnotationPatternMatcher.java:656)
    at de.uka.ipd.idaho.plugins.bibRefs.refParse.RefParse.getNumberDetailBlocks(RefParse.java:6852)
    at de.uka.ipd.idaho.plugins.bibRefs.refParse.RefParse.getBaseDetails(RefParse.java:2285)
    at de.uka.ipd.idaho.plugins.bibRefs.refParse.RefParse.parseBibRefs(RefParse.java:973)
    at de.uka.ipd.idaho.plugins.bibRefs.refParse.RefParse.parseBibRefs(RefParse.java:944)
    at de.uka.ipd.idaho.plugins.bibRefs.refParse.RefParseInteractive.processBibRefs(RefParseInteractive.java:41)
    at de.uka.ipd.idaho.plugins.bibRefs.refParse.RefParseInteractive.processBibRefs(RefParseInteractive.java:36)
    at de.uka.ipd.idaho.plugins.bibRefs.refParse.RefParseAnalyzer.process(RefParseAnalyzer.java:59)
    at de.uka.ipd.idaho.goldenGate.plugin.analyzers.AnalyzerManager$AnalyzerDocumentProcessor.process(AnalyzerManager.java:207)
    at de.uka.ipd.idaho.goldenGate.plugin.pipelines.PipelineManager$PipelineDocumentProcessor.process(PipelineManager.java:329)
    at de.uka.ipd.idaho.goldenGate.plugin.pipelines.PipelineManager$PipelineDocumentProcessor.process(PipelineManager.java:295)
    at de.uka.ipd.idaho.im.imagine.plugins.tools.ImageMarkupToolManager$DpImageMarkupTool.process(ImageMarkupToolManager.java:549)
    at de.uka.ipd.idaho.im.imagine.batch.GoldenGateImagineBatch.main(GoldenGateImagineBatch.java:703)
Processing document 'E:\diglib\zootaxa\temp19\zootaxa.4483.1.2.pdf'


gsautter commented 6 years ago

Looks like an outdated base JAR, namely Gamta.jar. The new RefParse version uses some recently added pattern matching features, and those don't seem to exist in the Gamta.jar used in that run (a NoSuchMethodError is almost always caused by something like this).
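To check for a stale JAR without running a whole batch, a small reflection probe can help. This is only a sketch: `JarCheck`, `hasMethod` and `loadedFrom` are hypothetical names and not part of GGI or GAMTA, and the demo in `main` probes a JDK class so that it runs without any extra JARs on the classpath.

```java
import java.net.URL;
import java.security.CodeSource;

// Hypothetical helper, not part of GGI or GAMTA.
public class JarCheck {

    /** True if the class can be loaded and declares a method with this name and these parameter types. */
    public static boolean hasMethod(String className, String methodName, Class<?>... paramTypes) {
        try {
            // Load without running static initializers; we only want to inspect the class.
            Class<?> cls = Class.forName(className, false, JarCheck.class.getClassLoader());
            // getDeclaredMethod also finds non-public methods, but not inherited ones.
            cls.getDeclaredMethod(methodName, paramTypes);
            return true;
        } catch (ClassNotFoundException | NoSuchMethodException | LinkageError e) {
            return false;
        }
    }

    /** Where a class was loaded from, which shows which JAR is actually in use. */
    public static String loadedFrom(Class<?> cls) {
        CodeSource src = cls.getProtectionDomain().getCodeSource();
        URL location = (src == null) ? null : src.getLocation();
        return (location == null) ? "(bootstrap or unknown)" : location.toString();
    }

    public static void main(String[] args) {
        // Demo against a JDK class, so no GGI JARs are needed here.
        System.out.println(hasMethod("java.lang.String", "isEmpty"));
        System.out.println(hasMethod("java.lang.String", "noSuchMethod"));
        System.out.println(loadedFrom(JarCheck.class));
    }
}
```

With Gamta.jar on the classpath, the same probe could be pointed at the class and method named in the stack trace (`de.uka.ipd.idaho.gamta.util.AnnotationPatternMatcher$AnnotationIndex`, `addAnnotation`, with `Annotation` and `String` parameters); a `false` result would be consistent with an outdated JAR.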

I've just packed an update including the latest additions and put it online. If it doesn't download automatically, you can get it manually from http://tb.plazi.org/GgServer/Updates/GgUpdate.IM.20180926-1628.zip . Unzip it in your GGI root folder and it should work as intended. Thanks to intermediate result caching, you won't have to decode all the PDFs again: processing will resume with the bibliography (the step in which the error occurred).
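To illustrate the general idea of resuming from cached intermediate results, here is a minimal sketch. It is not how GGI's BatchCache is implemented: `ResumableBatch`, the marker file, and the `failAt` parameter are assumptions made for the example, and "Decode PDF" is a made-up step name (the other step names are taken from the log above).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of checkpointed batch processing; not GGI code.
public class ResumableBatch {

    /**
     * Runs the steps in order, skipping those already recorded in markerFile.
     * Stops before the step named failAt (simulating an error); pass null for no failure.
     * Returns the names of the steps executed in this run.
     */
    public static List<String> run(Path markerFile, List<String> steps, String failAt) throws IOException {
        Set<String> done = new HashSet<>();
        if (Files.exists(markerFile)) {
            done.addAll(Files.readAllLines(markerFile));
        }
        List<String> executed = new ArrayList<>();
        for (String step : steps) {
            if (done.contains(step)) {
                continue; // result is cached from an earlier run
            }
            if (step.equals(failAt)) {
                break; // simulated failure: nothing is recorded for this step
            }
            executed.add(step);
            // Record completion right away so a later crash does not lose it.
            Files.write(markerFile, (step + System.lineSeparator()).getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }
        return executed;
    }

    public static void main(String[] args) throws IOException {
        Path marker = Files.createTempFile("batch", ".done");
        List<String> steps = Arrays.asList(
                "Decode PDF", "Add Document Meta Data", "Mark Taxonomic Keys", "Parse Bibliography");
        // First run fails at the bibliography step.
        System.out.println(run(marker, steps, "Parse Bibliography"));
        // Second run skips the completed steps and resumes with the bibliography.
        System.out.println(run(marker, steps, null));
    }
}
```

The second call executes only 'Parse Bibliography', which mirrors the behaviour described above: the expensive earlier steps are not repeated.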

myrmoteras commented 6 years ago

This works better, but so far these two still do not work. This one does not work at all: zootaxa.4483.1.1

And this one has the problems below: zootaxa.4483.1.3. The rest is running right now.

Doing catalog lookups for 1 catalog salvaged seeds: Extending primary rank dictionaries Salvaging style eliminated seeds using style of catalog salvaged seeds Doing catalog lookups for 1 style salvaged seeds: Extending primary rank dictionaries Indexing epithet occurrences Indexing labeled epithets and potential epithets Indexing taxonomic status labels Indexing potential authorities Indexing name infixes and in-name symbols Assembling taxonomic names Expanding abbreviated genera and species Linking taxonomic names with non-catalog genera to families Linking taxonomic names to catalog data Getting taxonomic names Adding document authority for original names Adding or resetting verbatim authorities Bucketizing taxonomic names Merging equal buckets Merging compatible buckets Handling new combinations and status changes Transferring attributes already present in document Sorting out done-with buckets Loading authority data for 51 taxon names Storing document data Storing page data Storing word data Storing region data Storing annotation data Storing font data Storing page images Storing page image data Storing supplement data Storing supplements Document stored to temporary folder Running Image Markup Tool 'Clean Table Annotations' Storing document data Storing page data Storing word data Storing region data Storing annotation data Storing font data Storing page images Storing page image data Storing supplement data Storing supplements Document stored to temporary folder Running Image Markup Tool 'Remove Duplicate Annotations' Storing document data Storing page data Storing word data Storing region data Storing annotation data Storing font data Storing page images Storing page image data Storing supplement data Storing supplements Document stored to temporary folder Running Image Markup Tool 'Mark Treatments (Headings Only)' Wrapping document Checking document Loading document processor Storing document data Storing page data Storing word data Storing region data 
Storing annotation data Storing font data Storing page images Storing page image data Storing supplement data Storing supplements Document stored to temporary folder Running Image Markup Tool 'Extract Materials Citations' Wrapping document Checking document Loading document processor Detecting materials citation paragraphs Getting document person names Marking materials citations Setting materials citation attributes Detecting materials citation paragraphs Getting document person names Marking materials citations Setting materials citation attributes Detecting materials citation paragraphs Getting document person names Marking materials citations Setting materials citation attributes Detecting materials citation paragraphs Getting document person names Marking materials citations Setting materials citation attributes Detecting materials citation paragraphs Getting document person names Marking materials citations Setting materials citation attributes Detecting materials citation paragraphs Getting document person names Marking materials citations Setting materials citation attributes Detecting materials citation paragraphs Getting document person names Marking materials citations Setting materials citation attributes Detecting materials citation paragraphs Getting document person names Marking materials citations Setting materials citation attributes Detecting materials citation paragraphs Getting document person names Marking materials citations Setting materials citation attributes Detecting materials citation paragraphs Getting document person names Marking materials citations Setting materials citation attributes Detecting materials citation paragraphs Getting document person names Marking materials citations Setting materials citation attributes Detecting materials citation paragraphs Getting document person names Marking materials citations Setting materials citation attributes Detecting materials citation paragraphs Getting document person names Marking 
materials citations Setting materials citation attributes Detecting materials citation paragraphs Getting document person names Marking materials citations Setting materials citation attributes Detecting materials citation paragraphs Getting document person names Marking materials citations Setting materials citation attributes Detecting materials citation paragraphs Getting document person names Marking materials citations Setting materials citation attributes Detecting materials citation paragraphs Getting document person names Marking materials citations Setting materials citation attributes Detecting materials citation paragraphs Getting document person names Marking materials citations Setting materials citation attributes Detecting materials citation paragraphs Getting document person names Marking materials citations Setting materials citation attributes Detecting materials citation paragraphs Getting document person names Marking materials citations Setting materials citation attributes Detecting materials citation paragraphs Getting document person names Marking materials citations Setting materials citation attributes Detecting materials citation paragraphs Getting document person names Marking materials citations Setting materials citation attributes Detecting materials citation paragraphs Getting document person names Marking materials citations Setting materials citation attributes Detecting materials citation paragraphs Getting document person names Marking materials citations Setting materials citation attributes Detecting materials citation paragraphs Getting document person names Marking materials citations Setting materials citation attributes Detecting materials citation paragraphs Getting document person names Marking materials citations Setting materials citation attributes Detecting materials citation paragraphs Getting document person names Marking materials citations Setting materials citation attributes Detecting materials citation 
paragraphs Getting document person names Marking materials citations Setting materials citation attributes Detecting materials citation paragraphs Getting document person names Marking materials citations Setting materials citation attributes Detecting materials citation paragraphs Getting document person names Marking materials citations Setting materials citation attributes Storing document data Storing page data Storing word data Storing region data Storing annotation data Storing font data Storing page images Storing page image data Storing supplement data Storing supplements Document stored to temporary folder
Running Image Markup Tool 'Check Annotation Nesting' Indexing Annotations by Text Stream Checking Annotations
Storing document to 'E:\diglib\zootaxa\temp19\zootaxa.4483.1.3.pdf.imf' Storing document data Storing page data Storing word data Storing region data Storing annotation data Storing font data Storing page images Storing page image data Storing supplement data Storing supplements Document stored
Wrapping document Getting treatments Getting taxa Building reference string Preparing export file Exporting meta.xml Exporting eml.xml Exporting taxa.txt Exporting occurrences.txt Exporting descriptions.txt Exporting distribution.txt Exporting media.txt Exporting references.txt Exporting vernaculars.txt Finishing export
Error exporting document 'E:\diglib\zootaxa\temp19\zootaxa.4483.1.3.pdf' via 'Export Figures & Tables': No captions whose targets to export.
java.io.IOException: No captions whose targets to export.
    at de.uka.ipd.idaho.goldenGateServer.plazi.imagine.FigureTableDocumentExporter.exportDocument(FigureTableDocumentExporter.java:214)
    at de.uka.ipd.idaho.goldenGateServer.plazi.imagine.FigureTableDocumentExporter.exportDocument(FigureTableDocumentExporter.java:202)
    at de.uka.ipd.idaho.im.imagine.batch.GoldenGateImagineBatch.main(GoldenGateImagineBatch.java:756)
Processing document 'E:\diglib\zootaxa\temp19\zootaxa.4483.1.4.pdf'

gsautter commented 6 years ago

That error is not really a problem, more a notification that there was nothing to export (neither figures nor tables) to the <docName>.figuresTables.zip.

I'll investigate the other PDF. From the looks of it, though, it is sitting in FAT, doing lookups in CoL. Those can take a while in case of cache misses, especially for 48 (potential) genera.

gsautter commented 6 years ago

Just ran zootaxa.4483.1.1, and it worked fine ... it's processed up to and including the bibliography and citations, UUID is 2157AB13FF9A353DD007FFCDC556FFF1.

gsautter commented 6 years ago

zootaxa.4483.1.3.pdf apparently did work after all ... just opened it from the server to check for any problems, only to find a fully marked up document.