gsautter / goldengate-imagine

Automatically exported from code.google.com/p/goldengate-imagine

batch problems: don't work #501

Open myrmoteras opened 6 years ago

myrmoteras commented 6 years ago

@gsautter can you please run these files through your system as a batch? They don't work here. Run this:

java -jar -Xmx10240m GgImagineBatch.jar "DATA=E:\diglib\zootaxa\temp" CACHE=./BatchCache FM=U

https://drive.google.com/open?id=1zuD8LbnBy4uixmNgBslkCENXOtfW0bP3

gsautter commented 6 years ago

Will do.

myrmoteras commented 6 years ago

I added some more files; you can also see what did and what did not work here: https://docs.google.com/spreadsheets/d/1q-VskOXT87Qt_Mj1MNjGFZOdCKq4UDDQgxTe6WXaxDs/edit#gid=0. All those marked in orange did not work. These are not showstoppers; they just did not process.

myrmoteras commented 6 years ago

some more from the 80 articles processed today are in the above directory, 31 in total

gsautter commented 6 years ago

I've been all over this all day ... something really strange is going on here. And I'm trying hard to figure out what.

In particular, memory consumption goes way higher with the batch than when converting the very same PDF with my test program, with the same font decoding mode and all. Trying hard to narrow it down, and most likely will hit my forehead with my palm really hard once I figure it out ...

gsautter commented 6 years ago

Especially in longer documents, it seems reference tagging (not parsing) is going off on some kind of combinatoric rampage ... figuring out countermeasures.

gsautter commented 6 years ago

And only a few PDFs after I figured it might help to pre-filter paragraphs by the number of words they contain (by a maximum), zootaxa.4476.1.4.pdf comes up with a reference to a Nature article with some 40 to 50 authors explicitly listed ... afraid I'll have to find some other way of handling such behemoths.
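
For illustration only, a minimal sketch of the kind of word-count pre-filter described above; the class name, method, and threshold are hypothetical and not taken from the actual RefParse code:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical pre-filter (not the actual RefParse code): skip paragraphs
// whose word count makes them implausible as a single bibliographic
// reference, so reference tagging never runs its combinatorics on them.
public class ReferenceCandidateFilter {

	// assumed cut-off; as the comment above shows, a fixed cap like this
	// breaks down on legitimate references with very many authors
	private static final int MAX_WORDS_PER_REFERENCE = 120;

	public static List<String> filterCandidates(List<String> paragraphs) {
		List<String> candidates = new ArrayList<>();
		for (String paragraph : paragraphs) {
			int wordCount = paragraph.trim().split("\\s+").length;
			if (wordCount <= MAX_WORDS_PER_REFERENCE)
				candidates.add(paragraph);
			// paragraphs above the cap would need some other, non-combinatoric
			// treatment rather than being dropped silently
		}
		return candidates;
	}
}
```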

gsautter commented 6 years ago

Guess I was estimating a little too conservatively ... that reference has 171 (!!!) author names explicitly listed ... I have to find a way of preventing a runaway in such cases, however rare they are.

myrmoteras commented 6 years ago

here is one more monograph that did not work today, zootaxa.4487.1.1.pdf

gsautter commented 6 years ago

The original 15 problem PDFs all process successfully now; taking on the ones added since is the next thing.

gsautter commented 6 years ago

The problems I solved were most prominently these two:

Fixes come with the next update, but I want to test them on the other PDFs first.

myrmoteras commented 6 years ago

@gsautter how do you want to proceed with the nonprocessing PDFs? I just keep adding PDFs to the directory https://docs.google.com/spreadsheets/d/1q-VskOXT87Qt_Mj1MNjGFZOdCKq4UDDQgxTe6WXaxDs/edit#gid=0

Please explain how you envision getting them processed. Shall I wait until a new build comes out and then run them all again? Or will you process them and then let me know?

gsautter commented 6 years ago

Well, since my only way of figuring out why these PDFs don't process is to run them, is there much of a point in you running them again later? Some of them are pretty large ... The question is how I pass you the (even larger) IMFs for whatever checks you do before uploading them. What's the size limit on Google Drive?

gsautter commented 6 years ago

On the plus side, I got all but 2 in the trouble folder processing now, working hard to get the remaining 2 to run.

myrmoteras commented 6 years ago

I guess the best would be to run them again to see whether the new build is OK, because otherwise I have to go through each document stepwise, and that takes more time.

If they process fine, then you might as well upload them to BLR. The only thing you might want to check by opening the IMF is whether the fonts are all right and all the figure captions have been discovered. Other errors I assume we can check later. In this case, you just need to let me know which articles you have processed by providing a list of UUIDs.

gsautter commented 6 years ago

OK, I'll try and fix the remaining 2, then make a new build.

Once you tell me the new build works fine, either one of us can upload them.

gsautter commented 6 years ago

One more down, just a materials citation runaway to go (zootaxa.4470.1.1.pdf).

myrmoteras commented 6 years ago

why does it take time to make a new build? why can't you just create a new build once you've fixed a bug?

myrmoteras commented 6 years ago

here is one more, and this is a showstopper: zootaxa.4454.1.7.pdf

gsautter commented 6 years ago

You're right ... the new build is out, and it processed the latest error file (zootaxa.4454.1.7.pdf) just fine.

It's just the occasional materials citation runaway pending now.

myrmoteras commented 6 years ago

@gsautter I am running all the files again https://drive.google.com/open?id=1zuD8LbnBy4uixmNgBslkCENXOtfW0bP3

I stopped after the first 3 that do not work.

Then I remembered to remove all the files in \batchCache\TempDocs, and this seems to work now - fingers crossed.

zootaxa.4444.5.3 does not work:

Running Image Markup Tool 'Parse Bibliography'
Wrapping document
Checking document
Loading document processor
Error processing document 'E:\diglib\zootaxa\temp28\zootaxa.4444.5.3.pdf': Annotation size out of bounds: 0
java.lang.RuntimeException: Annotation size out of bounds: 0
at de.uka.ipd.idaho.gamta.Gamta.newAnnotation(Gamta.java:415)
at de.uka.ipd.idaho.plugins.bibRefs.refParse.RefParse.trimUrlOrDoi(RefParse.java:2404)
at de.uka.ipd.idaho.plugins.bibRefs.refParse.RefParse.getBrokenURLsAndDOIs(RefParse.java:2334)
at de.uka.ipd.idaho.plugins.bibRefs.refParse.RefParse.getBaseDetails(RefParse.java:2215)
at de.uka.ipd.idaho.plugins.bibRefs.refParse.RefParse.parseBibRefs(RefParse.java:973)
at de.uka.ipd.idaho.plugins.bibRefs.refParse.RefParse.parseBibRefs(RefParse.java:944)
at de.uka.ipd.idaho.plugins.bibRefs.refParse.RefParseInteractive.processBibRefs(RefParseInteractive.java:41)
at de.uka.ipd.idaho.plugins.bibRefs.refParse.RefParseInteractive.processBibRefs(RefParseInteractive.java:36)
at de.uka.ipd.idaho.plugins.bibRefs.refParse.RefParseAnalyzer.process(RefParseAnalyzer.java:59)
at de.uka.ipd.idaho.goldenGate.plugin.analyzers.AnalyzerManager$AnalyzerDocumentProcessor.process(AnalyzerManager.java:207)
at de.uka.ipd.idaho.goldenGate.plugin.pipelines.PipelineManager$PipelineDocumentProcessor.process(PipelineManager.java:329)
at de.uka.ipd.idaho.goldenGate.plugin.pipelines.PipelineManager$PipelineDocumentProcessor.process(PipelineManager.java:295)
at de.uka.ipd.idaho.im.imagine.plugins.tools.ImageMarkupToolManager$DpImageMarkupTool.process(ImageMarkupToolManager.java:549)
at de.uka.ipd.idaho.im.imagine.batch.GoldenGateImagineBatch.main(GoldenGateImagineBatch.java:703)
...................

Processing document 'E:\diglib\zootaxa\temp28\zootaxa.4445.1.1.pdf'

myrmoteras commented 6 years ago

these ones stop because of bibliographic issues: zootaxa.4445.1.1, zootaxa.4453.1.1, zootaxa.4477.1.1

these ones do not process (generally with an "Annotation size out of bounds: 0" error): zootaxa.4446.2.1, zootaxa.4449.1.1, zootaxa.4466.1.6, zootaxa.4466.1.14, zootaxa.4466.1.16, zootaxa.4469.1.1, zootaxa.4471.1.3, zootaxa.4471.1.6, zootaxa.4472.2.6, zootaxa.4472.3.2, zootaxa.4472.3.11, zootaxa.4476.1.4, zootaxa.4481.1.1, zootaxa.4482.1.1, zootaxa.4483.1.8, zootaxa.4486.4.2

these seem to have materialsCitation issues: zootaxa.4462.2.2, zootaxa.4466.1.4, zootaxa.4470.1.1

gsautter commented 6 years ago

The "Annotation size out of bounds: 0" also is a bibliography issue, namely a glitch in broken URL recovery, but one I fixed a good bit ago ...

The materials citations are next.

For the bibliography issues, and other things that appear to re-surface a good while after fixing, I've just created a full new build. Let's hope this finally leaves past bugs in the past.
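
For context on the "Annotation size out of bounds: 0" stack trace above: the exception is thrown when trimming a broken URL/DOI candidate leaves an empty token range, which then gets passed to annotation creation. A minimal sketch of the kind of guard that avoids this, with hypothetical names that do not reflect the actual RefParse internals:

```java
// Hypothetical guard (not the actual RefParse code): when stripping trailing
// punctuation from a URL/DOI candidate consumes the whole token range, report
// "no annotation" instead of creating a zero-size annotation, which is what
// makes Gamta.newAnnotation() throw "Annotation size out of bounds: 0".
public class UrlTrimGuard {

	public static int[] trimUrlOrDoiRange(String[] tokens, int start, int end) {
		while ((end > start) && tokens[end - 1].matches("\\p{Punct}+"))
			end--; // strip trailing punctuation tokens
		if (end <= start)
			return null; // nothing left; the caller skips annotation creation
		return new int[] {start, end};
	}
}
```

A caller would then only create the URL/DOI annotation if the returned range is non-null.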

myrmoteras commented 6 years ago

ok - I am running with the new download the remaining files again.

these don't work and hang on a materialsCitation issue: zootaxa.4462.2.2, zootaxa.4470.1.1 (the latter is a paper with very densely packed materials citations; it ran through a couple of them and then stalled)

these don't process: zootaxa.4466.1.16, zootaxa.4472.3.11

gsautter commented 6 years ago

Sorry for the two with the materials citations. As mentioned above, I'll take on these next. The other two I'll check out.

gsautter commented 6 years ago

The other two (zootaxa.4466.1.16 and zootaxa.4472.3.11) actually process just fine. The two error messages at the end are merely due to a failed (local) DwC-A export, which is additional to the processing result. We originally added that for demo purposes, and could easily remove it without incurring any changes in the resulting IMF proper.

The failure messages for DwC-A creation are somewhat different, though:

myrmoteras commented 6 years ago

I am trying to clean up loose ends here, so some more files to follow. More materialsCitation issues: zootaxa.4358.1.2, zootaxa.4437.1.1

myrmoteras commented 6 years ago

The other two (zootaxa.4466.1.16 and zootaxa.4472.3.11) did not process here, so could you process them and upload them to BLR?

gsautter commented 6 years ago

Thanks for the additional materials citation test case.

For the other two (zootaxa.4466.1.16 and zootaxa.4472.3.11), could you share the batch output? It would be helpful to know where they fail ... I did get an export error too, see above, but that happens after processing is finished (and has no detrimental effect at all) - it merely states "cannot export DwC-A for lack of treatments".
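
Since the DwC-A export runs only after the IMF has been produced, one way to keep an export failure from looking like a failed document (or to drop the step entirely, as suggested above) is to isolate it from the main processing path. A rough sketch under that assumption, with hypothetical method names that are not the actual batch code:

```java
// Hypothetical post-processing wrapper (not the actual GgImagineBatch code):
// the IMF is the real processing result, the local DwC-A export is a demo-only
// extra, so an export failure is logged but does not mark the document as failed.
public class BatchPostProcessing {

	public static void finishDocument(Object imfDocument, String docName) {
		storeImf(imfDocument, docName); // the actual batch result
		try {
			exportDwcA(imfDocument, docName); // optional extra output
		}
		catch (Exception e) {
			// e.g. "cannot export DwC-A for lack of treatments"
			System.out.println("DwC-A export failed for " + docName + " (non-fatal): " + e.getMessage());
		}
	}

	private static void storeImf(Object imfDocument, String docName) {
		// placeholder for writing the IMF to the output folder
	}

	private static void exportDwcA(Object imfDocument, String docName) throws Exception {
		// placeholder for the optional Darwin Core Archive export
	}
}
```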

myrmoteras commented 6 years ago

this one has a font error, even using the latest build: zootaxa.4438.3.3

myrmoteras commented 6 years ago

additional showstoppers with materialsCitation issues: zootaxa.4456.1.1, zootaxa.4084.2.4, zootaxa.4084.2.5, zootaxa.4084.2.8, zootaxa.4084.3.1, zt03789p072, zt03792p534

myrmoteras commented 6 years ago

I added the batch output to the GGDirectory

myrmoteras commented 6 years ago

bibRefs issues? zt03767p256