gsautter / goldengate-imagine

Automatically exported from code.google.com/p/goldengate-imagine

GGI new build: error in parsing #463

Open myrmoteras opened 6 years ago

myrmoteras commented 6 years ago

GgImagineBatch.20180514-0655.out.zip

This is the log of the parsing of zootaxa.4419.1.1 (97 MB). If you need the file, let me know.

gsautter commented 6 years ago

No need for the PDF, thanks. This looks like a NoSuchMethodError, happening where document structure detection tries to use one of the new layout analysis functions I've added to ImageMarkup.jar.

If you downloaded the entire new build, you should have the latest ImageMarkup.jar ... please check timestamp and exact size (in bytes) for comparison. An update with the most recent version is available from http://tb.plazi.org/GgServer/Updates/GgUpdate.IM.20180514-0720.zip now (had mis-labeled it late on Friday).

gsautter commented 6 years ago

As mentioned in #464 (now closed as a duplicate of this one), the size of ImageMarkup.jar should be 942.121 bytes. If it is anything else, please download http://tb.plazi.org/GgServer/Updates/GgUpdate.IM.20180514-0720.zip, un-zip its contents into your GGI root folder, and try again. Also, please make sure the batch runs in that very folder.
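For reference, the size check described here could be scripted roughly like this (a sketch assuming a Unix-like shell; the expected byte size and update URL are from this thread, while the install path and the dummy-jar setup are placeholders for demonstration):

```shell
#!/bin/sh
# Sketch: verify ImageMarkup.jar against the expected build size
# (942121 bytes per this thread). For demonstration, GGI_ROOT points at a
# temp folder seeded with a dummy jar; on a real install, set GGI_ROOT
# to your GGI root folder instead.
EXPECTED=942121
GGI_ROOT="${GGI_ROOT:-/tmp/ggi_demo_root}"
mkdir -p "$GGI_ROOT"
[ -f "$GGI_ROOT/ImageMarkup.jar" ] || printf 'dummy' > "$GGI_ROOT/ImageMarkup.jar"

ACTUAL=$(wc -c < "$GGI_ROOT/ImageMarkup.jar")
if [ "$ACTUAL" -eq "$EXPECTED" ]; then
    echo "ImageMarkup.jar is up to date"
else
    echo "stale ImageMarkup.jar (expected $EXPECTED bytes)"
    # On a real install, fetch and unpack the update into the GGI root:
    #   curl -O http://tb.plazi.org/GgServer/Updates/GgUpdate.IM.20180514-0720.zip
    #   unzip -o GgUpdate.IM.20180514-0720.zip -d "$GGI_ROOT"
fi
```

The unzip step is commented out so the sketch runs without network access; on the actual installation it is the step that replaces the stale JAR in place.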

myrmoteras commented 6 years ago

The instructions yesterday were not clear. By a "new version" I understood that I would get a new version, not just some JAR files that I have to copy into the root directory, which is what you explain in this morning's message.

myrmoteras commented 6 years ago

I did what you wrote, but it does not work for this file: https://drive.google.com/file/d/1FttXceZLo9cz1T6aeEt_v8gXdQIizjOm/view?usp=sharing

```
D:\GoldenGateImagine20170823>java -jar -Xmx10240m GgImagineBatch.jar "DATA=E:\diglib\zootaxa\temp" DT=D CACHE=./BatchCache FM=U
Loading parameters
GoldenGATE Imagine core created, configuration is Default.imagine
Image Markup Tool 'StructureDetector' loaded
Image Markup Tool 'MetaDataAdder' loaded
Image Markup Tool 'KeyHandler' loaded
Image Markup Tool 'ParseBibliography.imTool' loaded
Image Markup Tool 'MarkBibRefCitations.imTool' loaded
Image Markup Tool 'MarkTaxonNames.imTool' loaded
Image Markup Tool 'TableAnnotCleaner' loaded
Image Markup Tool 'RemoveDuplicateAnnots' loaded
Image Markup Tool 'TreatmentTaggerStyled.imTool' loaded
Image Markup Tool 'ExtractMaterialsCitations.imTool' loaded
Image Markup Tool 'CheckAnnotNesting' loaded
Processing document 'E:\diglib\zootaxa\temp\zootaxa.4419.1.1.pdf'
```

Attached log: GgImagineBatch.20180515-0843.out.zip

myrmoteras commented 6 years ago

These files did not process either: zootaxa.4420.1.2.pdf, zootaxa.4420.1.3.pdf, zootaxa.4420.1.4.pdf

gsautter commented 6 years ago

Regarding getting a new version: as has always been the case with GG Editor, there are two parts to GG Imagine. One is the configuration, with all the gizmos, etc., which has an in-installation update. The other part is the application core, which (among a lot of other things) provides libraries (packed in JARs) with basic functionality that the gizmos use and build upon.

If you only do an online configuration update, you won't get the latest application core updates, so you might (and in this instance did) end up with a new version of a gizmo (the document structure detector in this instance) that relies and depends upon application core functions that just did not come in. The auto-update for the application core should normally keep the latter up to date as well, but in case that does not happen for some reason, errors do happen, and some manual care for the application core may be required.

Now I'm well aware that there is a bit of complexity here, as well as a bit of a challenge to the user ... but then, I'm sorry I'm neither Microsoft nor Apple nor Adobe nor even Google. Just no way of living up to these standards while still keeping on innovating, let alone in my off-dayjob time. But still sorry you had to manually replace some JAR ...

gsautter commented 6 years ago

From what your latest report shows, the batch got all the way to the materials citations. Now looking at the ZIP of logs you provided as well (thankfully), it looks like materials citation processing broke off in the middle of some up-front detail tagging, namely during a lookup in a CSV-backed thesaurus of potential collection codes ... not sure I've touched CSV lookups in the past two years, actually ...

Hard to tell why processing actually broke off, as there is not even the slightest indication of anything going south ... unless you killed it manually. The only thing I could imagine is the batch running out of memory. That said, what happens if you just re-run the batch on these PDFs? Due to intermediate result preservation, it should pick up at the first unfinished step, which at least for the PDF you provided the log for is the materials citations. If there ever was a memory issue or anything in that department, a mere restart should get the job finished.
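The resume-on-restart behavior described here could be sketched roughly as follows (a hypothetical simplification with made-up marker files and step names; the actual GGI batch cache format differs):

```shell
#!/bin/sh
# Hypothetical sketch of resume-from-cache: each processing step leaves a
# marker file in the cache, and a re-run skips steps already completed,
# picking up at the first unfinished one.
CACHE="${CACHE:-/tmp/ggi_batch_cache}"
mkdir -p "$CACHE"

run_step() {
    doc="$1"; step="$2"
    marker="$CACHE/$doc.$step.done"
    if [ -f "$marker" ]; then
        echo "skip $step for $doc (already done)"
    else
        echo "run  $step for $doc"
        # ... actual processing would happen here ...
        touch "$marker"
    fi
}

# First invocation runs both steps; the repeated call simulates a restart,
# which finds the marker and skips the step instead of redoing it.
run_step zootaxa.4420.2.1 structureDetection
run_step zootaxa.4420.2.1 materialsCitations
run_step zootaxa.4420.2.1 materialsCitations
```

This is why a mere restart after an out-of-memory kill can finish the job: completed steps are never redone, and only the broken-off step runs again.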

myrmoteras commented 6 years ago

@gsautter it is not a question of Google-to-be: it is simply a question of communication. You just need to add one sentence explaining what to do with the ZIP file. You cannot cut down on expressiveness in our communication beyond a certain degree, otherwise the interaction falters. Similarly to your complaining about missing information on the unprocessable source files, which to me is clear when I write an issue.

myrmoteras commented 6 years ago

I will run the files individually. That will happen after Friday, when I am off the hook from a review panel. Memory should not be the issue, but it might be that a file takes VERY long (>20 min) to process, in which case I normally kill it.

gsautter commented 6 years ago

Regarding the Google-to-be ... I'm trying hard to live and code and build releases to these standards, but doing so in my own time, for now there simply are some inherent limits ...

On the technical level, I kind of cannot seem to shake the feeling that there still is some installation specific issue with the application core auto-update on your machine ... so I'd like to propose the following:

This will replace your application core with the latest version, but without affecting any caches or anything stored in your continuously maintained configurations.

gsautter commented 6 years ago

Regarding materials citations stalling (yet again), I will investigate. But that is an independent issue altogether, as the routines involved in materials citation handling have not changed in the past half year or so.

myrmoteras commented 6 years ago

@gsautter this did not work:

```
Microsoft Windows [Version 10.0.17134.48]
(c) 2018 Microsoft Corporation. All rights reserved.

C:\Users\Donat>d:

D:\>cd GoldenGateImagine20170823

D:\GoldenGateImagine20170823>java -jar -Xmx10240m GgImagineBatch.jar "DATA=E:\diglib\zootaxa\temp" DT=D CACHE=./BatchCache FM=U
Loading parameters
GoldenGATE Imagine core created, configuration is Default.imagine
Image Markup Tool 'StructureDetector' loaded
Image Markup Tool 'MetaDataAdder' loaded
Image Markup Tool 'KeyHandler' loaded
Image Markup Tool 'ParseBibliography.imTool' loaded
Image Markup Tool 'MarkBibRefCitations.imTool' loaded
Image Markup Tool 'MarkTaxonNames.imTool' loaded
Image Markup Tool 'TableAnnotCleaner' loaded
Image Markup Tool 'RemoveDuplicateAnnots' loaded
Image Markup Tool 'TreatmentTaggerStyled.imTool' loaded
Image Markup Tool 'ExtractMaterialsCitations.imTool' loaded
Image Markup Tool 'CheckAnnotNesting' loaded
Processing document 'E:\diglib\zootaxa\temp\zootaxa.4420.2.1.pdf'

D:\GoldenGateImagine20170823>java -jar -Xmx10240m GgImagineBatch.jar "DATA=E:\diglib\zootaxa\temp" DT=D CACHE=./BatchCache FM=U
Loading parameters
GoldenGATE Imagine core created, configuration is Default.imagine
Image Markup Tool 'StructureDetector' loaded
Image Markup Tool 'MetaDataAdder' loaded
Image Markup Tool 'KeyHandler' loaded
Image Markup Tool 'ParseBibliography.imTool' loaded
Image Markup Tool 'MarkBibRefCitations.imTool' loaded
Image Markup Tool 'MarkTaxonNames.imTool' loaded
Image Markup Tool 'TableAnnotCleaner' loaded
Image Markup Tool 'RemoveDuplicateAnnots' loaded
Image Markup Tool 'TreatmentTaggerStyled.imTool' loaded
Image Markup Tool 'ExtractMaterialsCitations.imTool' loaded
Image Markup Tool 'CheckAnnotNesting' loaded
Processing document 'E:\diglib\zootaxa\temp\zootaxa.4420.2.1.pdf'

D:\GoldenGateImagine20170823>
```

Affected file: zootaxa.4420.2.1.pdf

gsautter commented 6 years ago

OK, this tells me we have some error in the materials citation parsing stage of the batch ... even though I didn't change anything there in the last build, and neither in the preceding builds.

Any chance the error log might shed some light? I do see materials citation extraction breaks at some early point, but unfortunately without any hints as to the why.

gsautter commented 6 years ago

If you have uploaded the IMFs without the materials citations to the server, an IMF UUID would vastly speed up my investigation into this issue.

gsautter commented 6 years ago

Also, does the batch just terminate (suggesting some explicit and error log resident execution problem), or does it seem to hang and just take up a lot of CPU cycles?

Would help a great deal with resolving this issue ...

myrmoteras commented 6 years ago

I don't have an IMF for any of these articles. I will make the PDFs accessible to you so you can run some as a batch on your machine. Then we'll see whether or not they work on your machine, and you will have the error log. When I run the batch, the log is typically too big to export.

gsautter commented 6 years ago

Regarding uploading the IMF, I think I've explained before how to load a partially processed document from the batch cache and put it on the server. Was just asking, though ...

gsautter commented 6 years ago

Regarding the error log, if something goes wrong in a programming kind of sense, especially on re-running a batch processing job after killing it off in the Task Manager, the batch should break down pretty much immediately, leaving a log file that should be small enough to handle.

On the other hand, if any step left in the batch to handle after a restart runs into a prohibitive combinatoric explosion with some regular expression pattern, that would both explain the prohibitive log file size and hint at yet another somewhat pathological case of data to consider ... hence the question about whether batch processing just broke down, or whether you killed it after several (or tens of) minutes of apparently going nowhere. Which one is it?

myrmoteras commented 6 years ago

This is the log after running the ten files, of which not all processed, so I can't give you an answer. The log is for a batch, not a single file.

If the batch runs in a loop, then I need to kill the process. In this new situation, files simply fail to process even without me killing the entire batch.

gsautter commented 6 years ago

Well, if you run the batch with DATA=E:\diglib\zootaxa\temp\zootaxa.4420.2.1.pdf, you will definitely get the result and error log for a single PDF ... Please do so and help me shed some light on what the problem might be.
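The full single-file invocation would then look like this (same options as in the logs earlier in this thread, only with DATA pointing at one PDF instead of the whole folder; run it from the GGI root folder, D:\GoldenGateImagine20170823 in this thread, as noted above):

```shell
java -jar -Xmx10240m GgImagineBatch.jar "DATA=E:\diglib\zootaxa\temp\zootaxa.4420.2.1.pdf" DT=D CACHE=./BatchCache FM=U
```

Because of the intermediate result cache, this should resume at the first unfinished step for that PDF rather than reprocess it from scratch, and it leaves a log covering only that one document.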