gsautter / goldengate-imagine

Automatically exported from code.google.com/p/goldengate-imagine

GGI new build: error in parsing #463

Open myrmoteras opened 6 years ago

myrmoteras commented 6 years ago

GgImagineBatch.20180514-0655.out.zip

This is the log of the parsing of zootaxa.4419.1.1 (97 MB). If you need the file, let me know.

gsautter commented 6 years ago

No need for the PDF, thanks. This looks like a NoSuchMethodError, happening where document structure detection tries to use one of the new layout analysis functions I've added to ImageMarkup.jar.

If you downloaded the entire new build, you should have the latest ImageMarkup.jar ... please check timestamp and exact size (in bytes) for comparison. An update with the most recent version is available from http://tb.plazi.org/GgServer/Updates/GgUpdate.IM.20180514-0720.zip now (had mis-labeled it late on Friday).

gsautter commented 6 years ago

As mentioned in #464 (now closed as a duplicate of this one), the size of ImageMarkup.jar should be 942.121 bytes. If it is anything else, please download http://tb.plazi.org/GgServer/Updates/GgUpdate.IM.20180514-0720.zip, un-zip its contents into your GGI root folder, and try again. Also, please make sure the batch runs in that very folder.
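For reference, the size check described here could be scripted roughly like this (a sketch assuming a Unix-like shell; the expected byte size and update URL are from this thread, while the install path and the dummy-jar setup are placeholders for demonstration):

```shell
#!/bin/sh
# Sketch: verify ImageMarkup.jar against the expected build size
# (942121 bytes per this thread). For demonstration, GGI_ROOT points at a
# temp folder seeded with a dummy jar; on a real install, set GGI_ROOT
# to your GGI root folder instead.
EXPECTED=942121
GGI_ROOT="${GGI_ROOT:-/tmp/ggi_demo_root}"
mkdir -p "$GGI_ROOT"
[ -f "$GGI_ROOT/ImageMarkup.jar" ] || printf 'dummy' > "$GGI_ROOT/ImageMarkup.jar"

ACTUAL=$(wc -c < "$GGI_ROOT/ImageMarkup.jar")
if [ "$ACTUAL" -eq "$EXPECTED" ]; then
    echo "ImageMarkup.jar is up to date"
else
    echo "stale ImageMarkup.jar (expected $EXPECTED bytes)"
    # On a real install, fetch and unpack the update into the GGI root:
    #   curl -O http://tb.plazi.org/GgServer/Updates/GgUpdate.IM.20180514-0720.zip
    #   unzip -o GgUpdate.IM.20180514-0720.zip -d "$GGI_ROOT"
fi
```

The unzip step is commented out so the sketch runs without network access; on the actual installation it is the step that replaces the stale JAR in place.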

myrmoteras commented 6 years ago

The instructions yesterday were not clear. By a "new version" I understood that I would get a new version, not just some JAR files that I have to copy into the root directory, which is what you explain in this morning's message.

myrmoteras commented 6 years ago

I did what you wrote, but it does not work for this file: https://drive.google.com/file/d/1FttXceZLo9cz1T6aeEt_v8gXdQIizjOm/view?usp=sharing

```
D:\GoldenGateImagine20170823>java -jar -Xmx10240m GgImagineBatch.jar "DATA=E:\diglib\zootaxa\temp" DT=D CACHE=./BatchCache FM=U
Loading parameters
GoldenGATE Imagine core created, configuration is Default.imagine
Image Markup Tool 'StructureDetector' loaded
Image Markup Tool 'MetaDataAdder' loaded
Image Markup Tool 'KeyHandler' loaded
Image Markup Tool 'ParseBibliography.imTool' loaded
Image Markup Tool 'MarkBibRefCitations.imTool' loaded
Image Markup Tool 'MarkTaxonNames.imTool' loaded
Image Markup Tool 'TableAnnotCleaner' loaded
Image Markup Tool 'RemoveDuplicateAnnots' loaded
Image Markup Tool 'TreatmentTaggerStyled.imTool' loaded
Image Markup Tool 'ExtractMaterialsCitations.imTool' loaded
Image Markup Tool 'CheckAnnotNesting' loaded
Processing document 'E:\diglib\zootaxa\temp\zootaxa.4419.1.1.pdf'
```

Attached log: GgImagineBatch.20180515-0843.out.zip

myrmoteras commented 6 years ago

These files did not process either: zootaxa.4420.1.2.pdf, zootaxa.4420.1.3.pdf, zootaxa.4420.1.4.pdf

gsautter commented 6 years ago

Regarding getting a new version: as has always been the case with GG Editor, there are two parts to GG Imagine. One is the configuration, with all the gizmos, etc., which has an in-installation update. The other part is the application core, which (among a lot of other things) provides libraries (packed in JARs) with basic functionality that the gizmos use and build upon.

If you only do an online configuration update, you won't get the latest application core updates, so you might (and in this instance did) end up with a new version of a gizmo (the document structure detector in this instance) that relies and depends upon application core functions that just did not come in. The auto-update for the application core should normally keep the latter up to date as well, but in case that does not happen for some reason, errors do happen, and some manual care for the application core may be required.

Now I'm well aware that there is a bit of complexity here, as well as a bit of a challenge to the user ... but then, I'm sorry I'm neither Microsoft nor Apple nor Adobe nor even Google. Just no way of living up to these standards while still keeping on innovating, let alone in my off-dayjob time. But still sorry you had to manually replace some JAR ...

gsautter commented 6 years ago

From what your latest report shows, the batch got all the way to the materials citations. Now looking at the ZIP of logs you provided as well (thankfully), it looks like materials citation processing broke off in the middle of some up-front detail tagging, namely during a lookup in a CSV-backed thesaurus of potential collection codes ... not sure I've touched CSV lookups in the past two years, actually ...

Hard to tell why processing actually broke off, as there is not even the slightest indication of anything going south ... unless you killed it manually. The only thing I could imagine is the batch running out of memory. That said, what happens if you just re-run the batch on these PDFs? Due to intermediate result preservation, it should pick up at the first unfinished step, which at least for the PDF you provided the log for is the materials citations. If there ever was a memory issue or anything in that department, a mere restart should get the job finished.
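The resume-on-restart behavior described here could be sketched roughly as follows (a hypothetical simplification with made-up marker files and step names; the actual GGI batch cache format differs):

```shell
#!/bin/sh
# Hypothetical sketch of resume-from-cache: each processing step leaves a
# marker file in the cache, and a re-run skips steps already completed,
# picking up at the first unfinished one.
CACHE="${CACHE:-/tmp/ggi_batch_cache}"
mkdir -p "$CACHE"

run_step() {
    doc="$1"; step="$2"
    marker="$CACHE/$doc.$step.done"
    if [ -f "$marker" ]; then
        echo "skip $step for $doc (already done)"
    else
        echo "run  $step for $doc"
        # ... actual processing would happen here ...
        touch "$marker"
    fi
}

# First invocation runs both steps; the repeated call simulates a restart,
# which finds the marker and skips the step instead of redoing it.
run_step zootaxa.4420.2.1 structureDetection
run_step zootaxa.4420.2.1 materialsCitations
run_step zootaxa.4420.2.1 materialsCitations
```

This is why a mere restart after an out-of-memory kill can finish the job: completed steps are never redone, and only the broken-off step runs again.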

myrmoteras commented 6 years ago

@gsautter it is not a question of Google-to-be: it is simply a question of communication. You just need to add one sentence explaining what to do with the ZIP file. You cannot cut down on expressiveness in our communication beyond a certain degree, otherwise the interaction falters. Similarly to your complaining about missing information on the unprocessable source files, which to me is clear when I write an issue.

myrmoteras commented 6 years ago

I will run the files individually. That will happen after Friday, when I am off the hook from a review panel. Memory should not be the issue, but it might be that a file takes VERY long (>20 min) to process, in which case I normally kill it.

gsautter commented 6 years ago

Regarding the Google-to-be ... I'm trying hard to live and code and build releases to these standards, but doing so in my own time, for now there simply are some inherent limits ...

On the technical level, I kind of cannot seem to shake the feeling that there still is some installation specific issue with the application core auto-update on your machine ... so I'd like to propose the following:

This will replace your application core with the latest version, but without affecting any caches or anything stored in your continuously maintained configurations.

gsautter commented 6 years ago

Regarding materials citations stalling (yet again), I will investigate. But that is an independent issue altogether, as the routines involved in materials citation handling have not changed in the past half year or so.

myrmoteras commented 6 years ago

@gsautter this did not work:

```
Microsoft Windows [Version 10.0.17134.48]
(c) 2018 Microsoft Corporation. All rights reserved.

C:\Users\Donat>d:

D:\>cd GoldenGateImagine20170823

D:\GoldenGateImagine20170823>java -jar -Xmx10240m GgImagineBatch.jar "DATA=E:\diglib\zootaxa\temp" DT=D CACHE=./BatchCache FM=U
Loading parameters
GoldenGATE Imagine core created, configuration is Default.imagine
Image Markup Tool 'StructureDetector' loaded
Image Markup Tool 'MetaDataAdder' loaded
Image Markup Tool 'KeyHandler' loaded
Image Markup Tool 'ParseBibliography.imTool' loaded
Image Markup Tool 'MarkBibRefCitations.imTool' loaded
Image Markup Tool 'MarkTaxonNames.imTool' loaded
Image Markup Tool 'TableAnnotCleaner' loaded
Image Markup Tool 'RemoveDuplicateAnnots' loaded
Image Markup Tool 'TreatmentTaggerStyled.imTool' loaded
Image Markup Tool 'ExtractMaterialsCitations.imTool' loaded
Image Markup Tool 'CheckAnnotNesting' loaded
Processing document 'E:\diglib\zootaxa\temp\zootaxa.4420.2.1.pdf'

D:\GoldenGateImagine20170823>java -jar -Xmx10240m GgImagineBatch.jar "DATA=E:\diglib\zootaxa\temp" DT=D CACHE=./BatchCache FM=U
Loading parameters
GoldenGATE Imagine core created, configuration is Default.imagine
Image Markup Tool 'StructureDetector' loaded
Image Markup Tool 'MetaDataAdder' loaded
Image Markup Tool 'KeyHandler' loaded
Image Markup Tool 'ParseBibliography.imTool' loaded
Image Markup Tool 'MarkBibRefCitations.imTool' loaded
Image Markup Tool 'MarkTaxonNames.imTool' loaded
Image Markup Tool 'TableAnnotCleaner' loaded
Image Markup Tool 'RemoveDuplicateAnnots' loaded
Image Markup Tool 'TreatmentTaggerStyled.imTool' loaded
Image Markup Tool 'ExtractMaterialsCitations.imTool' loaded
Image Markup Tool 'CheckAnnotNesting' loaded
Processing document 'E:\diglib\zootaxa\temp\zootaxa.4420.2.1.pdf'

D:\GoldenGateImagine20170823>
```

Affected file: zootaxa.4420.2.1.pdf

gsautter commented 6 years ago

OK, this tells me we have some error in the materials citation parsing stage of the batch ... even though I didn't change anything there in the last build, and neither in the preceding builds.

Any chance the error log might shed some light? I do see materials citation extraction breaks at some early point, but unfortunately without any hints as to the why.

gsautter commented 6 years ago

If you have uploaded the IMFs without the materials citations to the server, an IMF UUID would vastly speed up my investigation into this issue.

gsautter commented 6 years ago

Also, does the batch just terminate (suggesting some explicit and error log resident execution problem), or does it seem to hang and just take up a lot of CPU cycles?

Would help a great deal with resolving this issue ...

myrmoteras commented 6 years ago

I don't have an IMF for any of these articles. I will make the PDFs accessible to you so you can run some as a batch on your machine. Then we'll see whether or not they work on your machine, and you will have the error log. When I run the batch, the log is typically too big to export.

gsautter commented 6 years ago

Regarding uploading the IMF, I think I've explained before how to load a partially processed document from the batch cache and put it on the server. Was just asking, though ...

gsautter commented 6 years ago

Regarding the error log, if something goes wrong in a programming kind of sense, especially on re-running a batch processing job after killing it off in the Task Manager, the batch should break down pretty much immediately, leaving a log file that should be small enough to handle.

On the other hand, if any step left in the batch to handle after a restart runs into a prohibitive combinatoric explosion with some regular expression pattern, that would both explain the prohibitive log file size and hint at yet another somewhat pathological case of data to consider ... hence the question about whether batch processing just broke down, or whether you killed it after several (or tens of) minutes of apparently going nowhere. Which one is it?

myrmoteras commented 6 years ago

This is the log after running the ten files, of which not all processed, so I can't give you an answer. The log is for a batch, not a single file.

If the batch runs in a loop, then I need to kill the process. In this new situation, files simply fail to process even without me killing the entire batch.

gsautter commented 6 years ago

Well, if you run the batch with DATA=E:\diglib\zootaxa\temp\zootaxa.4420.2.1.pdf, you will definitely get the result and error log for a single PDF ... Please do so and help me shed some light on what the problem might be.
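The full single-file invocation would then look like this (same options as in the logs earlier in this thread, only with DATA pointing at one PDF instead of the whole folder; run it from the GGI root folder, D:\GoldenGateImagine20170823 in this thread, as noted above):

```shell
java -jar -Xmx10240m GgImagineBatch.jar "DATA=E:\diglib\zootaxa\temp\zootaxa.4420.2.1.pdf" DT=D CACHE=./BatchCache FM=U
```

Because of the intermediate result cache, this should resume at the first unfinished step for that PDF rather than reprocess it from scratch, and it leaves a log covering only that one document.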