Open myrmoteras opened 4 years ago
Thank you for the spot-on error description ... the log file breaks off somewhere in the middle of bibliography tagging, so I take it the PDF does decode just fine, and it's merely the batch that apparently hangs somewhere in the middle.
What happens when you restart the batch? It should pick up right where the first run got disrupted, i.e., with identifying the bibliography.
no, it stops exactly at the same place GgImagineBatch.20200310-1456.out.zip
The log says otherwise ... it's a different reference the log breaks off at. Looking at the PDF, I see the bibliography is 180 (!!!) pages long, so you might simply have to let this run for an hour or two ... RefParse is tuned for accuracy even on small bibliographies, so all the double checking might take a bit of a toll on performance with this behemoth ...
ok, if you say so, I'll let it run and report tomorrow
Well, 180 pages of bibliography with most likely several thousand (or even around 10,000) references are simply no joke ... and surely not what RefParse is primarily designed to handle, so we'll have to sit and wait this one out overnight ... sorry.
@gsautter this seems to be the perfect candidate to run on our super machine in Frankfurt, and one more hint, that we should really start processing in this facility. Can you give it a try? You have the file.
@gsautter still working, the log file is now 1.3GB and can't be opened. Shall I let it run?
Thinking about this, and having spent our funds on building a super server, we need to justify the funds and move template-based production to Frankfurt. https://github.com/plazi/arcadia-project/issues/140
I'm already running it in Frankfurt ... started it yesterday, but had to restart the back-end for some other reasons. Running again now, in Frankfurt.
ok - interested to see what you are getting. Seems to be a standard to run again to compare speed issues, the data we get out, etc.
So I won't process today's article to not jeopardize this event
@gsautter so it crashed after ca. 36 hours of processing. The log file is 1.8GB and I can't open this. What's next?
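A log of that size does not have to be opened in full to see where the run stopped; reading only the last few lines is enough. The following is a minimal sketch, not part of the batch tooling: the function name `tail_lines` and its parameters are hypothetical, and it assumes the log is plain text with ordinary line breaks.

```python
import os


def tail_lines(path, max_lines=50, block_size=64 * 1024):
    """Return the last max_lines lines of a file without reading it whole."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        data = b""
        # Read backwards block by block until more than max_lines line
        # breaks are collected (or the start of the file is reached), so
        # the lines that are returned are complete.
        while pos > 0 and data.count(b"\n") <= max_lines:
            step = min(block_size, pos)
            pos -= step
            f.seek(pos)
            data = f.read(step) + data
    lines = data.splitlines()
    # Decode leniently: a log may contain bytes that are not valid UTF-8.
    return [line.decode("utf-8", errors="replace") for line in lines[-max_lines:]]
```

Calling `tail_lines("GgImagineBatch.20200310-1456.out", 100)` on the unzipped log would return its last 100 lines as a list of strings, assuming that is the name of the extracted file.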
Still running on the server ... I think I'll increase both the log level and the memory limit and start over ... we'll find out where this thing hangs.
After adding some more logging functionality to the server-side batch and running this document there, I was finally able to at least figure out the culprit: Apparently, the bibliographic citation tagger hangs at some point. That also means that the steps before it went through, i.e., up to bibliography parsing. The steps take a while, for sure, but that's somewhat expected with a 180-page bibliography.
I'll focus on the citation tagger now and try and figure out why that thing appears to stop dead in its tracks at some point.
More surprises ... the author failed to extract via the template because it's set in font size 10 rather than the expected 11-12 ... we'll have to observe whether or not this is a permanent change, and adjust the template if so.
I just think this is an exceptional issue...
In my effort to run this one step by step on the server (it is very possible to have the batch processor use individual tools instead of the whole sequence), I've got the bibliography marked and parsed now ... for a total of 5,762 references ... Will check if it marked the actual bibliography, though, as we've had previous cases of some long, sorted list of binomials throwing off bibliography detection ...
Turns out the bibliography was detected correctly ... it's simply this ginormous number of bibliographic references ... Now, we'll see where the bibliographic citation tagger grinds to a halt ...
Turns out the citation tagger doesn't really seem to grind to a halt after all ... just takes forever with that ginormous bibliography and all the respective citations ... I'll see if I find a few screws to turn that will make it faster ...
Looks as though the modifications I implemented this morning in the bibliographic citation tagger (caching of reference details, especially authors and year) indeed solved the problem ... proceeding to taxonomic names now.
zootaxa.4749.1.1.pdf
the log file is too big to upload here, so here is the link to Google Drive: https://drive.google.com/open?id=1wXKRIklFH-mRJVIieAK4rYsdsZ7UnHmw