Open myrmoteras opened 5 years ago
Running the parallel batch, I am getting this:
The problematic part is not the number of paragraphs that contain MCs, but their individual size ... what takes most of the time is the chunking, i.e., finding the boundaries between individual MCs. And 4452 tokens is a really huge paragraph, I'd tend to think one spanning at least 3 pages.
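Just to illustrate what that chunking amounts to, here is a minimal sketch with made-up names and deliberately simplistic heuristics (this is not the actual GGI code): the chunker has to walk the entire paragraph and decide for every candidate delimiter whether it really separates two MCs, which is why one huge paragraph costs far more than many small ones.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only: split one long materials paragraph into
// individual citations at semicolons or periods, but accept a boundary only
// where brackets and quotes are balanced up to that point.
public class McChunkSketch {

    // candidate boundaries: ';' or '.' followed by whitespace
    private static final Pattern CANDIDATE = Pattern.compile("[;.]\\s+");

    public static List<String> chunk(String paragraph) {
        List<String> citations = new ArrayList<>();
        int start = 0;
        Matcher m = CANDIDATE.matcher(paragraph);
        while (m.find()) {
            String candidate = paragraph.substring(start, m.end());
            // reject the boundary if it would cut through an open bracket or
            // quote - unbalanced ones force the scan to run on and on
            if (balanced(candidate)) {
                citations.add(candidate.trim());
                start = m.end();
            }
        }
        if (start < paragraph.length())
            citations.add(paragraph.substring(start).trim());
        return citations;
    }

    private static boolean balanced(String s) {
        int round = 0, square = 0, quotes = 0;
        for (char c : s.toCharArray()) {
            if (c == '(') round++;
            else if (c == ')') round--;
            else if (c == '[') square++;
            else if (c == ']') square--;
            else if (c == '"') quotes++;
        }
        return round == 0 && square == 0 && (quotes % 2) == 0;
    }
}
```

A sketch like this also hints at why unbalanced brackets or odd quoting can keep a boundary from being accepted for a long stretch of text.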
The only way of handling such a paragraph, however, is to let it run, I'm afraid. This is just by far the most complex part of the whole batch, and getting the results we have been getting just takes time. Whether or not it has stalled you can tell from the CPU load (in Task Manager): as long as there is something running full throttle on at least one core, it's most likely still processing.
The error in the parallel batch is a system thing, I think. If I remember correctly, you have 16 GB of RAM in your desktop machine, and you usually give 10 GB to both GGI and the batch, so that simply doesn't fit. It would be a different story if you gave only 6 GB to each of them, as that would fit and still leave enough memory for Windows proper.
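(In numbers: 10 GB plus 10 GB is 20 GB on a 16 GB machine, whereas 6 GB plus 6 GB is 12 GB, leaving roughly 4 GB for Windows itself.)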
Does this indicate that something is still running?
If so, then I will let it run until tomorrow morning.
That does indeed look like something is still processing, even though it's impossible to tell from Task Manager what exactly it is doing. Could you send me the log tomorrow morning? Even considering the effort chunking up a large MC paragraph takes, it shouldn't be running for hours, and the logs would give me the information I need to find out what exactly is taking so long, and then go in and hopefully find some way of making it faster.
Do I get a log if I kill the application, which I think I might have to do? Is there a way to restore what I have done so far (which took quite some time)?
You do get a log, yes, it's written to continuously. It might be missing the last few lines, but those are not as important in this case - what has been going on since you started this will be visible anyway.
As to restoring what was done before you started the MCs, I'm afraid this might be difficult, sorry ... maybe better to let it run overnight, if only to see whether or not there will be any further progress at all. If we're lucky, it'll be finished in the morning and all is well.
Or you hit that "Abort" button so the MCs at least stop after this one paragraph instead of running into the next and taking similarly long. That will preserve all your work as well.
The abort button does not work.
You are right in that it doesn't work instantly ... however, the abort happens as soon as whatever gizmo is running writes the next line into the splash screen. A bit counter-intuitive, I know, but hard to do otherwise without a massively more complex implementation.
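Roughly speaking, and purely as a hypothetical sketch rather than the actual implementation, this kind of cooperative abort looks like the following: the button only sets a flag, and the flag is only consulted when the running component writes its next status line.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of a cooperative abort (not the actual GGI code):
// "Abort" merely flips a flag, which is checked at each progress report,
// so a long-running step always finishes before the abort takes hold.
public class AbortSketch {

    private static final AtomicBoolean abortRequested = new AtomicBoolean(false);

    // called from the "Abort" button
    public static void requestAbort() {
        abortRequested.set(true);
    }

    // called each time the running gizmo writes a status line; returns
    // true if the caller should stop after the current unit of work
    public static boolean reportProgress(String statusLine) {
        System.out.println(statusLine);
        return abortRequested.get();
    }

    public static void main(String[] args) {
        for (int p = 1; p <= 5; p++) {
            if (p == 3)
                requestAbort(); // the user hits "Abort" here
            // the long step (e.g. chunking one huge MC paragraph) runs here,
            // uninterrupted - the flag is only looked at afterwards
            if (reportProgress("processing paragraph " + p)) {
                System.out.println("abort honored after paragraph " + p);
                break;
            }
        }
    }
}
```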
It has been running for six hours now since I pressed abort.
Shoot ... let's hope it gets somewhere by morning. I definitely need to do something about this. I have some idea what might be taking this long, but it takes a very specific constellation of character combinations in the text for that to happen (unbalanced round or square brackets or really weird quoting, and the reverse problem further down the paragraph to apparently finally "close" what was left open). The logs will tell me.
I killed it, since nothing seemed to change.
Here is the second attempt.
I did not run the materials citations, but just marked the holotypes. Can you please check the page numbering? It might be off, because the article has a cover sheet.
A83FBA520A5B2126C365FF9F2A4C4147
I'm looking into the materials citations in this one right now ...
- "Kukulcania hibernalis (Hentz, 1842)": the "additional materials" are a single continuous paragraph running across 8 1/2 pages (!!!).
- "Kukulcania cochimi, sp. nov.": type material is OK.
- "Kukulcania arizonica (Chamberlin and Ivie, 1935)": type material is OK, but the treatment again comes with 4 1/2 pages of continuous additional materials.
- "Kukulcania gertschi, sp. nov.": OK.
- "Kukulcania utahana (Chamberlin and Ivie, 1935)": type material is OK, as are the 1 1/2 pages of additional materials.
- "Kukulcania hurca (Chamberlin and Ivie, 1942)": 3 1/2 pages of continuous additional material, in addition to the reasonably sized type material.
- "Kukulcania brignolii (Alayón, 1981), comb. nov.": a moderate additional material section of a bit below a single page.
- "Kukulcania mexicana, sp. nov.": fine.
- "Kukulcania santosi, sp. nov.": fine.
- "Kukulcania tractans (O. Pickard-Cambridge, 1896)": comes with reasonable amounts of materials.
- "Kukulcania tequila, sp. nov.": OK.
- "Kukulcania chingona, sp. nov.": very moderate in materials.
- "Kukulcania geophila (Chamberlin and Ivie, 1935)": another semi-excessive 2 pages of additional material.
- "Kukulcania benita, sp. nov.": fine.
- "Kukulcania bajacali, sp. nov.": fine.
- "Pikelinia brevipes, comb. nov. (Keyserling, 1883)": comes without any materials at all.
It's only the three extremely long continuous listings of materials citations in "Kukulcania hibernalis (Hentz, 1842)", "Kukulcania arizonica (Chamberlin and Ivie, 1935)", and "Kukulcania hurca (Chamberlin and Ivie, 1942)" that are causing the problems ... While I have managed to keep combinatorics at bay now, the effort for handling a large single block of materials citations still grows considerably more than linearly in the block's length. I'll try and figure out what exactly is (mainly) causing the problems, and maybe find a way of speeding things up, but such excessive concatenations of materials citations are just no joke ... will see what I can do.
What kind of specimen count is "3 imm."? What specimen type is abbreviated "imm."? Easy enough to add, but I always feel better knowing what it means ...
Apart from that, I'm closing in on the problems:
Getting there; now switching to pursuing a solution for the country/region mix-ups - compared to those, a single problematic collection code is more of a pathological case than a real tagger problem.
A hangup safeguard was added to the materials citation handler with the update of Dec 13th, 2019.
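For anyone curious, a safeguard of that general kind could look roughly like the following; this is a hypothetical sketch under my own assumptions, not the actual code of the MC handler: the chunking of a single paragraph is bounded by a timeout, so one pathological paragraph can no longer stall the whole batch.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch of a hangup safeguard: run the chunking of a single
// paragraph under a deadline and skip the paragraph if it takes too long.
public class HangupSafeguardSketch {

    private static final ExecutorService worker = Executors.newSingleThreadExecutor();

    public static String chunkWithTimeout(String paragraph, long timeoutSeconds) {
        Future<String> task = worker.submit(() -> chunkMaterialsCitations(paragraph));
        try {
            return task.get(timeoutSeconds, TimeUnit.SECONDS);
        } catch (TimeoutException te) {
            task.cancel(true); // give up on this paragraph, keep the batch alive
            return null;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        }
    }

    // stand-in for the actual MC chunking
    private static String chunkMaterialsCitations(String paragraph) {
        return paragraph;
    }

    public static void main(String[] args) {
        String result = chunkWithTimeout("1 male, 2 females (CAS) ...", 30);
        System.out.println(result == null ? "skipped (timed out)" : "chunked in time");
        worker.shutdown();
    }
}
```

The trade-off of such a safeguard is that a paragraph hitting the deadline is left untagged and has to be revisited manually, but the batch as a whole keeps moving.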
I am running a large file (250 MB) with many MCs. I am processing it, and it seems to have stalled.
What to do? Wait? If so, how long? Is there a way to see progress?
I unfortunately did not save the file before I started the MC processing. In case I have to abort, can I still recover it?