Open myrmoteras opened 5 years ago
Running the parallel batch, I am getting this:
The problematic part is not the number of paragraphs that contain MCs, but their individual size ... what takes most of the time is the chunking, i.e., finding the boundaries between individual MCs. And 4452 tokens is a really huge paragraph, I'd tend to think one spanning at least 3 pages.
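Just to illustrate what that chunking amounts to, here is a minimal sketch with made-up names and deliberately simplistic heuristics (this is not the actual GGI code): the chunker has to walk the entire paragraph and decide for every candidate delimiter whether it really separates two MCs, which is why one huge paragraph costs far more than many small ones.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only: split one long materials paragraph into
// individual citations at semicolons or periods, but accept a boundary only
// where brackets and quotes are balanced up to that point.
public class McChunkSketch {

    // candidate boundaries: ';' or '.' followed by whitespace
    private static final Pattern CANDIDATE = Pattern.compile("[;.]\\s+");

    public static List<String> chunk(String paragraph) {
        List<String> citations = new ArrayList<>();
        int start = 0;
        Matcher m = CANDIDATE.matcher(paragraph);
        while (m.find()) {
            String candidate = paragraph.substring(start, m.end());
            // reject the boundary if it would cut through an open bracket or
            // quote - unbalanced ones force the scan to run on and on
            if (balanced(candidate)) {
                citations.add(candidate.trim());
                start = m.end();
            }
        }
        if (start < paragraph.length())
            citations.add(paragraph.substring(start).trim());
        return citations;
    }

    private static boolean balanced(String s) {
        int round = 0, square = 0, quotes = 0;
        for (char c : s.toCharArray()) {
            if (c == '(') round++;
            else if (c == ')') round--;
            else if (c == '[') square++;
            else if (c == ']') square--;
            else if (c == '"') quotes++;
        }
        return round == 0 && square == 0 && (quotes % 2) == 0;
    }
}
```

A sketch like this also hints at why unbalanced brackets or odd quoting can keep a boundary from being accepted for a long stretch of text.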
The only way of handling such a paragraph, however, is to let it run, I'm afraid. This is just by far the most complex part of the whole batch, and getting the results we have been getting just takes time. Whether or not it has stalled you can tell from the CPU load (in Task Manager): as long as there is something running full throttle on at least one core, it's most likely still processing.
The error in the parallel batch is a system thing, I think. If I remember correctly, you have 16 GB of RAM in your desktop machine, and you usually give 10 GB to both GGI and the batch, so that simply doesn't fit. It would be a different story if you gave only 6 GB to each of them, as that would fit and still leave enough memory for Windows proper.
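(In numbers: 10 GB plus 10 GB is 20 GB on a 16 GB machine, whereas 6 GB plus 6 GB is 12 GB, leaving roughly 4 GB for Windows itself.)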
Does this indicate that something is still running?
If so, then I will let it run until tomorrow morning.
That does indeed look like something is still processing, even though it's impossible to tell from Task Manager what exactly it is doing. Could you send me the log tomorrow morning? Even considering the effort chunking up a large MC paragraph takes, it shouldn't be running for hours, and the logs would give me the information I need to find out what exactly is taking so long, and then go in and hopefully find some way of making it faster.
Do I get a log if I kill the application, which I think I might have to do? Is there a way to restore what I have done so far (which took quite some time)?
You do get a log, yes, it's written to continuously. It might be missing the last few lines, but those are not as important in this case - what has been going on since you started this will be visible anyway.
As to restoring what was done before you started the MCs, I'm afraid this might be difficult, sorry ... maybe better to let it run overnight, if only to see whether or not there will be any further progress at all. If we're lucky, it'll be finished in the morning and all is well.
Or you hit that "Abort" button so the MCs at least stop after this one paragraph instead of running into the next and taking similarly long. That will preserve all your work as well.
The abort button does not work.
You are right in that it doesn't work instantly ... however, the abort happens as soon as whatever gizmo is running writes the next line into the splash screen. A bit counter-intuitive, I know, but hard to do otherwise without a massively more complex implementation.
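Roughly speaking, and purely as a hypothetical sketch rather than the actual implementation, this kind of cooperative abort looks like the following: the button only sets a flag, and the flag is only consulted when the running component writes its next status line.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of a cooperative abort (not the actual GGI code):
// "Abort" merely flips a flag, which is checked at each progress report,
// so a long-running step always finishes before the abort takes hold.
public class AbortSketch {

    private static final AtomicBoolean abortRequested = new AtomicBoolean(false);

    // called from the "Abort" button
    public static void requestAbort() {
        abortRequested.set(true);
    }

    // called each time the running gizmo writes a status line; returns
    // true if the caller should stop after the current unit of work
    public static boolean reportProgress(String statusLine) {
        System.out.println(statusLine);
        return abortRequested.get();
    }

    public static void main(String[] args) {
        for (int p = 1; p <= 5; p++) {
            if (p == 3)
                requestAbort(); // the user hits "Abort" here
            // the long step (e.g. chunking one huge MC paragraph) runs here,
            // uninterrupted - the flag is only looked at afterwards
            if (reportProgress("processing paragraph " + p)) {
                System.out.println("abort honored after paragraph " + p);
                break;
            }
        }
    }
}
```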
It has been running for six hours now since I pressed abort.
Shoot ... let's hope it gets somewhere by morning. I definitely need to do something about this. I have some idea what might be taking this long, but it takes a very specific constellation of character combinations in the text for that to happen (unbalanced round or square brackets or really weird quoting, and the reverse problem further down the paragraph to apparently finally "close" what was left open). The logs will tell me.
I killed it, since nothing seemed to change.
Here is the second attempt.
I did not run the materials citations, but just marked the holotypes. Can you please check the page numbering? It might be off, because the article has a cover sheet.
A83FBA520A5B2126C365FF9F2A4C4147
I'm looking into the materials citations in this one right now ...
- "Kukulcania hibernalis (Hentz, 1842)": the "additional materials" are a single continuous paragraph running across 8 1/2 pages (!!!).
- "Kukulcania cochimi, sp. nov.": type material is OK.
- "Kukulcania arizonica (Chamberlin and Ivie, 1935)": type material is OK, but the treatment again comes with 4 1/2 pages of continuous additional materials.
- "Kukulcania gertschi, sp. nov.": OK.
- "Kukulcania utahana (Chamberlin and Ivie, 1935)": type material is OK, as are the 1 1/2 pages of additional materials.
- "Kukulcania hurca (Chamberlin and Ivie, 1942)": 3 1/2 pages of continuous additional material, in addition to the reasonably sized type material.
- "Kukulcania brignolii (Alayón, 1981), comb. nov.": a moderate additional material section of a bit below a single page.
- "Kukulcania mexicana, sp. nov.": fine.
- "Kukulcania santosi, sp. nov.": fine.
- "Kukulcania tractans (O. Pickard-Cambridge, 1896)": comes with reasonable amounts of materials.
- "Kukulcania tequila, sp. nov.": OK.
- "Kukulcania chingona, sp. nov.": very moderate in materials.
- "Kukulcania geophila (Chamberlin and Ivie, 1935)": another semi-excessive 2 pages of additional material.
- "Kukulcania benita, sp. nov.": fine.
- "Kukulcania bajacali, sp. nov.": fine.
- "Pikelinia brevipes, comb. nov. (Keyserling, 1883)": comes without any materials at all.
It's only the three extremely long continuous listings of materials citations in "Kukulcania hibernalis (Hentz, 1842)", "Kukulcania arizonica (Chamberlin and Ivie, 1935)", and "Kukulcania hurca (Chamberlin and Ivie, 1942)" that are causing the problems ... While I have managed to keep combinatorics at bay now, the effort for handling a large single block of materials citations still grows considerably more than linearly in the block's length. I'll try and figure out what exactly is (mainly) causing the problems, and maybe find a way of speeding things up, but such excessive concatenations of materials citations are just no joke ... will see what I can do.
What kind of specimen count is "3 imm."? What specimen type is abbreviated "imm."? Easy enough to add, but I always feel better knowing what it means ...
Apart from that, I'm closing in on the problems:
Getting there; now switching to pursuing a solution for the country/region mix-ups - compared to those, a single problematic collection code is more of a pathological case than a real tagger problem.
A hangup safeguard was added to the materials citation handler with the update of Dec 13th, 2019.
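For anyone curious, a safeguard of that general kind could look roughly like the following; this is a hypothetical sketch under my own assumptions, not the actual code of the MC handler: the chunking of a single paragraph is bounded by a timeout, so one pathological paragraph can no longer stall the whole batch.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch of a hangup safeguard: run the chunking of a single
// paragraph under a deadline and skip the paragraph if it takes too long.
public class HangupSafeguardSketch {

    private static final ExecutorService worker = Executors.newSingleThreadExecutor();

    public static String chunkWithTimeout(String paragraph, long timeoutSeconds) {
        Future<String> task = worker.submit(() -> chunkMaterialsCitations(paragraph));
        try {
            return task.get(timeoutSeconds, TimeUnit.SECONDS);
        } catch (TimeoutException te) {
            task.cancel(true); // give up on this paragraph, keep the batch alive
            return null;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        }
    }

    // stand-in for the actual MC chunking
    private static String chunkMaterialsCitations(String paragraph) {
        return paragraph;
    }

    public static void main(String[] args) {
        String result = chunkWithTimeout("1 male, 2 females (CAS) ...", 30);
        System.out.println(result == null ? "skipped (timed out)" : "chunked in time");
        worker.shutdown();
    }
}
```

The trade-off of such a safeguard is that a paragraph hitting the deadline is left untagged and has to be revisited manually, but the batch as a whole keeps moving.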
I am running a large file (250 MB) with many MCs. I am processing it, and it seems to have stalled.
What to do? Wait? If so, how long? Is there a way to see progress?
I unfortunately did not save the file before I started the MC processing. In case I have to abort, can I still recover it?