tuetschek closed this 8 years ago
Thanks.
> a specific tricky scenario

What about aligned readers? They don't have `$self->from`.
I hope it will work as well, since in parallel processing the jobs should always read treex files, but I don't have time to check and test it now.
OK, aligned readers don't work. I'll try to find out how they work.
I've added a fix – now it should work for both aligned and normal readers.
Thank you
In parallel processing, using `skip_finished` could lead to unpredictable results, as each worker creates its own list of files to process. The lists will not be identical if some workers start earlier and finish some files before other workers start. This leads to some files being processed multiple times (with clashes on the output, making it unreadable), and some files not being processed at all.

This is an attempt to fix that by overriding the file lists in parallel processing: each time, the list is replaced with just the single file to be processed next.
I tested this in a few simple cases (including extracting vectors from 400k files in 100 jobs) and it seems to work well. If you have a specific tricky scenario to test it before merging, please let me know.
(Sorry about `Block::Write::YAML` appearing here; I committed it to master as well, but it somehow got a different hash.)