Fixing DocumentReader to work with skip_finished in parallel

tuetschek commented 8 years ago

In parallel processing, using skip_finished could lead to unpredictable results as each of the workers creates its own list of files to process. The lists will not be identical if some of the workers start earlier and finish some files before other workers start. This leads to some files being processed multiple times (with clashes on the output, making it unreadable), and some files not being processed at all.
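The failure mode can be reproduced with a small simulation. This is only a sketch in Python, not Treex code. It assumes that each worker resolves an assigned document number against its own file list; the thread does not say how Treex actually assigns documents. The names `build_file_list` and `lookup` are hypothetical.

```python
# Illustrative simulation only; this is not Treex code.
def build_file_list(all_files, finished):
    """With skip_finished, a worker lists only files that have no output yet."""
    return [f for f in all_files if f not in finished]


def lookup(file_list, doc_number):
    """Assumption: a worker resolves an assigned document number in its own list."""
    return file_list[doc_number] if doc_number < len(file_list) else None


all_files = ["a", "b", "c", "d"]
finished = set()

# Worker 0 starts first, builds its list, and finishes document 0.
list0 = build_file_list(all_files, finished)
processed = [lookup(list0, 0)]
finished.add(processed[0])

# Worker 1 starts later, so its list no longer contains the finished file.
list1 = build_file_list(all_files, finished)

# The remaining document numbers are resolved against two different lists.
processed.append(lookup(list1, 1))
processed.append(lookup(list0, 2))
processed.append(lookup(list1, 3))

done = [f for f in processed if f is not None]
duplicates = sorted({f for f in done if done.count(f) > 1})
missing = sorted(set(all_files) - set(done))
```

In this toy run one file ends up processed twice and two files are never processed, which matches the symptoms described above.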

This is an attempt to fix this by overriding the file lists in parallel processing, replacing the list with just the single file to be processed next each time.
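The idea of the fix can be sketched the same way. Again this is a Python illustration under assumptions, not the actual patch: the `Reader` class, `override_files`, and the round-robin scheduling are all hypothetical stand-ins. The point is that a single authoritative list decides what comes next, and each worker's own list is replaced by that one file.

```python
from collections import deque


class Reader:
    """Hypothetical stand-in for a document reader that holds a file list."""

    def __init__(self, files):
        self.files = list(files)

    def override_files(self, files):
        # Replace whatever list this worker built on its own.
        self.files = list(files)

    def read_all(self):
        return list(self.files)


def run(all_files, n_workers):
    """Hand out files one at a time from a single shared queue."""
    queue = deque(all_files)
    # Each worker may have started with a stale list; it is never consulted.
    readers = [Reader(all_files[i:]) for i in range(n_workers)]
    processed = []
    turn = 0
    while queue:
        next_file = queue.popleft()
        reader = readers[turn % n_workers]  # round-robin, for illustration only
        reader.override_files([next_file])
        processed.extend(reader.read_all())
        turn += 1
    return processed


result = run(["a", "b", "c", "d"], 2)
```

Because every worker only ever sees the single file it was handed, differences between the workers' initial lists no longer matter.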

I tested this in a few simple cases (including extracting vectors from 400k files in 100 jobs) and it seems to work well. If you have a specific tricky scenario to test before merging, please let me know.

(Sorry about Block::Write::YAML appearing here; I committed it to master as well, but it somehow got a different hash.)

martinpopel commented 8 years ago

Thanks.

> a specific tricky scenario

What about aligned readers? They don't have $self->from. I hope it will work as well, because in parallel processing the jobs should always read treex files, but I don't have time to check and test it now.

tuetschek commented 8 years ago

OK, aligned readers don't work. I'll try to find out how they work.

tuetschek commented 8 years ago

I've added a fix – now it should work for both aligned and normal readers.

martinpopel commented 8 years ago

Thank you

ufal / treex #33