joshua-decoder / joshua

Joshua Statistical Machine Translation Toolkit
http://joshua-decoder.org/

Parallel aligners #202

Closed lukeorland closed 9 years ago

lukeorland commented 9 years ago

(quoting @mjpost )

For alignment, the data is split into chunks of $ALIGNER_BLOCKSIZE, and then the aligner (GIZA++ or Berkeley) is run on each chunk separately, in sequence. Since alignment is very time-consuming, running the chunks one after another is hugely wasteful. It would be really nice if the alignment steps, which are embarrassingly parallel, ran in parallel, up to $NUM_THREADS.
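A minimal sketch of the chunking step described above, written in Python for illustration rather than in the pipeline's Perl. The function name `split_into_blocks` and its parameters are hypothetical and do not come from pipeline.pl; `block_size` stands in for $ALIGNER_BLOCKSIZE.

```python
def split_into_blocks(sentence_pairs, block_size):
    """Split a list of sentence pairs into consecutive blocks.

    Hypothetical helper: block_size plays the role of $ALIGNER_BLOCKSIZE.
    Every block has block_size items except possibly the last one.
    """
    if block_size < 1:
        raise ValueError("block_size must be at least 1")
    return [
        sentence_pairs[start:start + block_size]
        for start in range(0, len(sentence_pairs), block_size)
    ]


if __name__ == "__main__":
    # Five toy "sentence pairs" split into blocks of two.
    print(split_into_blocks(["s1", "s2", "s3", "s4", "s5"], 2))
```

Each block is independent of the others, which is what makes the alignment step embarrassingly parallel.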

There is some code to use a pool, and some aborted code in there that uses qsub, but I don't want to use multi-node parallelism, just multi-threading. The problem with the thread pool (see commented-out code around line 803 in pipeline.pl) is that each thread has to call system() to actually run the aligner, and these don't seem to be allowed in parallel (at least, that was my diagnosis). There has to be some way to run these in parallel, however. You could look at Moses and steal its scripts, if needed. In particular, I think a fork() model (instead of thread pools) could work well. See $MOSES/scripts/ems/support/generic-multicore-parallelizer.perl.
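The process-per-chunk idea can be sketched as follows, again in Python for illustration and not as the code that went into pipeline.pl. It starts one child process per command and keeps at most `max_procs` of them alive at once, which is the role $NUM_THREADS plays above. The name `run_bounded` and the toy commands are assumptions; a real run would pass one aligner invocation per chunk.

```python
import subprocess
import sys
import time


def run_bounded(commands, max_procs, poll_interval=0.05):
    """Run each command as its own child process, at most max_procs at a time.

    Hypothetical helper: commands is a list of argv lists.
    Returns the exit codes in the same order as the commands.
    """
    if max_procs < 1:
        raise ValueError("max_procs must be at least 1")
    pending = list(enumerate(commands))
    running = {}  # command index -> subprocess.Popen
    exit_codes = [None] * len(commands)
    while pending or running:
        # Start new children while there are free slots.
        while pending and len(running) < max_procs:
            index, argv = pending.pop(0)
            running[index] = subprocess.Popen(argv)
        # Collect any children that have finished.
        for index, proc in list(running.items()):
            code = proc.poll()
            if code is not None:
                exit_codes[index] = code
                del running[index]
        if running:
            time.sleep(poll_interval)
    return exit_codes


if __name__ == "__main__":
    # Toy stand-ins for per-chunk aligner commands (assumption).
    toy = [[sys.executable, "-c", "import sys; sys.exit(0)"] for _ in range(4)]
    print(run_bounded(toy, max_procs=2))
```

Because each chunk runs in a separate process, a non-zero exit code from any one of them can be checked before the per-chunk alignments are concatenated.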

https://trello.com/c/Cv4UQjLM/64-parallelize-alignment

lukeorland commented 9 years ago

Once/if this is proven to work, I'll strip out all the commented lines that were the plans for this change, and push up another commit.

lukeorland commented 9 years ago

I'm probably not using the most idiomatic Perl; feel free to make suggestions.

mjpost commented 9 years ago

Okay, I'm testing this now.

hieuhoang commented 9 years ago

Do you guys use mgiza or fast_align? mgiza is multi-threaded, and fast_align is, err, fast. I heard it's going to get even faster soon with multithreading support.

mjpost commented 9 years ago

@hieuhoang, we have options for GIZA++ or the Berkeley aligner. We don't use mgiza, but instead split the corpus into blocks and align them independently. This isn't as good as mgiza (which I think just parallelizes the E step?), and I haven't tested the comparison, though I should. On the other hand, it works for any aligner we might want to use.

I've found better performance with Berkeley, particularly for noisy text and low-resource languages. I've been meaning to import fast_align but haven't got around to it. It would be good to compare it to Berkeley.

mjpost commented 9 years ago

Tested this and it works great, thanks, @lukeorland.