marian-nmt / marian

Fast Neural Machine Translation in C++
https://marian-nmt.github.io

R2L Rescoring #33

Closed: mehmedes closed this issue 7 years ago

mehmedes commented 7 years ago

Would it be possible to perform R2L rescoring with amunmt? How could I integrate amunmt into Rico's R2L translation script below?

#!/bin/bash

# this sample script translates a test set, including
# preprocessing (tokenization, truecasing, and subword segmentation),
# and postprocessing (merging subword units, detruecasing, detokenization).

# instructions: set paths to mosesdecoder, subword_nmt, and nematus,
# then run "./translate.sh < input_file > output_file"

# suffix of source language
SRC=en

# suffix of target language
TRG=de

# path to moses decoder: https://github.com/moses-smt/mosesdecoder
mosesdecoder=/home/sariyildiznureddin/mosesdecoder

# path to subword segmentation scripts: https://github.com/rsennrich/subword-nmt
subword_nmt=/home/sariyildiznureddin/subword-nmt

# path to nematus ( https://www.github.com/rsennrich/nematus )
nematus=/home/sariyildiznureddin/nematus

# theano device
device=cpu

# temporary file (needed for r2l rescoring)
tmpfile=`mktemp`

# preprocess
$mosesdecoder/scripts/tokenizer/normalize-punctuation.perl -l $SRC | \
$mosesdecoder/scripts/tokenizer/tokenizer.perl -l $SRC -penn | \
$mosesdecoder/scripts/recaser/truecase.perl -model truecase-model.$SRC | \
$subword_nmt/apply_bpe.py -c $SRC$TRG.bpe > $tmpfile
# translate
cat $tmpfile | THEANO_FLAGS=mode=FAST_RUN,floatX=float32,device=$device,on_unused_input=warn python $nematus/nematus/translate.py \
     -m model-ens1.npz model-ens2.npz model-ens3.npz model-ens4.npz \
     -k 50 -n -p 1 --n-best --suppress-unk | \
# reverse
python r2l/reverse_nbest.py | \
# rescore with r2l model
THEANO_FLAGS=mode=FAST_RUN,floatX=float32,device=$device,on_unused_input=warn python $nematus/nematus/rescore.py \
      -m r2l/model-ens1.npz r2l/model-ens2.npz r2l/model-ens3.npz r2l/model-ens4.npz -s $tmpfile -b 80 -n | \
python r2l/rerank.py | \
# restore original word order
python r2l/reverse.py | \
# postprocess
sed 's/\@\@ //g' | \
$mosesdecoder/scripts/recaser/detruecase.perl | \
$mosesdecoder/scripts/tokenizer/detokenizer.perl -l $TRG

rm $tmpfile
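
The reverse, rerank, and restore steps in the script above can be illustrated with a minimal Python sketch. This is not the code of r2l/reverse_nbest.py, r2l/rerank.py, or r2l/reverse.py. The function names, the `id ||| hypothesis ||| scores` line format, and the convention that scores are costs (lower is better) are all assumptions made for illustration.

```python
def reverse_tokens(sentence):
    """Reverse the order of whitespace-separated tokens."""
    return " ".join(reversed(sentence.split()))


def parse_nbest_line(line):
    """Split 'id ||| hypothesis ||| score1 [score2 ...]' into typed fields.

    The three-field layout is an assumed n-best format, not a documented one.
    """
    sent_id, hyp, scores = [field.strip() for field in line.split("|||")]
    return int(sent_id), hyp, [float(s) for s in scores.split()]


def rerank(nbest_lines):
    """Pick, per sentence id, the hypothesis with the lowest summed cost."""
    best = {}
    for line in nbest_lines:
        sent_id, hyp, scores = parse_nbest_line(line)
        total = sum(scores)  # e.g. l2r cost + r2l cost, equally weighted
        if sent_id not in best or total < best[sent_id][0]:
            best[sent_id] = (total, hyp)
    return [best[i][1] for i in sorted(best)]


# Hypotheses are stored in reversed (r2l) word order, as after rescoring.
nbest = [
    "0 ||| Haus das ist das ||| 1.2 0.9",
    "0 ||| Haus ein ist das ||| 1.0 1.5",
    "1 ||| Welt hallo ||| 0.3 0.4",
]
restored = [reverse_tokens(h) for h in rerank(nbest)]
print(restored)  # ['das ist das Haus', 'hallo Welt']
```

Summing the two costs with equal weight is one simple choice; the actual scripts may combine the scores differently.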

L2R translation with amunmt worked well:

#!/bin/sh

# this sample script translates a test set, including
# preprocessing (tokenization, truecasing, and subword segmentation),
# and postprocessing (merging subword units, detruecasing, detokenization).

# instructions: set paths to mosesdecoder, subword_nmt, and nematus,
# then run "./translate.sh < input_file > output_file"

# suffix of source language
SRC=en

# suffix of target language
TRG=de

# path to moses decoder: https://github.com/moses-smt/mosesdecoder
mosesdecoder=/home/sariyildiznureddin/mosesdecoder

# preprocess
$mosesdecoder/scripts/tokenizer/normalize-punctuation.perl -l $SRC | \
$mosesdecoder/scripts/tokenizer/tokenizer.perl -l $SRC | \
$mosesdecoder/scripts/recaser/truecase.perl -model truecase-model.$SRC | \

# translate
/home/sariyildiznureddin/amunmt/build/bin/amun -c /home/sariyildiznureddin/amunmt/build/bin/config.ens.yml | \

sed 's/\@\@ //g' | \
$mosesdecoder/scripts/recaser/detruecase.perl | \
$mosesdecoder/scripts/tokenizer/detokenizer.perl -l $TRG -penn

emjotde commented 7 years ago

We had a rescorer once, but it disappeared due to lack of support and interest. Bringing it back is possible in principle, but I do not think anyone will currently work on this, sorry.

emjotde commented 7 years ago

On second thought, the rescorer might be brought back once we integrate our training pipeline. It will be a lot easier then, as the forward step for training is basically rescoring. It might take a while, however.
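
To illustrate why the training forward step is basically rescoring: both feed a fixed source and target pair through the model and accumulate the negative log-probability of each target token, with no search involved. A minimal sketch, assuming a hypothetical `next_token_probs(source, prefix)` callable that stands in for the model (it is not an amunmt or Nematus API):

```python
import math


def sentence_cost(next_token_probs, source, target, normalize=False):
    """Negative log-probability of `target` given `source`.

    `next_token_probs(source, prefix)` is a hypothetical stand-in for a
    model: it returns a dict mapping each candidate next token to its
    probability, given the source and the target prefix so far.
    """
    tokens = target.split() + ["</s>"]  # score the end-of-sentence token too
    cost = 0.0
    for i, token in enumerate(tokens):
        probs = next_token_probs(source, tokens[:i])
        cost -= math.log(probs[token])
    # Optional length normalization so longer hypotheses are not penalized
    return cost / len(tokens) if normalize else cost


def toy_model(source, prefix):
    """Toy distribution that ignores its context (illustration only)."""
    return {"hallo": 0.5, "Welt": 0.25, "</s>": 0.25}


cost = sentence_cost(toy_model, "hello world", "hallo Welt")
print(round(cost, 4))  # 3.4657, i.e. -log(0.5 * 0.25 * 0.25)
print(round(sentence_cost(toy_model, "hello world", "hallo Welt", normalize=True), 4))  # 1.1552
```

In training the same quantity is the per-sentence cross-entropy loss; a rescorer would only need to report it instead of backpropagating through it.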

mehmedes commented 7 years ago

Ok. Thanks!