marian_decoder invocation with sentencepiece

marian-nmt / marian

Fast Neural Machine Translation in C++

https://marian-nmt.github.io


marian_decoder invocation with sentencepiece #329

Closed sshleifer closed 4 years ago

sshleifer commented 4 years ago

cmake .. -DCOMPILE_CUDA=off -DUSE_SENTENCEPIECE=on
make -j4

Fetch the OPUS-MT en-de model:

wget https://object.pouta.csc.fi/OPUS-MT-models/en-de/opus-2020-02-26.zip
unzip opus-2020-02-26.zip -d en-de

If I pass the default vocab.yaml file (as decoder.yaml suggests) without sentencepiece, I get a strange translation:

export MD="en-de"
export vpath=$MD/opus.spm32k-spm32k.vocab.yml
export mpath=$MD/opus.spm32k-spm32k.transformer-align.model1.npz.best-perplexity.npz

build/marian-decoder -m $mpath \
    -v $vpath $vpath <<< "I am a small frog"

Results: [2020-04-20 08:47:31] Best translation 0 : ▁I am a klein

If I pass source.spm and target.spm instead, I also get bad results:

build/marian-decoder -m  $mpath -v en-de/source.spm en-de/target.spm <<< "I am a small frog"

Results: [2020-04-20 08:52:09] Best translation 0 : was Reisende

Is this a model shortcoming or is my invocation of marian-decoder incorrect?

emjotde commented 4 years ago

Looks like they run a whole set of preprocessing in preprocess.sh and then postprocessing in postprocess.sh on the output. If you don't run that you will get vocabulary mismatches.

emjotde commented 4 years ago

This is really a question for the OPUS people not for us. It's their configuration, we are not affiliated.

Taking a few more looks at that, they have somewhat complicated settings there concerning vocabulary etc. I wish they had consulted us before releasing that many models with settings like that.

What I can piece together: the only working invocation is the first one with the *.yml vocab, but you have to run preprocess.sh first, which defeats the purpose of having SentencePiece vocabs in the first place.

Since they then went on to generate the vocab from the segmented text instead of using the binary (and included?) SentencePiece model, the ids don't match, and the *.spm files can only be used in external preprocessing. The target.spm isn't even involved unless you want to segment target text externally, for instance for validation with marian-scorer. A bit of a mess, unfortunately.
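
To make the id mismatch concrete, here is a toy sketch. The two vocabularies below are invented for illustration and are not the real OPUS-MT ids; the helper names (id_of, piece_at, reinterpret) are hypothetical too. It only shows what happens in principle when pieces are numbered with one vocabulary and the numbers are then read back through another.

```shell
# Toy data (hypothetical): the same pieces, numbered in two different orders.
spm_order="<unk> ▁I ▁am ▁a ▁small ▁frog"   # order as one vocabulary might store them
yml_order="<unk> ▁a ▁frog ▁I ▁small ▁am"   # order as another vocabulary might store them

# id_of VOCAB PIECE: print the zero-based position of PIECE in VOCAB.
id_of() {
    i=0
    for p in $1; do
        if [ "$p" = "$2" ]; then echo "$i"; return 0; fi
        i=$((i + 1))
    done
    return 1
}

# piece_at VOCAB ID: print the piece stored at position ID in VOCAB.
piece_at() {
    i=0
    for p in $1; do
        if [ "$i" -eq "$2" ]; then echo "$p"; return 0; fi
        i=$((i + 1))
    done
    return 1
}

# reinterpret ENC_VOCAB MODEL_VOCAB PIECE...: number the pieces with ENC_VOCAB,
# then show which pieces those numbers mean under MODEL_VOCAB.
reinterpret() {
    enc=$1; model=$2; shift 2
    out=""
    for piece in "$@"; do
        id=$(id_of "$enc" "$piece") || id=0   # unknown pieces fall back to id 0
        out="$out$(piece_at "$model" "$id") "
    done
    echo "${out% }"
}
```

Calling reinterpret with the same vocabulary twice returns the input pieces unchanged; calling it with spm_order and yml_order returns a scrambled sequence, which is roughly the situation the model ends up in when it is fed ids from a vocabulary it was not trained with.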

emjotde commented 4 years ago

BTW, @sshleifer if you wonder why I deleted your comment in the other issue, I wanted the mention here to go away since they are not related. But, oh well, apparently that's not how it works.

sshleifer commented 4 years ago

That works, thanks. I get a good translation if I run ./preprocess.sh first and then use only the .yml vocab.
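
For anyone landing here later, the working setup can be sketched roughly as below. This is a sketch under assumptions, not a verified recipe: the thread does not show the arguments preprocess.sh expects, so the "en" language id and the source.spm argument are guesses based on the files in the package (check the usage comment at the top of the script in your copy). The function name translate_opus and the MD / MARIAN environment variables are made up for this example; the model and vocab paths are the ones from the original post.

```shell
# translate_opus: read raw English text on stdin, print the translation on stdout.
# ASSUMPTION: preprocess.sh is called as "preprocess.sh <lang-id> <spm-model>"
# and reads stdin; postprocess.sh takes no arguments and reads stdin.
translate_opus() {
    md="${MD:-en-de}"                          # directory the zip was unpacked into
    marian="${MARIAN:-build/marian-decoder}"   # path to the decoder binary
    vpath="$md/opus.spm32k-spm32k.vocab.yml"
    mpath="$md/opus.spm32k-spm32k.transformer-align.model1.npz.best-perplexity.npz"

    # Segment externally, decode with the .yml vocab on both sides,
    # then undo the segmentation on the decoder output.
    bash "$md/preprocess.sh" en "$md/source.spm" \
        | "$marian" -m "$mpath" -v "$vpath" "$vpath" \
        | bash "$md/postprocess.sh"
}
```

With the package unpacked into en-de as above, something like `echo "I am a small frog" | translate_opus` should then go through all three steps. The point of the design is that the *.spm files are used only in the external preprocessing step, while marian-decoder itself only ever sees the *.yml vocab.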