Closed by sshleifer 4 years ago
Looks like they run a whole set of preprocessing steps in preprocess.sh and then postprocessing in postprocess.sh on the output. If you don't run those you will get vocabulary mismatches.
This is really a question for the OPUS people not for us. It's their configuration, we are not affiliated.
Taking a few more looks at that, they have somewhat complicated settings there concerning vocabulary etc. I wish they would have consulted us before releasing that many models with settings like that.
What I can piece together: the only working invocation is the first one with the *.yml vocab, but you have to run preprocess.sh, which defeats the purpose of having SentencePiece vocabs in the first place.
Since they then went on to generate the vocab from the segmented text instead of using the binary (and included?) SentencePiece model, the ids don't match and the *.spm files can only be used in external preprocessing. The target.spm isn't even involved unless you want to segment target text externally, for instance for validation with marian-scorer. A bit of a mess unfortunately.
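To make the id mismatch concrete, here is a toy sketch in plain Python. It does not use the real OPUS files or the SentencePiece library: the piece list, the corpus and the helper names are all made up, and the frequency ordering with two reserved special symbols is only an assumption about how a vocab built from segmented text might be laid out.

```python
from collections import Counter

# Hypothetical piece inventory of a SentencePiece model; list index = piece id.
spm_pieces = ["<unk>", "<s>", "</s>", "▁I", "▁am", "▁a", "▁small", "▁traveller"]
spm_ids = {piece: i for i, piece in enumerate(spm_pieces)}

# Hypothetical training text that was already segmented externally.
segmented_corpus = [
    "▁I ▁am ▁a ▁traveller",
    "▁a ▁small ▁traveller",
    "▁a ▁traveller",
]


def vocab_from_segmented_text(lines):
    """Build a token -> id map from segmented text, most frequent token first.

    Assumption for this sketch: two special symbols occupy ids 0 and 1.
    """
    counts = Counter(tok for line in lines for tok in line.split())
    # Sort by descending count, ties broken alphabetically for determinism.
    ordered = sorted(counts, key=lambda t: (-counts[t], t))
    vocab = {"</s>": 0, "<unk>": 1}
    for tok in ordered:
        vocab[tok] = len(vocab)
    return vocab


def encode(tokens, table, unk="<unk>"):
    """Map tokens to ids, falling back to the unknown symbol."""
    return [table.get(tok, table[unk]) for tok in tokens]


text_ids = vocab_from_segmented_text(segmented_corpus)

tokens = "▁I ▁am ▁a ▁small ▁traveller".split()
ids_spm = encode(tokens, spm_ids)    # ids the SentencePiece model would assign
ids_text = encode(tokens, text_ids)  # ids the text-derived vocab assigns

# Same pieces, different numbering.
mismatches = [tok for tok in tokens if spm_ids[tok] != text_ids[tok]]
print(ids_spm, ids_text, mismatches)
```

The pieces are identical in both tables, only the numbering differs. A model trained against one numbering and fed ids from the other looks up the wrong embeddings, which would explain fluent-looking but unrelated output.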
BTW, @sshleifer if you wonder why I deleted your comment in the other issue, I wanted the mention here to go away since they are not related. But, oh well, apparently that's not how it works.
That works, thanks. I get a good translation if I run ./preprocess.sh before and then only use the .yaml vocabs.
Fetch the Opus-NMT en-de model:
If I pass the default vocab.yaml file (as decoder.yaml suggests) without sentencepiece, I get a strange translation:
Results: [2020-04-20 08:47:31] Best translation 0 : ▁I am a klein
If I pass source.spm and target.spm, I also get bad results:
Results: [2020-04-20 08:52:09] Best translation 0 : was Reisende
Is this a model shortcoming or is my invocation of marian-decoder incorrect?