marian-nmt / marian-dev

Fast Neural Machine Translation in C++ - development repository
https://marian-nmt.github.io

Add an option to not encode sentencepiece during training/decoding al… #1003

Closed XapaJIaMnu closed 1 year ago

XapaJIaMnu commented 1 year ago

Description

This PR adds the ability to train or decode with a sentence that has already had spm_encode --model model.spm applied to it.

The benefit of this is that we can apply SPM modifications before feeding the data to marian, which gives us more flexibility than SPM alone allows.

The code is minimally intrusive and doesn't change the behavior unless the flag is toggled on.
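As a purely illustrative example of the kind of external preprocessing this enables (not taken from this PR; the model path and sampling options are placeholders), pieces could be produced in Python and the resulting line piped to marian-decoder with the new flag:

# Illustrative sketch only: produce SentencePiece pieces outside marian,
# here with sampled segmentation, then feed them to marian-decoder --no-spm-encode.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="vocab.deen.spm")  # placeholder model path
line = "Die Liste der Partien der Schachweltmeisterschaft 1986 ..."

pieces = sp.encode(line, out_type=str, enable_sampling=True, alpha=0.1, nbest_size=-1)
print(" ".join(pieces))  # this line of pieces goes to marian-decoder ... --no-spm-encode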

How to test

$ echo "Die Liste der Partien der Schachweltmeisterschaft 1986 führt sämtliche Partien auf, die beim Wettkampf um den Weltmeistertitel im Schach zwischen dem seit 1985 amtierenden Weltmeister Garri Kasparow und dem Herausforderer Anatoli Karpow (beide Sowjetunion) gespielt wurden." | ~/marian-dev/build/spm_encode --model vocab.deen.spm |  ~/marian-dev/build/marian-decoder -c model.npz.best-bleu.npz.decoder.yml --mini-batch 1 --maxi-batch 1 --quiet --quiet-translation --no-spm-encode
The list of World Chess Championship games 1986 lists all the games played in the competition for the World Chess Championship title between the world champion Garri Kasparov and the challenger Anatoly Karpow (both of the Soviet Union) since 1985.

$ echo "Die Liste der Partien der Schachweltmeisterschaft 1986 führt sämtliche Partien auf, die beim Wettkampf um den Weltmeistertitel im Schach zwischen dem seit 1985 amtierenden Weltmeister Garri Kasparow und dem Herausforderer Anatoli Karpow (beide Sowjetunion) gespielt wurden." |  ~/marian-dev/build/marian-decoder -c model.npz.best-bleu.npz.decoder.yml --mini-batch 1 --maxi-batch 1 --quiet --quiet-translation
The list of World Chess Championship games 1986 lists all the games played in the competition for the World Chess Championship title between the world champion Garri Kasparov and the challenger Anatoly Karpow (both of the Soviet Union) since 1985.


graemenail commented 1 year ago

If we prefer to produce output in SPM pieces, then we could use this for mapping.

jelmervdl commented 1 year ago

I was initially of the opinion that token ids would be safer, but seeing how byte-fallback pieces look, I'd say pieces are fine. Maybe even better, because they're somewhat human-readable and you can see what's going on.

>>> spm.encode('🤣', out_type=str)
['▁', '<0xF0>', '<0x9F>', '<0xA4>', '<0xA3>']
>>> spm.encode('🤣', out_type=int)
[275, 247, 166, 171, 170]
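For context, a minimal sketch of how the spm object in the snippet above might be set up (the model path is a placeholder and assumes a model trained with byte fallback); decode shows that the byte pieces round-trip:

import sentencepiece as spm_lib

spm = spm_lib.SentencePieceProcessor(model_file="model.spm")  # placeholder path

pieces = spm.encode("🤣", out_type=str)  # ['▁', '<0xF0>', '<0x9F>', '<0xA4>', '<0xA3>']
print(spm.decode(pieces))                # reconstructs '🤣' from the byte pieces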
XapaJIaMnu commented 1 year ago

Updated to use spm pieces as opposed to spm vocab ids so that the input can also be somewhat human readable.

ZJaume commented 1 year ago

Careful: in SP models without byte fallback, unknown characters are left as they are instead of being mapped to the unk token when tokenizing into pieces:

>>> spm.encode('ç', out_type=int)
[25, 0]
>>> spm.encode('ç', out_type=str)
['▁', 'ç']
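A hedged sketch of how to detect this with the Python API (placeholder model path): a piece that is missing from the vocab still shows up verbatim in the piece output, but converts to the unk id:

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="no_byte_fallback.spm")  # placeholder

piece = "ç"                  # not in the vocab in this example
pid = sp.piece_to_id(piece)  # unknown pieces map to the unk id (0 above)
print(pid == sp.unk_id())    # True, so the piece still becomes <unk> downstream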
emjotde commented 1 year ago

There is basically already a way to do that. If you use spm_export_vocab --model bla.spm | cut -f 1 > bla.txt you can just do marian -v bla.txt bla.txt -t src.spm_tokenized.txt tgt.spm_tokenized.txt
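The same piece list can also be dumped from Python; a minimal sketch with a placeholder model path, mirroring the spm_export_vocab | cut step above:

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="bla.spm")  # placeholder path

# one piece per line, equivalent to spm_export_vocab --model bla.spm | cut -f 1
with open("bla.txt", "w", encoding="utf-8") as f:
    for i in range(sp.get_piece_size()):
        f.write(sp.id_to_piece(i) + "\n")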

XapaJIaMnu commented 1 year ago

Hi, the goal of this change is to allow training on SPM-encoded corpora, while translation/validation still happens on raw (non-SPM-encoded) corpora so you can get accurate BLEU scores.

It also brings parity with the --no-spm-decode option. Technically both of those could be achieved by converting the SPM vocabulary into a plain vocabulary, but we already have an option for --no-spm-decode, yet no option for --no-spm-encode.

XapaJIaMnu commented 1 year ago

@ZJaume, I just tested with sentencepiece and a vocab that doesn't have byte fallback, and it does indeed pass unknown characters through when encoding:

$ cat test.bg | ~/marian-dev/build/spm_encode --model vocab.esen.spm 
▁en g lish ▁text ▁ бг ▁ текст ▁ 靐
$ cat test.bg 
english text бг текст  靐

$ cat test.bg | ~/marian-dev/build/spm_encode --model vocab.esen.spm | ~/marian-dev/build/marian-decoder -m model.npz -v vocab.esen.spm vocab.esen.spm  --mini-batch 1 --maxi-batch 1 --cpu-threads 1 --no-spm-encode --quiet --quiet-translation
texto english
$ cat test.bg | ~/marian-dev/build/marian-decoder -m model.npz -v vocab.esen.spm vocab.esen.spm  --mini-batch 1 --maxi-batch 1 --cpu-threads 1 --quiet --quiet-translation
texto english

I also looked at the source: spm_->PieceToId(), which we use to produce the vocab ID, can generate unks, as seen here: https://github.com/marian-nmt/marian-dev/blob/master/src/data/sentencepiece_vocab.cpp#L50

In light of this, I think this is ready to merge.

snukky commented 1 year ago

@XapaJIaMnu Will you resolve the conflicts (they seem simple) and update the patch number in the VERSION file, or would you prefer me to do that? I can then merge.

XapaJIaMnu commented 1 year ago

I think I fixed it, @snukky.