If we prefer to produce output in SPM pieces, then we could use this for mapping.
I was initially of the opinion that token ids would be safer, but looking at how byte fallback pieces look, I'd say pieces are fine. Maybe even better because they're somewhat human readable and you can see what's going on.
>>> import sentencepiece
>>> spm = sentencepiece.SentencePieceProcessor(model_file='model.spm')  # path is illustrative
>>> spm.encode('🤣', out_type=str)
['▁', '<0xF0>', '<0x9F>', '<0xA4>', '<0xA3>']
>>> spm.encode('🤣', out_type=int)
[275, 247, 166, 171, 170]
Updated to use spm pieces as opposed to spm vocab ids so that the input can also be somewhat human readable.
Careful: in SP models without byte fallback, unknown characters are left as-is instead of being replaced by the unk token when tokenizing into pieces:
>>> spm.encode('ç', out_type=int)
[25, 0]
>>> spm.encode('ç', out_type=str)
['▁', 'ç']
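A quick way to spot these silent unks from the command line (a sketch; it assumes spm_encode from sentencepiece is on the path, and model.spm is a placeholder for a model without byte fallback): encode the same input once to pieces and once to ids, and look for the unk id (0 here) in the id output.
$ echo 'ç' | spm_encode --model model.spm --output_format=piece
▁ ç
$ echo 'ç' | spm_encode --model model.spm --output_format=id
25 0
# the 0 is the unk id, so the 'ç' piece above is not actually in the vocabulary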
There is basically already a way to do that. If you export the vocabulary with
$ spm_export_vocab --model bla.spm | cut -f 1 > bla.txt
you can just do
$ marian -v bla.txt bla.txt -t src.spm_tokenized.txt tgt.spm_tokenized.txt
Hi, so the goal of this change is to allow training on corpora that have already been SPM-encoded, while translation/validation still happens on raw (non-SPM-encoded) corpora so you can get accurate BLEU scores. It also brings parity with the --no-spm-decode option. Technically both of those could be achieved by transforming the SPM vocabulary into a plain vocabulary, but we have an option for --no-spm-decode, yet no option for --no-spm-encode.
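With both flags, the whole round trip can then be driven from outside marian (a sketch; the model and vocab file names are placeholders, and the single shared vocab mirrors the transcript below):
$ cat input.txt | spm_encode --model vocab.spm \
    | marian-decoder -m model.npz -v vocab.spm vocab.spm --no-spm-encode --no-spm-decode \
    | spm_decode --model vocab.spm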
@ZJaume, I just tested with a sentencepiece vocab that doesn't have byte fallback, and it does indeed pass unks through when encoding:
$ cat test.bg | ~/marian-dev/build/spm_encode --model vocab.esen.spm
▁en g lish ▁text ▁ бг ▁ текст ▁ 靐
$ cat test.bg
english text бг текст 靐
$ cat test.bg | ~/marian-dev/build/spm_encode --model vocab.esen.spm | ~/marian-dev/build/marian-decoder -m model.npz -v vocab.esen.spm vocab.esen.spm --mini-batch 1 --maxi-batch 1 --cpu-threads 1 --no-spm-encode --quiet --quiet-translation
texto english
$ cat test.bg | ~/marian-dev/build/marian-decoder -m model.npz -v vocab.esen.spm vocab.esen.spm --mini-batch 1 --maxi-batch 1 --cpu-threads 1 --quiet --quiet-translation
texto english
I also looked at the source: spm_->PieceToId(), which we use to produce the vocab ID, can generate unks, as seen here: https://github.com/marian-nmt/marian-dev/blob/master/src/data/sentencepiece_vocab.cpp#L50
In light of this, I think this is ready to merge.
@XapaJIaMnu Will you resolve the conflicts (they seem simple) and update the patch number in the VERSION file, or would you prefer me to do that? I can then merge.
I think I fixed it, @snukky.
Description
This PR adds the ability to train or decode with a sentence that has already had spm_encode --model model.spm applied to it. The benefit is that we can modify the SPM output before feeding the data to marian, giving us more flexibility than SPM alone allows.
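For example (a hypothetical modification; the <2en> token and file names are illustrative only, and <2en> would need to be a user-defined symbol in the vocabulary), you could prepend a target-language token to each pre-encoded line before handing the stream to marian:
$ cat input.txt | spm_encode --model model.spm \
    | sed 's/^/<2en> /' \
    | marian-decoder -m model.npz -v model.spm model.spm --no-spm-encode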
The code is minimally intrusive and doesn't change the behavior unless the flag is toggled on.
How to test
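One possible check (a sketch adapted from the transcript above; model and vocab names are placeholders): translate the same input with and without external encoding and confirm the outputs match.
$ cat test.txt | marian-decoder -m model.npz -v vocab.spm vocab.spm --quiet-translation > plain.out
$ cat test.txt | spm_encode --model vocab.spm \
    | marian-decoder -m model.npz -v vocab.spm vocab.spm --no-spm-encode --quiet-translation > preenc.out
$ diff plain.out preenc.out   # expect no differences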
Checklist