Come to think of it, the output vocab size of EN-RU is probably related to #3
./data-bin/en-ru/en_sp32k_ru_sp32k/default-train/preprocess.log:[ru] Dictionary: 148016 types
This is related to sentencepiece behavior where unknown words are piped through unprocessed when IDs are not used as the output format. It seems someone had noticed this on Twitter, and Matt Post filed an issue as well.
To get around this, set byte_fallback=True in the sentencepiece trainer code.
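For reference, a minimal sketch of the trainer call. The input file, model prefix, and options other than byte_fallback are placeholders, not the actual training setup used here:

import sentencepiece as spm

# Hypothetical corpus path and model prefix -- substitute the real training data.
spm.SentencePieceTrainer.train(
    input="train.iu.txt",       # raw text, one sentence per line
    model_prefix="iu_sp1k",     # writes iu_sp1k.model and iu_sp1k.vocab
    vocab_size=1000,
    model_type="unigram",
    character_coverage=1.0,
    byte_fallback=True,         # decompose unknown characters into <0xNN> byte pieces
)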
As a sanity check, we can try to segment a Russian word using a SP1k IU model:
Before:
In [6]: sp = spm.SentencePieceProcessor(model_file="iu_sp1k.bin")
normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization.
In [7]: sp.encode("погибли", out_type=str)
Out[7]: ['▁', 'погибли']
After making the change to byte_fallback=True, the model segments the OOV word into bytes:
In [8]: print(sp.encode("погибли", out_type=str))
['▁', '<0xD0>', '<0xBF>', '<0xD0>', '<0xBE>', '<0xD0>', '<0xB3>', '<0xD0>', '<0xB8>', '<0xD0>', '<0xB1>', '<0xD0>', '<0xBB>', '<0xD0>', '<0xB8>']
Why are the <0xD0> bytes inserted, though?
More sanity checking:
In [1]: import sentencepiece as spm
In [2]: sp = spm.SentencePieceProcessor(model_file="iu_sp1k.bin")
normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization.
In [3]: iku = "ᐃᓄᒃᑎᑐᑦ"
In [4]: sp.encode(iku, out_type=str)
Out[4]: ['▁ᐃᓄᒃ', 'ᑎ', 'ᑐ', 'ᑦ']
In [5]: sp.encode(iku, out_type=int)
Out[5]: [750, 275, 335, 261]
In [6]: cyr = "погибли"
In [7]: sp.encode(cyr, out_type=str)
Out[7]:
['▁',
'<0xD0>',
'<0xBF>',
'<0xD0>',
'<0xBE>',
'<0xD0>',
'<0xB3>',
'<0xD0>',
'<0xB8>',
'<0xD0>',
'<0xB1>',
'<0xD0>',
'<0xBB>',
'<0xD0>',
'<0xB8>']
In [8]: sp.encode(cyr, out_type=int)
Out[8]: [266, 211, 194, 211, 193, 211, 182, 211, 187, 211, 180, 211, 190, 211, 187]
In [12]: heb = "יִשְׂרָאֵל"
In [13]: sp.encode(heb, out_type=str)
Out[13]:
['▁',
'<0xD7>',
'<0x99>',
'<0xD6>',
'<0xB4>',
'<0xD7>',
'<0xA9>',
'<0xD6>',
'<0xB0>',
'<0xD7>',
'<0x82>',
'<0xD7>',
'<0xA8>',
'<0xD6>',
'<0xB8>',
'<0xD7>',
'<0x90>',
'<0xD6>',
'<0xB5>',
'<0xD7>',
'<0x9C>']
In [14]: sp.encode(heb, out_type=int)
Out[14]:
[266,
218,
156,
217,
183,
218,
172,
217,
179,
218,
133,
218,
171,
217,
187,
218,
147,
217,
184,
218,
159]
Non-Latin alphabets that aren't included in the training data get decomposed into their UTF-8 bytes. Cyrillic and Hebrew characters are two bytes each in UTF-8, which is why every character yields a lead byte like <0xD0> or <0xD7> followed by a continuation byte. This doesn't happen for the Latin alphabet, since those characters are single-byte and included in the training data.
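As a further check (a small sketch, not from the original session), the <0xNN> pieces are just the UTF-8 encoding of the unseen characters and can be reassembled losslessly. This assumes the byte_fallback-enabled iu_sp1k.bin model from above:

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="iu_sp1k.bin")

pieces = sp.encode("погибли", out_type=str)
# Keep only the '<0xNN>' pieces (skipping the '▁' word-boundary marker)
# and turn them back into raw bytes.
byte_values = [int(p[1:-1], 16) for p in pieces if p.startswith("<0x")]
print(bytes(byte_values).decode("utf-8"))  # -> "погибли"

sp.decode(pieces) should give back the same string as well, so no information is lost in the byte decomposition.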
After re-processing using byte_fallback=True, the vocab sizes seem much more sane now:
./data-bin/en-uz/en_sp4k_uz_sp4k/default-train/preprocess.log
[en] Dictionary: 3744 types
[en] Dictionary: 3744 types
[en] Dictionary: 3744 types
[uz] Dictionary: 3744 types
[uz] Dictionary: 3744 types
[uz] Dictionary: 3744 types
./data-bin/en-tr/en_sp32k_tr_sp32k/default-train/preprocess.log
[en] Dictionary: 31848 types
[en] Dictionary: 31848 types
[en] Dictionary: 31848 types
[tr] Dictionary: 31848 types
[tr] Dictionary: 31848 types
[tr] Dictionary: 31848 types
./data-bin/en-ru/en_sp32k_ru_sp32k/wmt-18-20/preprocess.log
[en] Dictionary: 31840 types
[en] Dictionary: 31840 types
[en] Dictionary: 31840 types
[ru] Dictionary: 31840 types
[ru] Dictionary: 31840 types
[ru] Dictionary: 31840 types
./data-bin/en-iu/en_sp1k_iu_sp1k/wmt20/preprocess.log
[en] Dictionary: 760 types
[en] Dictionary: 760 types
[iu] Dictionary: 752 types
[iu] Dictionary: 752 types
./data-bin/en-iu/en_sp1k_iu_sp1k/hansard/preprocess.log
[en] Dictionary: 760 types
[en] Dictionary: 760 types
[en] Dictionary: 760 types
[iu] Dictionary: 752 types
[iu] Dictionary: 752 types
[iu] Dictionary: 752 types
./data-bin/en-fi/en_sp32k_fi_sp32k/newstest-2019/preprocess.log
[en] Dictionary: 31848 types
[fi] Dictionary: 31848 types
./data-bin/en-uz/en_sp1k_uz_sp1k/default-train/preprocess.log
[en] Dictionary: 752 types
[en] Dictionary: 752 types
[en] Dictionary: 752 types
[uz] Dictionary: 744 types
[uz] Dictionary: 744 types
[uz] Dictionary: 744 types
./data-bin/en-fi/en_sp32k_fi_sp32k/newstest-2018/preprocess.log
[en] Dictionary: 31848 types
[fi] Dictionary: 31848 types
./data-bin/en-fi/en_sp32k_fi_sp32k/default-train/preprocess.log
[en] Dictionary: 31848 types
[en] Dictionary: 31848 types
[fi] Dictionary: 31848 types
[fi] Dictionary: 31848 types
./data-bin/en-et/en_sp32k_et_sp32k/default-train/preprocess.log
[en] Dictionary: 31840 types
[en] Dictionary: 31840 types
[en] Dictionary: 31840 types
[et] Dictionary: 31848 types
[et] Dictionary: 31848 types
[et] Dictionary: 31848 types
./data-bin/en-de/en_sp32k_de_sp32k/wmt-late/preprocess.log
[en] Dictionary: 31848 types
[en] Dictionary: 31848 types
[en] Dictionary: 31848 types
[de] Dictionary: 31840 types
[de] Dictionary: 31840 types
[de] Dictionary: 31840 types
./data-bin/en-de/en_sp32k_de_sp32k/wmt-early/preprocess.log
[en] Dictionary: 31848 types
[en] Dictionary: 31848 types
[de] Dictionary: 31840 types
[de] Dictionary: 31840 types
./data-bin/en-ru/en_sp32k_ru_sp32k/default-train/preprocess.log
[en] Dictionary: 31840 types
[en] Dictionary: 31840 types
[en] Dictionary: 31840 types
[ru] Dictionary: 31840 types
[ru] Dictionary: 31840 types
[ru] Dictionary: 31840 types
./data-bin/en-de/en_sp32k_de_sp32k/default-train/preprocess.log
[en] Dictionary: 31848 types
[en] Dictionary: 31848 types
[en] Dictionary: 31848 types
[de] Dictionary: 31840 types
[de] Dictionary: 31840 types
[de] Dictionary: 31840 types
./data-bin/en-cs/en_sp32k_cs_sp32k/wmt-late/preprocess.log
[en] Dictionary: 31856 types
[en] Dictionary: 31856 types
[en] Dictionary: 31856 types
[cs] Dictionary: 31856 types
[cs] Dictionary: 31856 types
[cs] Dictionary: 31856 types
./data-bin/en-cs/en_sp32k_cs_sp32k/wmt-early/preprocess.log
[en] Dictionary: 31856 types
[en] Dictionary: 31856 types
[cs] Dictionary: 31856 types
[cs] Dictionary: 31856 types
./data-bin/en-cs/en_sp32k_cs_sp32k/default-train/preprocess.log
[en] Dictionary: 31856 types
[en] Dictionary: 31856 types
[en] Dictionary: 31856 types
[cs] Dictionary: 31856 types
[cs] Dictionary: 31856 types
[cs] Dictionary: 31856 types
Just leaving a note here that applying mBART's sentence.bpe.model to IU (which it was not trained on) gives the following:
[en] Dictionary: 250001 types
[en] /home/jonne/datasets/mrl_nmt22/processed/en-iu/en_spmbart_iu_spmbart/hansard/en-iu.train.en: 1293439 sents, 26237836 tokens, 5.72e-05% replaced by <unk>
[en] Dictionary: 250001 types
[en] /home/jonne/datasets/mrl_nmt22/processed/en-iu/en_spmbart_iu_spmbart/hansard/en-iu.dev.en: 2674 sents, 77912 tokens, 0.0% replaced by <unk>
[en] Dictionary: 250001 types
[en] /home/jonne/datasets/mrl_nmt22/processed/en-iu/en_spmbart_iu_spmbart/hansard/en-iu.test.en: 3602 sents, 104534 tokens, 0.0% replaced by <unk>
[iu] Dictionary: 250001 types
[iu] /home/jonne/datasets/mrl_nmt22/processed/en-iu/en_spmbart_iu_spmbart/hansard/en-iu.train.iu: 1293439 sents, 19427065 tokens, 39.2% replaced by <unk>
[iu] Dictionary: 250001 types
[iu] /home/jonne/datasets/mrl_nmt22/processed/en-iu/en_spmbart_iu_spmbart/hansard/en-iu.dev.iu: 2674 sents, 56301 tokens, 41.1% replaced by <unk>
[iu] Dictionary: 250001 types
[iu] /home/jonne/datasets/mrl_nmt22/processed/en-iu/en_spmbart_iu_spmbart/hansard/en-iu.test.iu: 3602 sents, 80432 tokens, 41.4% replaced by <unk>
Wrote preprocessed data to data-bin/en-iu/en_spmbart_iu_spmbart/hansard
At first I thought the vocab was too big. Then I applied the model to Estonian, and got:
2022-01-25 15:03:17 | INFO | fairseq_cli.preprocess | [en] Dictionary: 250001 types
2022-01-25 15:04:41 | INFO | fairseq_cli.preprocess | [en] /home/jonne/datasets/mrl_nmt22/processed/en-et/en_spmbart_et_spmbart/default-train/en-et.train.en: 13528733 sents, 316437843 tokens, 0.00424% replaced by <unk>
2022-01-25 15:04:41 | INFO | fairseq_cli.preprocess | [en] Dictionary: 250001 types
2022-01-25 15:04:45 | INFO | fairseq_cli.preprocess | [en] /home/jonne/datasets/mrl_nmt22/processed/en-et/en_spmbart_et_spmbart/default-train/en-et.dev.en: 2000 sents, 54722 tokens, 0.0% replaced by <unk>
2022-01-25 15:04:45 | INFO | fairseq_cli.preprocess | [en] Dictionary: 250001 types
2022-01-25 15:04:50 | INFO | fairseq_cli.preprocess | [en] /home/jonne/datasets/mrl_nmt22/processed/en-et/en_spmbart_et_spmbart/default-train/en-et.test.en: 2000 sents, 58393 tokens, 0.0% replaced by <unk>
2022-01-25 15:04:50 | INFO | fairseq_cli.preprocess | [et] Dictionary: 250001 types
2022-01-25 15:06:20 | INFO | fairseq_cli.preprocess | [et] /home/jonne/datasets/mrl_nmt22/processed/en-et/en_spmbart_et_spmbart/default-train/en-et.train.et: 13528733 sents, 348604116 tokens, 0.00364% replaced by <unk>
2022-01-25 15:06:20 | INFO | fairseq_cli.preprocess | [et] Dictionary: 250001 types
2022-01-25 15:06:25 | INFO | fairseq_cli.preprocess | [et] /home/jonne/datasets/mrl_nmt22/processed/en-et/en_spmbart_et_spmbart/default-train/en-et.dev.et: 2000 sents, 59090 tokens, 0.0% replaced by <unk>
2022-01-25 15:06:25 | INFO | fairseq_cli.preprocess | [et] Dictionary: 250001 types
2022-01-25 15:06:29 | INFO | fairseq_cli.preprocess | [et] /home/jonne/datasets/mrl_nmt22/processed/en-et/en_spmbart_et_spmbart/default-train/en-et.test.et: 2000 sents, 62865 tokens, 0.0% replaced by <unk>
2022-01-25 15:06:29 | INFO | fairseq_cli.preprocess | Wrote preprocessed data to /home/jonne/mrl_nmt22/data-bin/en-et/en_spmbart_et_spmbart/default-train
Both have a ~250k vocab size, which matches the mBART paper.
However, note how high the UNK replacement rate is in dev / test for IU.
Looking at how IU gets segmented, it's clear that nothing is getting segmented properly:
(torch-rtx-3090) jonne@lignos07:~/datasets/mrl_nmt22/processed/en-iu/en_spmbart_iu_spmbart/wmt20$ paste en-iu.dev{.detok.iu,.iu} | tail -n 20 | head -n 1 | tr '\t' '\n'
"ᒥᓂᔅᑕ ᔪᐊᓇᓯ ᑎᑎᕋᓚᐅᖅᑐᖅ ᓴᖅᑭᖅᑕᐅᓪᓗᑎᒃ ᓄᓇᑦᓯᐊᖅ ᓅᔅᑯᓐᓄᑦ ᓅᕙᐃᒻᕙ 25, 2019 ᐅᖃᐅᓯᓕᓐᓂᒃ ᐅᖃᖃᑎᒌᖃᑦᑕᕐᓯᒪᓂᖏᓐᓂᒃ ᐃᓕᓐᓂᐊᕐᓂᓕᕆᔨᒃᑯᑦ ᐱᓕᕆᒡᕕᐊ ᐊᒻᒪᓗ ᓄᓇᕗᑦ ᑐᙵᕕᒃ ᑎᒥᖓ ᐱᖁᔭᒃᓴᖅ 25 ᐱᔾᔪᑎᒋᓪᓗᒍ, ᓇᓗᓇᐃᕐᓯᓪᓗᓂ ᑖᒃᑯᐊ ᐅᖃᖃᑎᒌᓐᓂᕆᖃᑦᑕᓚᐅᖅᑕᖏᑦ ᐊᐅᓚᓂᖃᓚᐅᕐᓂᖏᓐᓂᒃ ᐅᑉᐱᕆᖃᑦᑕᐅᑎᓂᒃᑯᑦ ᐊᒻᒪᓗ ᒪᓂᒪᑎᑦᓯᕕᐅᓪᓗᑎᒃ ᑐᙵᕕᒃᑯᑦ ᐱᕕᒃᓴᖃᖅᑎᑕᐅᑦᓯᐊᖅᑐᑎᒃ ᐅᖃᐅᓯᖃᖁᓪᓗᒋᑦ ᐃᓱᒫᓘᑎᒥᓂᒃ."
▁" ᒥᓂᔅᑕ ▁ ᔪᐊᓇᓯ ▁ ᑎᑎᕋᓚᐅᖅᑐᖅ ▁ ᓴᖅᑭᖅᑕᐅᓪᓗᑎᒃ ▁ ᓄᓇᑦᓯᐊᖅ ▁ ᓅᔅᑯᓐᓄᑦ ▁ ᓅᕙᐃᒻᕙ ▁25 , ▁2019 ▁ ᐅᖃᐅᓯᓕᓐᓂᒃ ▁ ᐅᖃᖃᑎᒌᖃᑦᑕᕐᓯᒪᓂᖏᓐᓂᒃ ▁ ᐃᓕᓐᓂᐊᕐᓂᓕᕆᔨᒃᑯᑦ ▁ ᐱᓕᕆᒡᕕᐊ ▁ ᐊᒻᒪᓗ ▁ ᓄᓇᕗᑦ ▁ ᑐᙵᕕᒃ ▁ ᑎᒥᖓ ▁ ᐱᖁᔭᒃᓴᖅ ▁25 ▁ ᐱᔾᔪᑎᒋᓪᓗᒍ , ▁ ᓇᓗᓇᐃᕐᓯᓪᓗᓂ ▁ ᑖᒃᑯᐊ ▁ ᐅᖃᖃᑎᒌᓐᓂᕆᖃᑦᑕᓚᐅᖅᑕᖏᑦ ▁ ᐊᐅᓚᓂᖃᓚᐅᕐᓂᖏᓐᓂᒃ ▁ ᐅᑉᐱᕆᖃᑦᑕᐅᑎᓂᒃᑯᑦ ▁ ᐊᒻᒪᓗ ▁ ᒪᓂᒪᑎᑦᓯᕕᐅᓪᓗᑎᒃ ▁ ᑐᙵᕕᒃᑯᑦ ▁ ᐱᕕᒃᓴᖃᖅᑎᑕᐅᑦᓯᐊᖅᑐᑎᒃ ▁ ᐅᖃᐅᓯᖃᖁᓪᓗᒋᑦ ▁ ᐃᓱᒫᓘᑎᒥᓂᒃ ."
No wonder there are UNK replacements!
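A quick way to quantify this (a sketch; the sentence.bpe.model path is an assumption about where mBART's model lives) is to count how many pieces the SP model itself maps to <unk>, which should roughly mirror the replacement rate fairseq-preprocess reports:

import sentencepiece as spm

# Hypothetical path -- point this at mBART's sentence.bpe.model.
sp = spm.SentencePieceProcessor(model_file="sentence.bpe.model")

total = unk = 0
with open("en-iu.dev.detok.iu", encoding="utf-8") as f:
    for line in f:
        for piece_id in sp.encode(line.strip(), out_type=int):
            total += 1
            unk += sp.is_unknown(piece_id)

print(f"{unk}/{total} pieces ({100 * unk / max(total, 1):.1f}%) map to <unk>")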
Closing this since the issue has been identified as stemming from byte_fallback=False being the default.
The version with Python module-based checking using sp.is_unknown lives in the decompose-sentencepiece-oov branch. Note that it slows things down considerably, so it may not be worth it (especially for bilingual models, which we can train with byte_fallback=True).
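For reference, a rough sketch of what that module-based check might look like; this is my reconstruction under the stated assumptions (the function name is mine, not the branch's actual code): encode to pieces, and whenever a piece maps to <unk>, replace it with its UTF-8 byte pieces.

import sentencepiece as spm

def encode_with_byte_oov(sp, text):
    """Encode text, decomposing any piece the model doesn't know into <0xNN> byte pieces."""
    out = []
    for piece in sp.encode(text, out_type=str):
        if sp.is_unknown(sp.piece_to_id(piece)):
            # Strip the word-boundary marker before taking bytes, then re-emit it.
            word = piece.lstrip("▁")
            if piece.startswith("▁"):
                out.append("▁")
            out.extend(f"<0x{b:02X}>" for b in word.encode("utf-8"))
        else:
            out.append(piece)
    return out

sp = spm.SentencePieceProcessor(model_file="iu_sp1k.bin")
print(encode_with_byte_oov(sp, "погибли"))

Doing this piece by piece in Python adds per-token overhead, which is consistent with the slowdown noted above.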
As mjpost noted on Twitter, models often have large vocabs, and I noticed that this is happening here too in the SP model counts grepped from the fairseq-preprocess logs. Problem? Maybe not, but certainly not optimal. Consider rerunning the bilingual baselines later with better vocabs once multilingual / multi-task training is running (resources are limited).