Come to think of it, the output vocab size of EN-RU is probably related to #3
./data-bin/en-ru/en_sp32k_ru_sp32k/default-train/preprocess.log:[ru] Dictionary: 148016 types
This is related to sentencepiece behavior where unknown words are piped through unprocessed when IDs are not used as the output format. It seems someone had noticed this on Twitter, and Matt Post filed an issue as well.
To get around this, set byte_fallback=True in the sentencepiece trainer code.
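For reference, a minimal sketch of the trainer call. The input file, model prefix, and options other than byte_fallback are placeholders, not the actual training setup used here:

import sentencepiece as spm

# Hypothetical corpus path and model prefix -- substitute the real training data.
spm.SentencePieceTrainer.train(
    input="train.iu.txt",       # raw text, one sentence per line
    model_prefix="iu_sp1k",     # writes iu_sp1k.model and iu_sp1k.vocab
    vocab_size=1000,
    model_type="unigram",
    character_coverage=1.0,
    byte_fallback=True,         # decompose unknown characters into <0xNN> byte pieces
)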
As a sanity check, we can try to segment a Russian word using a SP1k IU model:
Before:
In [6]: sp = spm.SentencePieceProcessor(model_file="iu_sp1k.bin")
normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization.
In [7]: sp.encode("погибли", out_type=str)
Out[7]: ['▁', 'погибли']
After making the change to byte_fallback=True, the model segments the OOV word into bytes:
In [8]: print(sp.encode("погибли", out_type=str))
['▁', '<0xD0>', '<0xBF>', '<0xD0>', '<0xBE>', '<0xD0>', '<0xB3>', '<0xD0>', '<0xB8>', '<0xD0>', '<0xB1>', '<0xD0>', '<0xBB>', '<0xD0>', '<0xB8>']
Why are the <0xD0> bytes inserted, though?
More sanity checking:
In [1]: import sentencepiece as spm
In [2]: sp = spm.SentencePieceProcessor(model_file="iu_sp1k.bin")
normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization.
In [3]: iku = "ᐃᓄᒃᑎᑐᑦ"
In [4]: sp.encode(iku, out_type=str)
Out[4]: ['▁ᐃᓄᒃ', 'ᑎ', 'ᑐ', 'ᑦ']
In [5]: sp.encode(iku, out_type=int)
Out[5]: [750, 275, 335, 261]
In [6]: cyr = "погибли"
In [7]: sp.encode(cyr, out_type=str)
Out[7]:
['▁',
'<0xD0>',
'<0xBF>',
'<0xD0>',
'<0xBE>',
'<0xD0>',
'<0xB3>',
'<0xD0>',
'<0xB8>',
'<0xD0>',
'<0xB1>',
'<0xD0>',
'<0xBB>',
'<0xD0>',
'<0xB8>']
In [8]: sp.encode(cyr, out_type=int)
Out[8]: [266, 211, 194, 211, 193, 211, 182, 211, 187, 211, 180, 211, 190, 211, 187]
In [12]: heb = "יִשְׂרָאֵל"
In [13]: sp.encode(heb, out_type=str)
Out[13]:
['▁',
'<0xD7>',
'<0x99>',
'<0xD6>',
'<0xB4>',
'<0xD7>',
'<0xA9>',
'<0xD6>',
'<0xB0>',
'<0xD7>',
'<0x82>',
'<0xD7>',
'<0xA8>',
'<0xD6>',
'<0xB8>',
'<0xD7>',
'<0x90>',
'<0xD6>',
'<0xB5>',
'<0xD7>',
'<0x9C>']
In [14]: sp.encode(heb, out_type=int)
Out[14]:
[266,
218,
156,
217,
183,
218,
172,
217,
179,
218,
133,
218,
171,
217,
187,
218,
147,
217,
184,
218,
159]
Non-Latin alphabets that aren't included in the training data get decomposed into their UTF-8 bytes. Cyrillic and Hebrew characters are two bytes each in UTF-8, which is why every character yields a lead byte like <0xD0> or <0xD7> followed by a continuation byte. This doesn't happen for the Latin alphabet, since those characters are single-byte and included in the training data.
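As a further check (a small sketch, not from the original session), the <0xNN> pieces are just the UTF-8 encoding of the unseen characters and can be reassembled losslessly. This assumes the byte_fallback-enabled iu_sp1k.bin model from above:

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="iu_sp1k.bin")

pieces = sp.encode("погибли", out_type=str)
# Keep only the '<0xNN>' pieces (skipping the '▁' word-boundary marker)
# and turn them back into raw bytes.
byte_values = [int(p[1:-1], 16) for p in pieces if p.startswith("<0x")]
print(bytes(byte_values).decode("utf-8"))  # -> "погибли"

sp.decode(pieces) should give back the same string as well, so no information is lost in the byte decomposition.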
After re-processing using byte_fallback=True, the vocab sizes seem much more sane now:
./data-bin/en-uz/en_sp4k_uz_sp4k/default-train/preprocess.log
[en] Dictionary: 3744 types
[en] Dictionary: 3744 types
[en] Dictionary: 3744 types
[uz] Dictionary: 3744 types
[uz] Dictionary: 3744 types
[uz] Dictionary: 3744 types
./data-bin/en-tr/en_sp32k_tr_sp32k/default-train/preprocess.log
[en] Dictionary: 31848 types
[en] Dictionary: 31848 types
[en] Dictionary: 31848 types
[tr] Dictionary: 31848 types
[tr] Dictionary: 31848 types
[tr] Dictionary: 31848 types
./data-bin/en-ru/en_sp32k_ru_sp32k/wmt-18-20/preprocess.log
[en] Dictionary: 31840 types
[en] Dictionary: 31840 types
[en] Dictionary: 31840 types
[ru] Dictionary: 31840 types
[ru] Dictionary: 31840 types
[ru] Dictionary: 31840 types
./data-bin/en-iu/en_sp1k_iu_sp1k/wmt20/preprocess.log
[en] Dictionary: 760 types
[en] Dictionary: 760 types
[iu] Dictionary: 752 types
[iu] Dictionary: 752 types
./data-bin/en-iu/en_sp1k_iu_sp1k/hansard/preprocess.log
[en] Dictionary: 760 types
[en] Dictionary: 760 types
[en] Dictionary: 760 types
[iu] Dictionary: 752 types
[iu] Dictionary: 752 types
[iu] Dictionary: 752 types
./data-bin/en-fi/en_sp32k_fi_sp32k/newstest-2019/preprocess.log
[en] Dictionary: 31848 types
[fi] Dictionary: 31848 types
./data-bin/en-uz/en_sp1k_uz_sp1k/default-train/preprocess.log
[en] Dictionary: 752 types
[en] Dictionary: 752 types
[en] Dictionary: 752 types
[uz] Dictionary: 744 types
[uz] Dictionary: 744 types
[uz] Dictionary: 744 types
./data-bin/en-fi/en_sp32k_fi_sp32k/newstest-2018/preprocess.log
[en] Dictionary: 31848 types
[fi] Dictionary: 31848 types
./data-bin/en-fi/en_sp32k_fi_sp32k/default-train/preprocess.log
[en] Dictionary: 31848 types
[en] Dictionary: 31848 types
[fi] Dictionary: 31848 types
[fi] Dictionary: 31848 types
./data-bin/en-et/en_sp32k_et_sp32k/default-train/preprocess.log
[en] Dictionary: 31840 types
[en] Dictionary: 31840 types
[en] Dictionary: 31840 types
[et] Dictionary: 31848 types
[et] Dictionary: 31848 types
[et] Dictionary: 31848 types
./data-bin/en-de/en_sp32k_de_sp32k/wmt-late/preprocess.log
[en] Dictionary: 31848 types
[en] Dictionary: 31848 types
[en] Dictionary: 31848 types
[de] Dictionary: 31840 types
[de] Dictionary: 31840 types
[de] Dictionary: 31840 types
./data-bin/en-de/en_sp32k_de_sp32k/wmt-early/preprocess.log
[en] Dictionary: 31848 types
[en] Dictionary: 31848 types
[de] Dictionary: 31840 types
[de] Dictionary: 31840 types
./data-bin/en-ru/en_sp32k_ru_sp32k/default-train/preprocess.log
[en] Dictionary: 31840 types
[en] Dictionary: 31840 types
[en] Dictionary: 31840 types
[ru] Dictionary: 31840 types
[ru] Dictionary: 31840 types
[ru] Dictionary: 31840 types
./data-bin/en-de/en_sp32k_de_sp32k/default-train/preprocess.log
[en] Dictionary: 31848 types
[en] Dictionary: 31848 types
[en] Dictionary: 31848 types
[de] Dictionary: 31840 types
[de] Dictionary: 31840 types
[de] Dictionary: 31840 types
./data-bin/en-cs/en_sp32k_cs_sp32k/wmt-late/preprocess.log
[en] Dictionary: 31856 types
[en] Dictionary: 31856 types
[en] Dictionary: 31856 types
[cs] Dictionary: 31856 types
[cs] Dictionary: 31856 types
[cs] Dictionary: 31856 types
./data-bin/en-cs/en_sp32k_cs_sp32k/wmt-early/preprocess.log
[en] Dictionary: 31856 types
[en] Dictionary: 31856 types
[cs] Dictionary: 31856 types
[cs] Dictionary: 31856 types
./data-bin/en-cs/en_sp32k_cs_sp32k/default-train/preprocess.log
[en] Dictionary: 31856 types
[en] Dictionary: 31856 types
[en] Dictionary: 31856 types
[cs] Dictionary: 31856 types
[cs] Dictionary: 31856 types
[cs] Dictionary: 31856 types
Just leaving a note here that applying mBART's sentence.bpe.model to IU (which it was not trained on) gives the following:
[en] Dictionary: 250001 types
[en] /home/jonne/datasets/mrl_nmt22/processed/en-iu/en_spmbart_iu_spmbart/hansard/en-iu.train.en: 1293439 sents, 26237836 tokens, 5.72e-05% replaced by <unk>
[en] Dictionary: 250001 types
[en] /home/jonne/datasets/mrl_nmt22/processed/en-iu/en_spmbart_iu_spmbart/hansard/en-iu.dev.en: 2674 sents, 77912 tokens, 0.0% replaced by <unk>
[en] Dictionary: 250001 types
[en] /home/jonne/datasets/mrl_nmt22/processed/en-iu/en_spmbart_iu_spmbart/hansard/en-iu.test.en: 3602 sents, 104534 tokens, 0.0% replaced by <unk>
[iu] Dictionary: 250001 types
[iu] /home/jonne/datasets/mrl_nmt22/processed/en-iu/en_spmbart_iu_spmbart/hansard/en-iu.train.iu: 1293439 sents, 19427065 tokens, 39.2% replaced by <unk>
[iu] Dictionary: 250001 types
[iu] /home/jonne/datasets/mrl_nmt22/processed/en-iu/en_spmbart_iu_spmbart/hansard/en-iu.dev.iu: 2674 sents, 56301 tokens, 41.1% replaced by <unk>
[iu] Dictionary: 250001 types
[iu] /home/jonne/datasets/mrl_nmt22/processed/en-iu/en_spmbart_iu_spmbart/hansard/en-iu.test.iu: 3602 sents, 80432 tokens, 41.4% replaced by <unk>
Wrote preprocessed data to data-bin/en-iu/en_spmbart_iu_spmbart/hansard
At first I thought the vocab was too big. Then I applied the model to Estonian, and got:
2022-01-25 15:03:17 | INFO | fairseq_cli.preprocess | [en] Dictionary: 250001 types
2022-01-25 15:04:41 | INFO | fairseq_cli.preprocess | [en] /home/jonne/datasets/mrl_nmt22/processed/en-et/en_spmbart_et_spmbart/default-train/en-et.train.en: 13528733 sents, 316437843 tokens, 0.00424% replaced by <unk>
2022-01-25 15:04:41 | INFO | fairseq_cli.preprocess | [en] Dictionary: 250001 types
2022-01-25 15:04:45 | INFO | fairseq_cli.preprocess | [en] /home/jonne/datasets/mrl_nmt22/processed/en-et/en_spmbart_et_spmbart/default-train/en-et.dev.en: 2000 sents, 54722 tokens, 0.0% replaced by <unk>
2022-01-25 15:04:45 | INFO | fairseq_cli.preprocess | [en] Dictionary: 250001 types
2022-01-25 15:04:50 | INFO | fairseq_cli.preprocess | [en] /home/jonne/datasets/mrl_nmt22/processed/en-et/en_spmbart_et_spmbart/default-train/en-et.test.en: 2000 sents, 58393 tokens, 0.0% replaced by <unk>
2022-01-25 15:04:50 | INFO | fairseq_cli.preprocess | [et] Dictionary: 250001 types
2022-01-25 15:06:20 | INFO | fairseq_cli.preprocess | [et] /home/jonne/datasets/mrl_nmt22/processed/en-et/en_spmbart_et_spmbart/default-train/en-et.train.et: 13528733 sents, 348604116 tokens, 0.00364% replaced by <unk>
2022-01-25 15:06:20 | INFO | fairseq_cli.preprocess | [et] Dictionary: 250001 types
2022-01-25 15:06:25 | INFO | fairseq_cli.preprocess | [et] /home/jonne/datasets/mrl_nmt22/processed/en-et/en_spmbart_et_spmbart/default-train/en-et.dev.et: 2000 sents, 59090 tokens, 0.0% replaced by <unk>
2022-01-25 15:06:25 | INFO | fairseq_cli.preprocess | [et] Dictionary: 250001 types
2022-01-25 15:06:29 | INFO | fairseq_cli.preprocess | [et] /home/jonne/datasets/mrl_nmt22/processed/en-et/en_spmbart_et_spmbart/default-train/en-et.test.et: 2000 sents, 62865 tokens, 0.0% replaced by <unk>
2022-01-25 15:06:29 | INFO | fairseq_cli.preprocess | Wrote preprocessed data to /home/jonne/mrl_nmt22/data-bin/en-et/en_spmbart_et_spmbart/default-train
Both have a ~250k vocab size, which matches the mBART paper.
However, note how high the UNK replacement rate is in dev / test for IU.
Looking at how IU gets segmented, it's clear that nothing is getting segmented properly:
(torch-rtx-3090) jonne@lignos07:~/datasets/mrl_nmt22/processed/en-iu/en_spmbart_iu_spmbart/wmt20$ paste en-iu.dev{.detok.iu,.iu} | tail -n 20 | head -n 1 | tr '\t' '\n'
"ᒥᓂᔅᑕ ᔪᐊᓇᓯ ᑎᑎᕋᓚᐅᖅᑐᖅ ᓴᖅᑭᖅᑕᐅᓪᓗᑎᒃ ᓄᓇᑦᓯᐊᖅ ᓅᔅᑯᓐᓄᑦ ᓅᕙᐃᒻᕙ 25, 2019 ᐅᖃᐅᓯᓕᓐᓂᒃ ᐅᖃᖃᑎᒌᖃᑦᑕᕐᓯᒪᓂᖏᓐᓂᒃ ᐃᓕᓐᓂᐊᕐᓂᓕᕆᔨᒃᑯᑦ ᐱᓕᕆᒡᕕᐊ ᐊᒻᒪᓗ ᓄᓇᕗᑦ ᑐᙵᕕᒃ ᑎᒥᖓ ᐱᖁᔭᒃᓴᖅ 25 ᐱᔾᔪᑎᒋᓪᓗᒍ, ᓇᓗᓇᐃᕐᓯᓪᓗᓂ ᑖᒃᑯᐊ ᐅᖃᖃᑎᒌᓐᓂᕆᖃᑦᑕᓚᐅᖅᑕᖏᑦ ᐊᐅᓚᓂᖃᓚᐅᕐᓂᖏᓐᓂᒃ ᐅᑉᐱᕆᖃᑦᑕᐅᑎᓂᒃᑯᑦ ᐊᒻᒪᓗ ᒪᓂᒪᑎᑦᓯᕕᐅᓪᓗᑎᒃ ᑐᙵᕕᒃᑯᑦ ᐱᕕᒃᓴᖃᖅᑎᑕᐅᑦᓯᐊᖅᑐᑎᒃ ᐅᖃᐅᓯᖃᖁᓪᓗᒋᑦ ᐃᓱᒫᓘᑎᒥᓂᒃ."
▁" ᒥᓂᔅᑕ ▁ ᔪᐊᓇᓯ ▁ ᑎᑎᕋᓚᐅᖅᑐᖅ ▁ ᓴᖅᑭᖅᑕᐅᓪᓗᑎᒃ ▁ ᓄᓇᑦᓯᐊᖅ ▁ ᓅᔅᑯᓐᓄᑦ ▁ ᓅᕙᐃᒻᕙ ▁25 , ▁2019 ▁ ᐅᖃᐅᓯᓕᓐᓂᒃ ▁ ᐅᖃᖃᑎᒌᖃᑦᑕᕐᓯᒪᓂᖏᓐᓂᒃ ▁ ᐃᓕᓐᓂᐊᕐᓂᓕᕆᔨᒃᑯᑦ ▁ ᐱᓕᕆᒡᕕᐊ ▁ ᐊᒻᒪᓗ ▁ ᓄᓇᕗᑦ ▁ ᑐᙵᕕᒃ ▁ ᑎᒥᖓ ▁ ᐱᖁᔭᒃᓴᖅ ▁25 ▁ ᐱᔾᔪᑎᒋᓪᓗᒍ , ▁ ᓇᓗᓇᐃᕐᓯᓪᓗᓂ ▁ ᑖᒃᑯᐊ ▁ ᐅᖃᖃᑎᒌᓐᓂᕆᖃᑦᑕᓚᐅᖅᑕᖏᑦ ▁ ᐊᐅᓚᓂᖃᓚᐅᕐᓂᖏᓐᓂᒃ ▁ ᐅᑉᐱᕆᖃᑦᑕᐅᑎᓂᒃᑯᑦ ▁ ᐊᒻᒪᓗ ▁ ᒪᓂᒪᑎᑦᓯᕕᐅᓪᓗᑎᒃ ▁ ᑐᙵᕕᒃᑯᑦ ▁ ᐱᕕᒃᓴᖃᖅᑎᑕᐅᑦᓯᐊᖅᑐᑎᒃ ▁ ᐅᖃᐅᓯᖃᖁᓪᓗᒋᑦ ▁ ᐃᓱᒫᓘᑎᒥᓂᒃ ."
No wonder there are UNK replacements!
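A quick way to quantify this (a sketch; the sentence.bpe.model path is an assumption about where mBART's model lives) is to count how many pieces the SP model itself maps to <unk>, which should roughly mirror the replacement rate fairseq-preprocess reports:

import sentencepiece as spm

# Hypothetical path -- point this at mBART's sentence.bpe.model.
sp = spm.SentencePieceProcessor(model_file="sentence.bpe.model")

total = unk = 0
with open("en-iu.dev.detok.iu", encoding="utf-8") as f:
    for line in f:
        for piece_id in sp.encode(line.strip(), out_type=int):
            total += 1
            unk += sp.is_unknown(piece_id)

print(f"{unk}/{total} pieces ({100 * unk / max(total, 1):.1f}%) map to <unk>")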
Closing this since the issue has been identified as stemming from byte_fallback=False being the default.
The version with Python module-based checking using sp.is_unknown lives in the decompose-sentencepiece-oov branch. Note that it slows things down considerably, so it may not be worth it (especially for bilingual models, which we can train with byte_fallback=True).
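For reference, a rough sketch of what that module-based check might look like; this is my reconstruction under the stated assumptions (the function name is mine, not the branch's actual code): encode to pieces, and whenever a piece maps to <unk>, replace it with its UTF-8 byte pieces.

import sentencepiece as spm

def encode_with_byte_oov(sp, text):
    """Encode text, decomposing any piece the model doesn't know into <0xNN> byte pieces."""
    out = []
    for piece in sp.encode(text, out_type=str):
        if sp.is_unknown(sp.piece_to_id(piece)):
            # Strip the word-boundary marker before taking bytes, then re-emit it.
            word = piece.lstrip("▁")
            if piece.startswith("▁"):
                out.append("▁")
            out.extend(f"<0x{b:02X}>" for b in word.encode("utf-8"))
        else:
            out.append(piece)
    return out

sp = spm.SentencePieceProcessor(model_file="iu_sp1k.bin")
print(encode_with_byte_oov(sp, "погибли"))

Doing this piece by piece in Python adds per-token overhead, which is consistent with the slowdown noted above.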
As mjpost noted on Twitter, models often have large vocabs, and I noticed that this is happening here too in the SP model counts grepped from the fairseq-preprocess logs. Problem? Maybe not, but certainly not optimal. Consider rerunning the bilingual baselines later with better vocabs once multilingual / multi-task training is running (resources are limited).