Good idea. The script https://github.com/jbrry/wiki-bert-pipeline/blob/858d323e1fa3a63368441d68309d5afb9389d3fe/external_scripts/gather_external_data.py#L17 supports gathering specific corpora and then launching the wiki-bert-pipeline. For (1), we could start with a corpus 'parallel-en' (the English side of the parallel text in gdrive) and run it through the wiki-bert-pipeline to generate a wordpiece vocabulary that is English-only.
I haven't read the above paper yet, but I wonder how easy it is to merge vocabularies. Is it as simple as just merging three vocab.txt files (which look like the ones below) and removing duplicates?
ga/vocab.txt
[UNK]
[CLS]
[SEP]
[MASK]
a
##ch
##ai
##ea
s
d
...
en/vocab.txt
[UNK]
[CLS]
[SEP]
[MASK]
a
##n
##ex
##ample
...
A way to find out would be to remove all other intermediate output files generated when building the vocabulary and see whether BERT still trains as usual. If it does, that would mean it only uses the vocab.txt file as the basis for the vocabulary.
The example suggests the regular entries do not need to be in any particular order, but I'd guess that the first 4 special entries must be at the start. This could be done manually with:
head -n 4 ga/vocab.txt > combined/vocab.txt
tail -q -n +5 ??/vocab.txt | LC_ALL=C sort | uniq >> combined/vocab.txt
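The sort in the one-liner above also reorders the regular entries; if we want to keep the special tokens at the top and otherwise preserve first-seen order, a minimal Python sketch could look like this (the paths and the number of special lines are assumptions based on the examples above):

# Sketch: merge several vocab.txt files, keeping the special tokens from the
# first file at the top and dropping duplicates while preserving first-seen order.
NUM_SPECIAL = 4  # [UNK], [CLS], [SEP], [MASK] in the ga/en examples above

def read_vocab(path):
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f if line.strip()]

vocab_files = ["ga/vocab.txt", "en/vocab.txt"]  # add more files as needed

merged = read_vocab(vocab_files[0])[:NUM_SPECIAL]  # special tokens stay first
seen = set(merged)
for path in vocab_files:
    for token in read_vocab(path)[NUM_SPECIAL:]:
        if token not in seen:
            seen.add(token)
            merged.append(token)

with open("combined/vocab.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(merged) + "\n")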
The BERT vocabulary (for bert-base-uncased) is laid out as follows:
1 [PAD]
2 [unused0]
... ...
100 [unused98]
101 [UNK]
102 [CLS]
103 [SEP]
104 [MASK]
105 [unused99]
... ...
999 [unused993]
1000 !
1001 <more single characters, possibly in UTF-8 encoding ascending order>
... ...
1997 the
1998 <more words and subwords, possibly sorted by frequency in descending order>
30522 ##~
The vocabulary is 30,522 tokens. The first 999 lines are reserved: line 1 is [PAD], lines 101-104 are [UNK], [CLS], [SEP] and [MASK], and the rest are [unusedN] placeholders up to [unused993]. The vocabulary then appears to be laid out as individual characters (possibly in ascending order of their UTF-8 encoding), followed by words and subwords (possibly sorted by frequency in descending order). The unused tokens can be used to add custom tokens; this may be intended for fine-tuning purposes rather than training from scratch, I'm not certain.
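If the transformers library is available, the layout is easy to check from Python; note that the IDs it reports are 0-based, i.e. one less than the line numbers above (a quick sketch, assuming transformers is installed and the bert-base-uncased checkpoint can be downloaded):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(len(tokenizer.vocab))                       # 30522 entries
print(tokenizer.convert_tokens_to_ids("[PAD]"))   # 0    (line 1 above)
print(tokenizer.convert_tokens_to_ids("[UNK]"))   # 100  (line 101 above)
print(tokenizer.convert_tokens_to_ids("the"))     # 1996 (line 1997 above)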
As I understand it, the vocabulary is used as a dictionary mapping word_string : word_id, e.g. the : 1997. The word_id is subsequently used in a word embedding lookup table (30522 x 768) to retrieve the embedding (typically each embedding has 768 features).
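A minimal PyTorch sketch of that lookup, with the sizes taken from the numbers above (the embedding weights here are randomly initialised, not the pretrained ones):

import torch

vocab = {"the": 1996}                        # tiny stand-in for the word_string : word_id dictionary (0-based id; line 1997 in vocab.txt)
embedding = torch.nn.Embedding(30522, 768)   # the 30522 x 768 lookup table
word_id = torch.tensor([vocab["the"]])
vector = embedding(word_id)                  # one row of the table, shape (1, 768)
print(vector.shape)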
Thanks Alan. Also FYI Joachim, the example I posted skips lines 1-101 in a vocab file, which as Alan pointed out are [PAD] - [unused99] (though the example Alan uses reserves the range 0:999). I think the vocab file for bert-base-uncased must be different from multilingual BERT's, as mBERT only keeps 99 places for unused tokens; e.g. Footnote 5 in Chau et al. (2020) mentions:
MBERT’s fixed-size vocabulary contains 99 tokens designated as “unused,” whose representations were not updated during initial pretraining and can be repurposed for vocabulary augmentation without modifying the pretrained model.
The vocab file I am using for mBERT also keeps only 99. The entries are in the format of just one wordpiece token per line (a quick way to count the placeholders is sketched after the sample entries below):
##ução
##шни
administración
Italiae
##ждения
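Here is that quick count of the [unusedN] placeholders (the path is an assumption):

# count the [unusedN] placeholder lines in a vocab file
with open("mbert/vocab.txt", encoding="utf-8") as f:
    unused = [line for line in f if line.strip().startswith("[unused")]
print(len(unused))  # 99 for the mBERT vocab, 994 ([unused0]-[unused993]) for bert-base-uncased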
Perhaps they changed how they write vocab files between bert-base-uncased and multilingual-bert. In any case, I imagine the word keys are hard-coded to a token ID value in the model itself, even if that's not how they write it for multilingual-bert. So in one model 'apple' may be mapped to ID 102, while in a model for another language, e.g. French, 'rouge' may be mapped to ID 102, which means the word-to-ID lookup dictionary has to be changed to accommodate the different key-value pairs.
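As far as I know the ID is not stored separately in the file at all: it is simply the 0-based line number of the token in vocab.txt, so the word-to-ID dictionary can be rebuilt from the file itself. A minimal sketch (the path is an assumption):

# build the word-to-ID dictionary from a vocab file: the ID is the 0-based line index
def load_vocab(path):
    with open(path, encoding="utf-8") as f:
        return {line.rstrip("\n"): i for i, line in enumerate(f)}

vocab = load_vocab("en/vocab.txt")
# the same surface form can get a different ID in a different model, which is why
# the lookup dictionary has to match that model's own vocab.txt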
Thanks.
Yes, I also think that when you work with an existing model you must not append entries to the vocab files or change the order of existing entries. Vocab files with such changes are only useful for training from scratch. As the footnote quoted above says, the unused entries are there to help people add some entries in fine-tuning.
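A minimal sketch of that kind of augmentation: overwrite [unusedN] lines in place so that no existing entry changes its ID (the token list and paths are made up):

# replace [unusedN] placeholder lines with new tokens; all other IDs stay unchanged
new_tokens = ["Gaeilge", "Éire"]                  # hypothetical additions
with open("mbert/vocab.txt", encoding="utf-8") as f:
    lines = [line.rstrip("\n") for line in f]
replacements = iter(new_tokens)
for i, token in enumerate(lines):
    if token.startswith("[unused"):
        try:
            lines[i] = next(replacements)
        except StopIteration:
            break
with open("mbert/vocab.augmented.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(lines) + "\n")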
When training on Irish, English and possibly other languages, Chung et al. (2020), "Improving Multilingual Models with Language-Clustered Vocabularies", suggest creating wordpiece vocabularies for clusters of related languages and then using the union of these vocabularies as the final vocabulary during BERT training and prediction. For us this could mean splitting the data into (1) clearly English only, (2) clearly Irish only and (3) all other text, training 3 vocabularies and merging them.
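A sketch of that split, assuming the langid.py package (any other language identifier would do) and one sentence per line; the paths and the confidence threshold are made up:

from langid.langid import LanguageIdentifier, model

# normalised probabilities so the confidence is in [0, 1]
identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)

buckets = {"en": [], "ga": [], "other": []}
with open("all_text.txt", encoding="utf-8") as f:
    for line in f:
        lang, prob = identifier.classify(line)
        if lang in ("en", "ga") and prob > 0.95:   # "clearly" = high confidence
            buckets[lang].append(line)
        else:
            buckets["other"].append(line)

for name, lines in buckets.items():
    with open("split/%s.txt" % name, "w", encoding="utf-8") as out:
        out.writelines(lines)

Each bucket would then be run through the wiki-bert-pipeline separately to train its own wordpiece vocabulary, and the resulting vocab.txt files merged as discussed above.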