jowagner opened 3 years ago
Related: issue #44
Just to note that the bert-base-uncased vocabulary (as distributed by HuggingFace) contains 997 single-character tokens, plus the corresponding 997 ##subword tokens for those characters.
To take just a few examples, each of the following characters occurs in the vocabulary both as a plain token and as a ##subword token:

! " # $ % & ' ( ) * + , - . / etc.
##! ##" ### ##$ ##% ##& ##' ##( ##) ##* ##+ ##, ##- ##. ##/ etc.
Given BERT's tokenization, as you mentioned, it's hard to see how any of these ##subword tokens are ever used. Perhaps they're included for use with other tokenizer schemes, or it's related to BERT's original C++ code, which had "some additional complexity".
As per https://github.com/jbrry/Irish-BERT/issues/72#issuecomment-830132236, we switched to the WordPiece tokeniser. Are there still unusable entries?
I added the `vocab.txt` from the latest run with `No-filter` to `Theme A DCU/ga_BERT/BERT_Preprocessing/vocab.txt`.

EDIT: I don't seem to find the ##subword tokens for the punctuation symbols (but they are there as single items).
Sorting the new and old no-filter vocabs with `LC_ALL=C sort` and comparing them with `diff` and `kompare`, I can confirm the glue punctuation problem is gone, or at least less pronounced. There are still some cases of characters I am not sure about, e.g. `##` followed by an arrow pointing right.
Other observations:
> Sorting the new and old no-filter vocabs
Sorry, there is a small change from the older runs. Assuming you are using the `vocab.txt` from the directory with `<corpora_prefix>_filtering_None`, the `None` here corresponds to no additional OpusFilter filtering and using the document filters in the wikibert-pipeline (which corresponds to the `Document-heuristic` filter in the current runs).
I can add all of the `vocab.txt` files from the `None`, `Document-filter`, `OpusFilter-basic` and `OpusFilter-basic-char-lang` configurations if you would like to look at how some of the tokens change between runs, or compare them to older runs.
If comparing with older runs, the table below shows the equivalent configurations (bear in mind current runs contain a more recent Wikipedia dump).

Old | Current
--- | ---
NA | `None`
`None` | `Document-heuristic`
`basic+char-1.0+lang-0.8` | `basic+char-1.0+lang-0.8`
> Is there a threshold for the minimum frequency a character must have in order to be included?
The `min_frequency` value is set to `2`. The new vocab generation follows the same procedure as Turkish BERT.
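For reference, a sketch of how `min_frequency` would be passed when training a WordPiece vocab with the HuggingFace tokenizers library (placeholder file paths and settings, not the exact gaBERT command):

```python
from tokenizers import BertWordPieceTokenizer

# Sketch with placeholder paths; not the exact gaBERT configuration.
tokenizer = BertWordPieceTokenizer(lowercase=False, strip_accents=False)
tokenizer.train(
    files=["corpus.txt"],
    vocab_size=30000,
    min_frequency=2,  # drop candidate tokens seen fewer than 2 times
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.save_model(".")  # writes vocab.txt
```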
> Is this set differently for the two vocabulary builders? I see no corresponding options in SentencePiece's train flags.
The params passed to `spm_train` in the wikibert-pipeline are:
```sh
SENTENCEPIECE_PARAMS="
--vocab_size=30000
--input_sentence_size=100000000
--shuffle_input_sentence=true
--character_coverage=0.9999
--model_type=bpe
"
```
So I assume `--character_coverage` (its flag description is "character coverage to determine the minimum symbols") only keeps the given proportion of characters, filtering out the least common `1 - character_coverage` fraction of characters.
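For what it's worth, the same flags via the SentencePiece Python API (a sketch; the input path and model prefix here are placeholders):

```python
import sentencepiece as spm

# Same flags as above, via the Python API. Characters outside the
# covered 99.99% of the character distribution get no symbol of
# their own and map to <unk> at encoding time.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="ga_bpe",
    vocab_size=30000,
    input_sentence_size=100000000,
    shuffle_input_sentence=True,
    character_coverage=0.9999,
    model_type="bpe",
)
```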
> Many Irish words and subword units are no longer present in the new vocab. This presumably comes from the competition from the new entries under a fixed vocab size.
This is good to know. I also assume it's due to the greater proportion of noisy texts in the `No-filter` configuration and could likely change with the other filtering configurations. I will upload these to the aforementioned destination for posterity.
I added the vocab files from the 4 current filtering configurations, as well as the older two runs (with "SentencePiece" suffix), to `Theme A DCU/ga_BERT/BERT_Preprocessing/vocabs`. It should be interesting to analyse how the vocab entries change between filters (and across SentencePiece and WordPiece).
To be sure the right files are compared, it's probably best if you do it. Here are the commands I used:
```sh
LC_ALL=C sort < vocab.txt > gabert-new-vocab-sorted.txt
LC_ALL=C sort < conll17_gdrive_NCI_oscar_paracrawl_filtering_None/vocab.txt > vocab-no-filter-sorted.txt
diff -U 5 vocab-no-filter-sorted.txt gabert-new-vocab-sorted.txt > x.patch
kompare x.patch
```
(In text-only mode you can use `diff -y vocab-no-filter-sorted.txt gabert-new-vocab-sorted.txt | less` to get a side-by-side view, but this doesn't skip identical sections.)
There is at least one unusable vocabulary entry in our gabert vocab, namely `##-"`. Find all entries that BERT will never use, as BERT first splits around all non-alphanumeric characters without applying `##` glue, and replace them with `[unused123]`, where 123 is the next free index.
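A possible starting point, as a sketch rather than a final script: the reachability test below follows the punctuation rule of BERT's BasicTokenizer (which is narrower than "all non-alphanumeric characters"; an arrow, for example, is a symbol rather than punctuation), `vocab.txt` is a placeholder path, and the `[unused...]` renumbering is just one way to do the replacement.

```python
import unicodedata

def is_punctuation(ch):
    # Mirrors BERT's BasicTokenizer: non-alphanumeric ASCII plus all
    # Unicode P* category characters count as punctuation.
    cp = ord(ch)
    if 33 <= cp <= 47 or 58 <= cp <= 64 or 91 <= cp <= 96 or 123 <= cp <= 126:
        return True
    return unicodedata.category(ch).startswith("P")

def is_reachable(token):
    # Keep special tokens like [CLS], [MASK] or [unused12] as they are.
    if token.startswith("[") and token.endswith("]"):
        return True
    body = token[2:] if token.startswith("##") else token
    has_punct = any(is_punctuation(ch) for ch in body)
    if token.startswith("##"):
        # Punctuation becomes its own single-character chunk, and the
        # first piece of a chunk never carries ##, so a ##-piece
        # containing punctuation can never be matched.
        return not has_punct
    # A plain entry mixing punctuation with other characters cannot
    # occur either; a lone punctuation character is fine.
    return len(body) == 1 or not has_punct

with open("vocab.txt") as f:
    vocab = [line.rstrip("\n") for line in f]

# Continue numbering from the highest [unusedN] already in the vocab.
used = [int(t[7:-1]) for t in vocab
        if t.startswith("[unused") and t[7:-1].isdigit()]
next_free = max(used, default=-1) + 1

for i, token in enumerate(vocab):
    if not is_reachable(token):
        vocab[i] = f"[unused{next_free}]"
        next_free += 1

with open("vocab.fixed.txt", "w") as f:
    f.write("\n".join(vocab) + "\n")
```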