jowagner opened 3 years ago
Related: issue #44
Just to note that the bert-base-uncased vocabulary (as distributed by HuggingFace) contains 997 single-character tokens, plus the corresponding 997 ##subword tokens for those characters.
To take just a few examples, each of the following characters occurs in the vocabulary both as a plain token and as a ##subword token:

! " # $ % & ' ( ) * + , - . / etc.
##! ##" ### ##$ ##% ##& ##' ##( ##) ##* ##+ ##, ##- ##. ##/ etc.
Given BERT's tokenization, as you mentioned, it's hard to see how any of these ##subword tokens are ever used. Perhaps they're included for use with other tokenizer schemes, or it's related to BERT's original C++ code, which had "some additional complexity".
As per https://github.com/jbrry/Irish-BERT/issues/72#issuecomment-830132236, we switched to the WordPiece tokeniser. Are there still unusable entries?
I added the `vocab.txt` from the latest run with `No-filter` to `Theme A DCU/ga_BERT/BERT_Preprocessing/vocab.txt`.

EDIT: I don't seem to find the ##subword tokens for the punctuation symbols (but they are there as single items).
Sorting the new and old no-filter vocabs with `LC_ALL=C sort` and comparing them with `diff` and `kompare`, I can confirm the glue punctuation problem is gone, or at least less pronounced. There are still some cases of characters I am not sure about, e.g. `##` followed by an arrow pointing right.
Other observations:
> Sorting the new and old no-filter vocabs
Sorry, there is a small change from the older runs. Assuming you are using the `vocab.txt` from the directory with `<corpora_prefix>_filtering_None`, the `None` here corresponds to no additional OpusFilter filtering and using the document filters in the wikibert-pipeline (which corresponds to the `Document-heuristic` filter in the current runs).
I can add all of the `vocab.txt` files from the `None`, `Document-filter`, `OpusFilter-basic` and `OpusFilter-basic-char-lang` configurations if you would like to look at how some of the tokens change between runs, or compare them to older runs.
If comparing with older runs, the table below shows the equivalent configurations (bear in mind current runs contain a more recent Wikipedia dump).

Old | Current
--- | ---
NA | `None`
`None` | `Document-heuristic`
`basic+char-1.0+lang-0.8` | `basic+char-1.0+lang-0.8`
> Is there a threshold for the minimum frequency a character must have in order to be included?
The `min_frequency` value is set to `2`. The new vocab generation follows the same procedure as Turkish BERT.
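For reference, a sketch of how `min_frequency` would be passed when training a WordPiece vocab with the HuggingFace tokenizers library (placeholder file paths and settings, not the exact gaBERT command):

```python
from tokenizers import BertWordPieceTokenizer

# Sketch with placeholder paths; not the exact gaBERT configuration.
tokenizer = BertWordPieceTokenizer(lowercase=False, strip_accents=False)
tokenizer.train(
    files=["corpus.txt"],
    vocab_size=30000,
    min_frequency=2,  # drop candidate tokens seen fewer than 2 times
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.save_model(".")  # writes vocab.txt
```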
> Is this set differently for the two vocabulary builders? I see no corresponding options in SentencePiece's train flags.
The params passed to `spm_train` in the wikibert-pipeline are:
```sh
SENTENCEPIECE_PARAMS="
--vocab_size=30000
--input_sentence_size=100000000
--shuffle_input_sentence=true
--character_coverage=0.9999
--model_type=bpe
"
```
So I assume `--character_coverage` (its flag description is "character coverage to determine the minimum symbols") only keeps the given proportion of characters, filtering out the least common `1 - character_coverage` fraction of characters.
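For what it's worth, the same flags via the SentencePiece Python API (a sketch; the input path and model prefix here are placeholders):

```python
import sentencepiece as spm

# Same flags as above, via the Python API. Characters outside the
# covered 99.99% of the character distribution get no symbol of
# their own and map to <unk> at encoding time.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="ga_bpe",
    vocab_size=30000,
    input_sentence_size=100000000,
    shuffle_input_sentence=True,
    character_coverage=0.9999,
    model_type="bpe",
)
```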
> Many Irish words and subword units are no longer present in the new vocab. This presumably comes from the competition from the new entries under a fixed vocab size.
This is good to know. I also assume it's due to the greater proportion of noisy texts in the `No-filter` configuration and could likely change with the other filtering configurations. I will upload these to the aforementioned destination for posterity.
I added the vocab files from the 4 current filtering configurations, as well as the older two runs (with "SentencePiece" suffix), to `Theme A DCU/ga_BERT/BERT_Preprocessing/vocabs`. It should be interesting to analyse how the vocab entries change between filters (and across SentencePiece and WordPiece).
To be sure the right files are compared, it's probably best if you do it. Here are the commands I used:
```sh
LC_ALL=C sort < vocab.txt > gabert-new-vocab-sorted.txt
LC_ALL=C sort < conll17_gdrive_NCI_oscar_paracrawl_filtering_None/vocab.txt > vocab-no-filter-sorted.txt
diff -U 5 vocab-no-filter-sorted.txt gabert-new-vocab-sorted.txt > x.patch
kompare x.patch
```
(In text-only mode you can use `diff -y vocab-no-filter-sorted.txt gabert-new-vocab-sorted.txt | less` to get a side-by-side view, but this doesn't skip identical sections.)
There is at least one unusable vocabulary entry in our gabert vocab, namely `##-"`. Find all entries that BERT will never use, as BERT first splits around all non-alphanumeric characters without applying `##` glue, and replace them with `[unused123]`, where 123 is the next free index.
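A possible starting point, as a sketch rather than a final script: the reachability test below follows the punctuation rule of BERT's BasicTokenizer (which is narrower than "all non-alphanumeric characters"; an arrow, for example, is a symbol rather than punctuation), `vocab.txt` is a placeholder path, and the `[unused...]` renumbering is just one way to do the replacement.

```python
import unicodedata

def is_punctuation(ch):
    # Mirrors BERT's BasicTokenizer: non-alphanumeric ASCII plus all
    # Unicode P* category characters count as punctuation.
    cp = ord(ch)
    if 33 <= cp <= 47 or 58 <= cp <= 64 or 91 <= cp <= 96 or 123 <= cp <= 126:
        return True
    return unicodedata.category(ch).startswith("P")

def is_reachable(token):
    # Keep special tokens like [CLS], [MASK] or [unused12] as they are.
    if token.startswith("[") and token.endswith("]"):
        return True
    body = token[2:] if token.startswith("##") else token
    has_punct = any(is_punctuation(ch) for ch in body)
    if token.startswith("##"):
        # Punctuation becomes its own single-character chunk, and the
        # first piece of a chunk never carries ##, so a ##-piece
        # containing punctuation can never be matched.
        return not has_punct
    # A plain entry mixing punctuation with other characters cannot
    # occur either; a lone punctuation character is fine.
    return len(body) == 1 or not has_punct

with open("vocab.txt") as f:
    vocab = [line.rstrip("\n") for line in f]

# Continue numbering from the highest [unusedN] already in the vocab.
used = [int(t[7:-1]) for t in vocab
        if t.startswith("[unused") and t[7:-1].isdigit()]
next_free = max(used, default=-1) + 1

for i, token in enumerate(vocab):
    if not is_reachable(token):
        vocab[i] = f"[unused{next_free}]"
        next_free += 1

with open("vocab.fixed.txt", "w") as f:
    f.write("\n".join(vocab) + "\n")
```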