If the word `ngrian` is split into the prefix `n` and the subword unit `##grian`, the embedding table entry for `##grian` is separate from the entry for `grian`, burdening BERT with having to learn the meaning of each independently. The choice to attach the glue `##` to the token to the right of the split point seems to favour suffix-heavy languages.
It is also noteworthy that this attachment of `##` glue has an effect on vocabulary size: `[X, Y, Xn, Yn]` requires only the 3 entries `[X, Y, ##n]`, while `[X, Y, nX, nY]` cannot be covered with just 3 entries.
If suffixes are less frequent than prefixes in Irish, a good test would be to reverse the order of letters in all words in all BERT input (pre-training data, parser training data and parser test data) and see whether LAS goes up. This can be done externally, treating BERT as a black box, or by extending the BERT tokeniser to reverse each word as its first step.
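A minimal sketch of the black-box variant, assuming one-sentence-per-line, whitespace-tokenised text (the filter itself is illustrative, not an existing tool):

```python
import sys

def reverse_words(line):
    # Reverse the letters inside each whitespace-separated token,
    # e.g. "ngrian" -> "nairgn", turning Irish prefixes into suffixes
    # that can then share WordPiece ## entries.
    return " ".join(token[::-1] for token in line.split())

# Apply the same filter to every BERT input file: pre-training data,
# parser training data and parser test data must all be reversed
# consistently.
for line in sys.stdin:
    print(reverse_words(line.rstrip("\n")))
```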
If both prefixes and suffixes are frequent in Irish, it would be interesting to see what happens if no `##` glue is inserted at all. This requires modifications to the BERT implementation we use. An approximation could be implemented externally (BERT as a black box) as follows; a Python sketch of the steps is given after the list:
1. Create a BERT vocabulary as usual and keep a backup copy of the full vocabulary.
2. Make a backup copy of all single-character suffixes, i.e. entries `##X` where `X` is a single character.
3. Remove the first 2 characters from all entries that start with `##`.
4. Remove duplicate entries.
5. Append the backup copy of single-character suffixes (this ensures the `vocab.txt` file can be used with arbitrary input and passes any sanity checks the BERT tokeniser may be performing).
6. Pick a rare non-alphanumeric character with little meaning of its own from the vocabulary as a special character to be used in the next step (`_` in the example below).
7. Pre-process all input text, inserting the special character selected above between any two alphanumeric characters where the BERT tokeniser would split a word given the original `vocab.txt` file from step 1.
8. Train and use BERT with the `vocab.txt` file from step 5 and the pre-processed text files from step 7.
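A sketch of steps 1–5 (the vocabulary transformation), assuming a standard one-entry-per-line `vocab.txt`; the function name and the output file name `vocab.noglue.txt` are ours:

```python
def transform_vocab(entries):
    # Step 2: back up all single-character suffix entries, e.g. "##n".
    single_char_suffixes = [e for e in entries if e.startswith("##") and len(e) == 3]
    # Step 3: strip the "##" glue from every suffix entry.
    stripped = [e[2:] if e.startswith("##") else e for e in entries]
    # Step 4: remove duplicates (e.g. "##grian" and "grian" collapse),
    # keeping the original order.
    seen, deduped = set(), []
    for entry in stripped:
        if entry not in seen:
            seen.add(entry)
            deduped.append(entry)
    # Step 5: re-append the single-character suffixes so arbitrary
    # input can still be tokenised.
    return deduped + single_char_suffixes

# Step 1: read the original vocabulary, one entry per line.
with open("vocab.txt", encoding="utf-8") as f:
    original = [line.rstrip("\n") for line in f]
# Write the transformed vocabulary used for training in step 8.
with open("vocab.noglue.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(transform_vocab(original)) + "\n")
```

Step 7 needs the split points of the original tokeniser. For a sketch, a minimal greedy longest-match-first WordPiece (the core of BERT's subword step, without the `[UNK]` and maximum-word-length handling of the real implementation) is enough; `mark_splits` is our name:

```python
def wordpiece(word, vocab):
    # Greedy longest-match-first WordPiece over a single word.
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        if end == start:
            return None          # word cannot be tokenised
        start = end
    return pieces

def mark_splits(word, vocab, special="_"):
    # Steps 6-7: rewrite one word, inserting the special character at
    # every split point that falls between two alphanumeric characters,
    # e.g. "ngrian" -> "n_grian" given the original vocab.txt.
    pieces = wordpiece(word, vocab)
    if pieces is None:
        return word
    texts = [p[2:] if p.startswith("##") else p for p in pieces]
    out = [texts[0]]
    for prev, nxt in zip(texts, texts[1:]):
        if prev[-1].isalnum() and nxt[0].isalnum():
            out.append(special)
        out.append(nxt)
    return "".join(out)
```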
Because every split point is now marked with a non-alphanumeric character, BERT will never use the `##` vocab entries, e.g. we input `n_grian` for `ngrian` and BERT produces the 3 subword units `n`, `_` and `grian`. The extra subword unit `_` is why this approach is only an approximation of the sequence we want: `[n, grian]`.
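To see why `_` shows up as its own unit: the reference BERT tokeniser treats `_` as punctuation and splits it into a separate token before WordPiece runs, so the `##` entries are never reached. Reusing the `wordpiece` sketch above with toy vocabularies (illustrative, not real `vocab.txt` contents):

```python
# BERT's basic tokeniser splits "_" (punctuation) into its own token:
def basic_split(word, special="_"):
    return word.replace(special, " %s " % special).split()

old_vocab = {"n", "grian", "##grian"}
new_vocab = {"n", "grian", "_", "##n"}   # after the transformation above

print(wordpiece("ngrian", old_vocab))
# -> ['n', '##grian']       the original split with ## glue

print([p for tok in basic_split("n_grian") for p in wordpiece(tok, new_vocab)])
# -> ['n', '_', 'grian']    no ## entries used, but an extra '_' unit
```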
Related: issue #63