jbrry / Irish-BERT

Repository to store helper scripts for creating an Irish BERT model.

Investigate effect of ## glue on prefixes #80

Open jowagner opened 3 years ago

jowagner commented 3 years ago

If the word ngrian is split into the prefix n and the subword unit ##grian, the embedding table entry for ##grian is separate from the entry for grian, burdening BERT with having to learn the meaning of each separately. The choice to attach the ## glue to the token to the right of the split point seems to favour suffix-heavy languages.

It is also noteworthy that this attachment of ## glue has an effect on vocabulary size: the four word types [X, Y, Xn, Yn] can be covered with just 3 entries [X, Y, ##n], while [X, Y, nX, nY] cannot be covered with only 3 entries.
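
A minimal sketch of greedy longest-match WordPiece-style segmentation (not the exact implementation in the BERT codebase) illustrating why the suffix-attached forms fit into 3 entries while the prefix-attached forms do not:

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first segmentation in the style of BERT's WordPiece."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # continuation pieces carry the ## glue
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]                   # no entry covers this position
        pieces.append(piece)
        start = end
    return pieces

vocab = {"X", "Y", "##n"}
print(wordpiece_tokenize("Xn", vocab))   # ['X', '##n']
print(wordpiece_tokenize("Yn", vocab))   # ['Y', '##n']
# The prefix-attached forms cannot reuse the same 3 entries:
print(wordpiece_tokenize("nX", vocab))   # ['[UNK]'] -- neither 'n' nor '##X' is in the vocab
```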

If suffixes are less frequent than prefixes in Irish, a good test would be to reverse the order of letters in every word in all BERT input (pre-training data, parser training data and parser test data) and see whether LAS goes up. This can be done externally, treating BERT as a black box, or by extending the BERT tokeniser to reverse each word as its first step.
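
The black-box variant could be a simple pre-processing pass over the text. A minimal sketch; the regex-based definition of "word" below is an assumption and would need to match whatever tokenisation the rest of the pipeline uses:

```python
import re

def reverse_words(text):
    """Reverse the characters of every alphabetic word, leaving all other text untouched."""
    return re.sub(r"[^\W\d_]+", lambda m: m.group(0)[::-1], text)

print(reverse_words("Tá an ghrian ag taitneamh"))
# 'áT na nairhg ga hmaentiat'
```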

If both prefixes and suffixes are frequent in Irish, it would be interesting to see what happens if no ## glue is inserted at all. This requires modifications to the BERT implementation we use. An approximation could be implemented externally (BERT as a black box) as follows (a code sketch follows the list):

  1. Create a BERT vocabulary as usual and keep a backup copy of the full vocabulary
  2. Make a backup copy of all single character suffixes, i.e. entries ##X where X is a single character
  3. Remove the first 2 characters from all entries that start with ##
  4. Remove duplicate entries
  5. Append the backup copy of single character suffixes (this is to ensure the vocab.txt file can be used with arbitrary input and to pass sanity checks the BERT tokeniser may be performing)
  6. Pick a rare non-alphanumeric character with little meaning of its own from the vocabulary as a special character to be used in the next step, _ in the example below
  7. Pre-process all input text, inserting the special character selected above between any two alphanumeric characters where the BERT tokeniser would split a word given the original vocab.txt file from step 1
  8. Train and use BERT with the vocab.txt file from step 5 and the pre-processed text files from step 7
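
A minimal sketch of steps 2 to 5 and step 7, assuming vocab.txt is a plain list of entries, one per line, and using a toy greedy WordPiece segmenter in place of the real BERT tokeniser; the vocabulary and the choice of _ below are illustrative only:

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first segmentation in the style of BERT's WordPiece."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces


def strip_glue_from_vocab(entries):
    """Steps 2-5: turn a list of vocab.txt entries into a glue-free vocabulary."""
    single_char_suffixes = [e for e in entries
                            if e.startswith("##") and len(e) == 3]      # step 2
    stripped = [e[2:] if e.startswith("##") else e for e in entries]    # step 3
    seen, deduped = set(), []
    for e in stripped:                                                   # step 4
        if e not in seen:
            seen.add(e)
            deduped.append(e)
    deduped.extend(single_char_suffixes)                                 # step 5
    return deduped


def mark_split_points(word, original_vocab, special="_"):
    """Step 7: insert the special character at every word-internal split point
    that WordPiece with the original vocabulary (step 1) would produce."""
    pieces = wordpiece_tokenize(word, original_vocab)
    if pieces == ["[UNK]"]:
        return word
    return special.join(p[2:] if p.startswith("##") else p for p in pieces)


# Hypothetical toy vocabulary for illustration only.
original_vocab = ["n", "grian", "##grian", "##n", "_"]
print(strip_glue_from_vocab(original_vocab))             # ['n', 'grian', '_', '##n']
print(mark_split_points("ngrian", set(original_vocab)))  # 'n_grian' (input for step 8)
```

Note that _ falls in the ASCII range that BERT's BasicTokenizer treats as punctuation, so it should be split off before WordPiece runs, which is consistent with the n, _, grian example below.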

Because all split points are now marked with a non-alphanumeric character, BERT will never use the ## vocabulary entries, e.g. for ngrian we input n_grian and BERT produces the 3 subword units n, _ and grian. The extra subword unit _ is why this approach is only an approximation of the sequence we want: [n, grian].

Related: issue #63