Populate unused vocabulary entries of our mBERT-based models

Issue #33 points out that there are 99 unused entries in the mBERT vocabulary intended for users to add task-specific vocabulary entries for fine-tuning. We could use the entries to improve the vocabulary's coverage of Irish without having to train from scratch. However, to not put stones in the way of users of our models who want to use unused entries for their own tasks, we should not use all 99 entries.

A way to choose the entries to add would be to induce new vocabularies for a clean Irish corpus, reducing the size until the number of new entries, i.e. entries that are not in the mBERT vocabulary, is less than or equal to the number of entries we want to add, say 49.

jbrry / Irish-BERT

Populate unused vocabulary entries of our mBERT-based models #42