jbrry / Irish-BERT

Repository to store helper scripts for creating an Irish BERT model.

Concatenate output of different tokenisers #50

Open jowagner opened 3 years ago

jowagner commented 3 years ago

Rather than having to tell users of our BERT models which tokeniser to use, it would be nice to be robust to the choice of tokeniser. Robustness is likely to improve if we combine data obtained with different tokenisers, preferably the most popular ones.

To some extent we are doing this already:

jowagner commented 3 years ago

Given that the BERT-internal tokeniser splits at every non-alphanumeric character, BERT is inherently robust to different tokenisations. Only differences that change the sequence of letters and digits can matter, e.g. splitting "don't" after "do" for English, which yields "do" and "n't". (The UD treebank "en_partut" further replaces the apostrophe with "o" to produce "not".)
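To make the argument concrete, here is a minimal sketch that compares two tokenisations after punctuation splitting. The regex is only an approximation of BERT's basic tokeniser (no lowercasing, accent stripping or CJK handling), and the names `basic_pieces` and `same_for_bert` are made up for illustration.

```python
import re

# Approximation (assumption): runs of letters/digits stay together,
# every other non-space character becomes a piece of its own.
_PIECE = re.compile(r"[^\W_]+|[^\w\s]|_")


def basic_pieces(tokens):
    """Split each externally produced token at non-alphanumeric characters."""
    pieces = []
    for token in tokens:
        pieces.extend(_PIECE.findall(token))
    return pieces


def same_for_bert(tokens_a, tokens_b):
    """True if both tokenisations give the same pieces after splitting."""
    return basic_pieces(tokens_a) == basic_pieces(tokens_b)


# Punctuation attached or detached: no difference after splitting.
print(same_for_bert(["Hello,", "world!"], ["Hello", ",", "world", "!"]))
# Splitting "don't" after "do" changes the letter sequences.
print(same_for_bert(["don't"], ["do", "n't"]))
# Replacing the apostrophe with "o" changes them as well.
print(same_for_bert(["do", "n't"], ["do", "not"]))
```

Under this approximation only the first pair is equivalent: "don't" becomes "don", "'", "t", whereas "do" + "n't" becomes "do", "n", "'", "t".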

Checking the first 600 IDT sentences for SpaceAfter=No between letters and for %d-%d token IDs, which would indicate replacements, suggests that UD-style tokenisation for Irish does not modify alphanumeric character sequences. Are there any other popular tokenisers for Irish? Do any of them change the sequence of letters?
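A check of this kind could be scripted roughly as follows. This is a sketch, not the script actually used: it assumes standard 10-column CoNLL-U input, and the function name `find_alnum_changes` and the sample sentence are hypothetical.

```python
import re

RANGE_ID = re.compile(r"^\d+-\d+$")  # multiword token IDs such as "1-2"


def find_alnum_changes(conllu_text):
    """Report lines where tokenisation may alter letter/digit sequences.

    Returns (line_number, kind, text) tuples, where kind is "multiword"
    for an ID range or "glued" for SpaceAfter=No between two
    alphanumeric characters.
    """
    findings = []
    prev = None  # (form, misc) of the previous word in this sentence
    for line_no, line in enumerate(conllu_text.splitlines(), start=1):
        if not line.strip():
            prev = None  # a blank line ends the sentence
            continue
        if line.startswith("#"):
            continue  # sentence-level comment
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        tok_id, form, misc = cols[0], cols[1], cols[9]
        if RANGE_ID.match(tok_id):
            findings.append((line_no, "multiword", form))
            continue
        if "." in tok_id:
            continue  # empty node, not part of the surface text
        if prev is not None:
            prev_form, prev_misc = prev
            if ("SpaceAfter=No" in prev_misc.split("|")
                    and prev_form[-1:].isalnum() and form[:1].isalnum()):
                findings.append((line_no, "glued", prev_form + form))
        prev = (form, misc)
    return findings


# Hypothetical English sample with one multiword token.
sample = "\n".join([
    "# text = don't go.",
    "1-2\tdon't\t_\t_\t_\t_\t_\t_\t_\t_",
    "1\tdo\tdo\tAUX\t_\t_\t3\taux\t_\t_",
    "2\tnot\tnot\tPART\t_\t_\t3\tadvmod\t_\t_",
    "3\tgo\tgo\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No",
    "4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_",
    "",
])
print(find_alnum_changes(sample))
```

An empty result for a treebank would support the observation above; any "multiword" or "glued" entries point to sentences worth inspecting by hand.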