Concatenate output of different tokenisers

jbrry / Irish-BERT

Repository to store helper scripts for creating an Irish BERT model.


Given that BERT's internal tokeniser splits at any non-alphanumeric character, BERT is inherently robust to differences between external tokenisers. Only differences that change the sequence of letters and digits can matter, e.g. splitting "don't" after "do" for English, which yields "do" and "n't". (The UD treebank "en_partut" goes further and replaces the apostrophe with "o" to produce "not".)
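To make the argument concrete, here is a minimal sketch of such a pre-tokenisation step. It is not BERT's actual code: `pre_tokenise` is a hypothetical helper that only approximates the behaviour described above by keeping runs of letters and digits together and emitting every other non-space character as its own token.

```python
import re

# Hypothetical stand-in for BERT's pre-tokenisation (an assumption, not the
# real implementation): runs of letters/digits stay together, every other
# non-whitespace character becomes a separate token.
TOKEN = re.compile(r"[^\W_]+|[^\w\s]|_")


def pre_tokenise(text):
    """Return the list of alphanumeric runs and single punctuation tokens."""
    return TOKEN.findall(text)


# Whitespace inserted by an external tokeniser around punctuation is invisible.
same_a = pre_tokenise("Hello, world!")
same_b = pre_tokenise("Hello , world !")

# Splitting inside a letter sequence is not invisible.
diff_a = pre_tokenise("don't")
diff_b = pre_tokenise("do n't")

print(same_a == same_b)
print(diff_a, diff_b)
```

Under this approximation, two tokenisations give BERT the same input exactly when their `pre_tokenise` outputs are equal, which is the condition the Irish tokenisers would need to be checked against.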

Judging by a check of the first 600 IDT sentences for SpaceAfter=No between letters (which would indicate a split inside a letter sequence) and for %d-%d token IDs (which would indicate replacements), UD-style tokenisation for Irish does not seem to modify alphanumeric character sequences. Are there any other popular tokenisers for Irish? Do any of them change the sequence of letters?
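The check described above could be scripted roughly as follows. This is a sketch, not the script that was actually used: `scan_conllu` and `row` are hypothetical helpers, the sample rows are made up for illustration rather than taken from IDT, and only the ID, FORM and MISC columns of the CoNLL-U format are inspected.

```python
import re

RANGE_ID = re.compile(r"^\d+-\d+$")  # multiword token IDs such as "1-2"


def scan_conllu(text):
    """Return (range_tokens, letter_joins) found in CoNLL-U formatted text."""
    range_tokens, letter_joins = [], []
    prev = None  # (form, misc) of the previous plain-ID token in the sentence
    for line in text.splitlines():
        if not line.strip():
            prev = None  # a blank line ends the sentence
            continue
        if line.startswith("#"):
            continue  # sentence-level comment
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        tok_id, form, misc = cols[0], cols[1], cols[9]
        if RANGE_ID.match(tok_id):
            range_tokens.append((tok_id, form))
            continue
        if not tok_id.isdigit():
            continue  # skip empty nodes such as "1.1"
        if prev is not None:
            prev_form, prev_misc = prev
            if ("SpaceAfter=No" in prev_misc.split("|")
                    and prev_form[-1].isalnum() and form[0].isalnum()):
                letter_joins.append((prev_form, form))
        prev = (form, misc)
    return range_tokens, letter_joins


def row(tok_id, form, misc="_"):
    """Build one 10-column CoNLL-U line with only ID, FORM and MISC filled."""
    return "\t".join([tok_id, form] + ["_"] * 7 + [misc])


# Made-up sample: one harmless Irish sentence and two English problem cases.
sample = "\n".join([
    row("1", "d'", "SpaceAfter=No"), row("2", "ól"),
    row("3", "sé", "SpaceAfter=No"), row("4", "."),
    "",
    row("1-2", "don't"), row("1", "do"), row("2", "not"),
    "",
    row("1", "do", "SpaceAfter=No"), row("2", "n't"),
])

print(scan_conllu(sample))
```

Two empty lists for a treebank file would support the claim that its tokenisation leaves letter and digit sequences untouched; any entry is a case worth inspecting by hand.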


Concatenate output of different tokenisers #50