jowagner opened 3 years ago
Given that the BERT-internal tokeniser splits off any non-alpha-numeric characters, BERT is inherently robust to different tokenisations. Only differences that change the sequence of letters and numbers can matter, e.g. separating `n't` after `do` for English. (The UD treebank `en_partut` further replaces the apostrophe with `o` to produce `not`.)
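To illustrate the point above, here is a minimal sketch of why punctuation-splitting alone cannot change what BERT sees. The `basic_split` function is a hypothetical simplification of BERT's basic tokenisation (the real `BasicTokenizer` also handles lower-casing, accents and CJK characters):

```python
import re

def basic_split(text):
    """Approximate BERT basic tokenisation: split on whitespace and
    split off every non-alphanumeric character as its own token.
    (Hypothetical simplification for illustration only.)"""
    return re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9\s]", text)

def alnum_sequence(tokens):
    """The letter/number sequence that survives tokenisation."""
    return "".join(re.sub(r"[^A-Za-z0-9]", "", t) for t in tokens)

# Two different pre-tokenisations of the same text...
a = basic_split("don't")   # ['don', "'", 't']
b = basic_split("do n't")  # ['do', 'n', "'", 't']
# ...yield the same alpha-numeric sequence, so BERT sees no difference:
assert alnum_sequence(a) == alnum_sequence(b) == "dont"
# But an en_partut-style rewrite to "do not" changes the letters, so it matters:
assert alnum_sequence(basic_split("do not")) == "donot"
```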
Checking the first 600 IDT sentences for `SpaceAfter=No` between letters and for `%d-%d` token IDs that would indicate replacements, UD-style tokenisation for Irish does not seem to modify alpha-numeric character sequences. Are there any other popular tokenisers for Irish? Does any of them change the sequence of letters?
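The check described above can be sketched as a small CoNLL-U scan. `find_tokenisation_changes` is a hypothetical helper (not the script actually used) that flags the two signals mentioned: `SpaceAfter=No` between two alpha-numeric characters, and multi-word token range IDs like `3-4`:

```python
import io

def find_tokenisation_changes(conllu_text, max_sentences=600):
    """Scan CoNLL-U text for (a) SpaceAfter=No between a token ending
    and a token starting with a letter/digit and (b) multi-word token
    range IDs (e.g. '3-4') that indicate the tokeniser rewrote the
    surface string. Hypothetical sketch of the manual check."""
    hits = []
    sentences_seen = 0
    prev_form, prev_no_space = None, False
    for line in io.StringIO(conllu_text):
        line = line.rstrip("\n")
        if not line:                      # blank line ends a sentence
            sentences_seen += 1
            if sentences_seen >= max_sentences:
                break
            prev_form, prev_no_space = None, False
            continue
        if line.startswith("#"):          # comment lines
            continue
        cols = line.split("\t")
        tok_id, form, misc = cols[0], cols[1], cols[9]
        if "-" in tok_id:                 # range ID: multi-word token
            hits.append(("range", tok_id, form))
            continue
        if prev_no_space and prev_form[-1].isalnum() and form[0].isalnum():
            hits.append(("no-space", prev_form, form))
        prev_form = form
        prev_no_space = "SpaceAfter=No" in misc
    return hits

sample = (
    "# text = don't\n"
    "1-2\tdon't\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "1\tdo\tdo\tVERB\t_\t_\t_\t_\t_\tSpaceAfter=No\n"
    "2\tn't\tnot\tPART\t_\t_\t_\t_\t_\t_\n"
    "\n"
)
print(find_tokenisation_changes(sample))
```

An empty result on a treebank sample would support the observation that the tokenisation never touches alpha-numeric character sequences.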
Rather than having to tell users of our BERT models what tokeniser to use it would be nice to be robust to the choice of tokenisers. Robustness is likely to improve by combining data obtained with different tokenisers, preferably the most popular ones.
To some extent we are doing this already: