jbrry / Irish-BERT

Repository to store helper scripts for creating an Irish BERT model.

Inspect how BERT tokenization affects tokens which are composed of characters and punctuation #51

Closed: jowagner closed this issue 3 years ago

jowagner commented 3 years ago

Since inconsistencies in tokenisation are hard to avoid when working with corpora from different sources, it may help the final model to force tokens like "etc." to be split into two word pieces before we train BERT: remove a vocabulary entry X+PUNCT if X is already in the vocabulary, and replace X+PUNCT with X if it is not. This would help in particular if the user's tokeniser splits more aggressively than our tokenisers.
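
A minimal sketch of that post-processing step, assuming the vocabulary is a plain one-entry-per-line vocab.txt; the file names and the restriction to trailing sentence punctuation are illustrative assumptions:

# Sketch only: rewrite a WordPiece vocab so that X+PUNCT entries such as "etc." disappear.
PUNCT = '.!?,;:'

with open('vocab.txt', encoding='utf-8') as f:
    vocab = [line.rstrip('\n') for line in f]

existing = set(vocab)
new_vocab = []
seen = set()
for entry in vocab:
    core = entry.rstrip(PUNCT)
    if core != entry and any(c.isalnum() for c in core):
        # X+PUNCT entry: drop it if X is already in the vocabulary,
        # otherwise keep X in its place
        entry = None if core in existing else core
    if entry is not None and entry not in seen:
        seen.add(entry)
        new_vocab.append(entry)

with open('vocab.filtered.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(new_vocab) + '\n')

Note that removed entries shrink the vocabulary, so this would have to happen before the model's vocab_size is fixed.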

jowagner commented 3 years ago

Our recent from-scratch gaBERT model does not pick any X+PUNCT tokens for the vocabulary:

$ grep -E '[.?!]' vocab.txt 
.
?
!
..
...
##.
##?
##!

Neither does mBERT:

$ grep -E '[.?!]' multi_cased_L-12_H-768_A-12/vocab.txt 
!
.
?
...
##!
##.
##?

It is surprising that frequent abbreviations do not make it into the vocabulary.

jbrry commented 3 years ago

Do these abbreviations occur in the X+PUNCT format in the raw text files before the vocabulary is built from them?

See pretraining data from the latest run here: https://drive.google.com/drive/u/0/folders/1O9XEI4osV9Zr8MzheujLebRASqaZIaoK

After untarring the archive above, you should be able to see the tokenized sentences in the following directory:

cd conll17_gdrive_NCI_oscar_paracrawl_filtering_None/ga/tokenized-texts

jowagner commented 3 years ago

All tokenised corpora contain plenty of X+PUNCT tokens:

$ for P in c g N o p w ; do echo == $P == ; grep -h -E -o " [A-Za-z][A-Za-z.]*[.] " ${P}* | \
LC_ALL=C sort | uniq -c | sort -n | tail -n 5 ; done
== c ==
    741  F.C. 
    755  J. 
   2207  Co. 
   3047  b. 
   4590  r. 
== g ==
    654  etc. 
    778  T.D. 
   1221  I.R. 
   1766  lch. 
   2424  Uimh. 
== N ==
   1496  e.g. 
   1626  i.e. 
   3181  Co. 
   3761  lch. 
   5113  Uimh. 
== o ==
    161  c.s. 
    178  Uimh. 
    300  srl. 
    355  D. 
    897  Co. 
== p ==
   2422  lch. 
   2976  etc. 
   3621  Co. 
  10831  Lch. 
  17228  Uimh. 
== w ==
    244  F.C. 
    270  p. 
    359  J. 
    587  ll. 
   1328  lch. 

Either these frequencies are not high enough for inclusion in the vocabulary, or the tools that build the vocabulary carry out additional tokenisation. While trying to get BERT vectors for pre-tokenised text for dependency parsing, I notice that huggingface's BERT interface splits at all whitespace and all punctuation characters, and there is no option to switch this off:

from transformers import AutoTokenizer
tokeniser = AutoTokenizer.from_pretrained('bert-base-uncased')
example_batch = [
    'hello world !'.split(),
    """tokenisation 's trouble""".split(),
]
tokenised_text = tokeniser(
    example_batch,
    # pre-tokenised input
    is_split_into_words = True,   # TODO: this doesn't seem to do what we expect it to do
)
print('converted back:')
for i, token_ids in enumerate(tokenised_text['input_ids']):
    print(i, tokeniser.convert_ids_to_tokens(token_ids))

producing

converted back:
0 ['[CLS]', 'hello', 'world', '!', '[SEP]']
1 ['[CLS]', 'token', '##isation', "'", 's', 'trouble', '[SEP]']

The apostrophe-s should not be split without "##" glue. Some tokenisation is going on before the tokens are split into word pieces. Looking at the source code, it seems that BERT always splits at whitespace and at non-alpha-numeric characters, called "punctuation" in the code. (For the purpose of getting the vector of the first wordpiece of each of my tokens, I can hack around this by calling BERT's tokeniser for each of my tokens separately and tracing what comes out. I also wonder what BERT-based sequence taggers and dependency parsers do.)
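
A rough sketch of that per-token workaround, assuming a fast transformers tokeniser; the example sentence is made up:

from transformers import AutoTokenizer

tokeniser = AutoTokenizer.from_pretrained('bert-base-uncased')

# Workaround sketch: tokenise each pre-tokenised token separately so that we
# know exactly which word pieces (and hence which output vectors) belong to it.
sentence = "tokenisation 's trouble".split()
pieces_per_token = [tokeniser.tokenize(token) for token in sentence]
for token, pieces in zip(sentence, pieces_per_token):
    print(token, '->', pieces)
# The position of the first piece of each token can then be used to pick that
# token's vector from the model output (remembering the [CLS] offset).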

If this is what BERT does, it only makes sense that the tools for building a BERT vocabulary do the same, i.e. aggressively split any non-alpha-numeric characters from sequences of alpha-numeric characters, before collecting character-sequence statistics and deciding which sequences to include in the vocabulary.
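
The huggingface tokenizers library exposes this splitting step as a pre-tokenizer; a quick check (a sketch, not part of our pipeline) shows that BERT-style pre-tokenisation separates every punctuation character before any statistics are collected:

from tokenizers.pre_tokenizers import BertPreTokenizer

pre_tokenizer = BertPreTokenizer()
# Splits on whitespace and on every punctuation character, so an
# abbreviation such as "e.g." can never be counted as a single unit.
print(pre_tokenizer.pre_tokenize_str('Another common abbrev. is e.g. etc.'))
# each full stop comes out as its own (piece, offsets) pair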

If all major BERT implementations split at non-alpha-numeric characters, there is no point in trying to fix this, and any improvements to our tokenisation are also a waste of time: all non-trivial tokenisation issues involve non-alpha-numeric characters, and whatever we do, BERT will change it.

jbrry commented 3 years ago

Yes, it seems there is some additional tokenization procedure going on, apart from running a typical tokenizer (UDPipe, SpaCy, Moses, etc.):

BERT's Tokenizer
WikiBERT's berttokenizer

For sequence taggers and dependency parsers, which require one input per word, there are tools that create offsets between the tokenized input and the word pieces, mapping each index in the tokenized sentence to its wordpieces. Those wordpieces can then be averaged for each token, or only the first one can be taken.

The implementation in AllenNLP can be found here: https://github.com/allenai/allennlp/blob/d0a07fb32811c649185ee99d71373cc7cab8791e/allennlp/data/token_indexers/pretrained_transformer_mismatched_indexer.py
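
A minimal sketch of that mapping, using a fast huggingface tokeniser's word_ids() rather than AllenNLP's indexer (the pooling step, mean over pieces here, could equally be "take the first piece"):

import torch
from transformers import AutoModel, AutoTokenizer

tokeniser = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')

words = "tokenisation 's trouble".split()
encoding = tokeniser(words, is_split_into_words=True, return_tensors='pt')
with torch.no_grad():
    hidden = model(**encoding).last_hidden_state[0]   # (num_pieces, dim)

# word_ids() maps every word piece back to the index of the input word
# (None for [CLS] and [SEP]); pool the pieces of each word, e.g. by averaging.
word_vectors = []
for word_index in range(len(words)):
    positions = [i for i, w in enumerate(encoding.word_ids()) if w == word_index]
    word_vectors.append(hidden[positions].mean(dim=0))

print(len(word_vectors), word_vectors[0].shape)   # one vector per input word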

jowagner commented 3 years ago

It looks like each implementation is independent but does the same thing, e.g. google's _run_split_on_punc() calls a _is_punctuation() function that uses the same definition of punctuation as huggingface. Hopefully, AllenNLP does the same. If not, there may be a small performance degradation when using our AllenNLP-trained gaBERT model with huggingface or google libraries.
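
For reference, the shared definition (paraphrased here; treat the details as approximate) counts a character as punctuation if it falls into one of the non-alphanumeric ASCII ranges or has a Unicode category starting with P:

import unicodedata

def is_punctuation(char):
    # roughly the check used by google's and huggingface's BERT tokenisers
    cp = ord(char)
    # non-alphanumeric ASCII is treated as punctuation even when its
    # Unicode category is not P* (e.g. '$', '+', '~')
    if (33 <= cp <= 47) or (58 <= cp <= 64) or (91 <= cp <= 96) or (123 <= cp <= 126):
        return True
    return unicodedata.category(char).startswith('P')

print([c for c in "abbrev. e.g. co-op's" if is_punctuation(c)])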

As to reconstructing the original tokenisation, the code you linked shows that each word piece has a text_id attribute that tells users what input token it comes from. To check how it is assigned, we'd have to check the implementation of intra_word_tokenize(). In huggingface's run_ner.py example, I see transformers has the info in tokenised_text.word_ids(batch_index = i). I added it to my test code above and it looks good.
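
That addition might look like the following, printing the word index of each word piece alongside the converted tokens from the example above (it requires a fast tokeniser):

print('word ids:')
for i, token_ids in enumerate(tokenised_text['input_ids']):
    print(i, tokeniser.convert_ids_to_tokens(token_ids))
    print(i, tokenised_text.word_ids(batch_index = i))
# e.g. [None, 0, 0, 1, 1, 2, None] for "tokenisation 's trouble":
# the "'" and "s" pieces both map back to input token 1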

The only open question for this issue is then whether the vocabulary builder also applies this aggressive tokenisation. A simple test would be to run it on a toy dataset that contains an X+PUNCT token and see whether it shows up in the vocabulary, setting the vocabulary size to the number of characters in the input so that it cannot be fully used. Can you do this?

jbrry commented 3 years ago

The only open question for this issue is then whether the vocabulary builder also applies this aggressive tokenisation. A simple test would be to run it on a toy dataset that contains an X+PUNCT token and see whether it shows up in the vocabulary, setting the vocabulary size to the number of characters in the input so that it cannot be fully used. Can you do this?

Good idea. Yes, I can do this, but I probably can't take a look at it until tomorrow or Thursday. I'm working on non-gaBERT stuff for most of today, then I have to launch the OpusFilter run and try a RoBERTa run tomorrow, but it shouldn't take me long to implement once I get a chance to work on it.

jowagner commented 3 years ago

For huggingface's BPE tokeniser, I can confirm the answer is that no X+PUNCT tokens make it into the vocabulary:

# closely following
# https://huggingface.co/docs/tokenizers/python/latest/quicktour.html

name = 'tiny'

from tokenizers import Tokenizer
from tokenizers.models import BPE

tokenizer = Tokenizer(BPE(unk_token='[UNK]'))

from tokenizers.trainers import BpeTrainer

trainer = BpeTrainer(
    special_tokens = '[UNK] [CLS] [SEP] [PAD] [MASK]'.split(),
)

from tokenizers.pre_tokenizers import Whitespace

tokenizer.pre_tokenizer = Whitespace()

files = [f'{name}-{split}.raw' for split in 'test train valid'.split()]

tokenizer.train(files, trainer)

tokenizer.save(f'tokeniser-{name}.json')

Corpus:

$ head tiny-*
==> tiny-test.raw <==
Confidential text in the test data .
This must not appear in the vocabulary.
123 456 secret text .

==> tiny-train.raw <==
Tokenisation 's trouble .
Another common abbrev. is e.g. that is used to start a list of examples .
There are many colours , e.g. red , green and blue .

==> tiny-valid.raw <==
This text is only in the validation data .
It should not appear in the vocabulary .

Entries containing full-stop:

$ grep -o -E '["][a-zA-Z.][a-zA-Z.]*["]' tokeniser-tiny.json | fgrep .
"."

Regarding a clean experimental setup, though, I find it a bit worrying that the validation and test data also inform the vocabulary:

$ grep -o -E '["][a-zA-Z.][a-zA-Z.]*["]' tokeniser-tiny.json | grep -E "(secret|validation)"
"validation"
"secret"
jbrry commented 3 years ago

I ran the pipeline with a sample of abbreviated tokens (punct.txt). UDPipe was splitting them, so I pasted them in again as unitary tokens as well.

I then set the vocab size to 30, which is around the number of distinct characters in punct.txt.

I then increased the vocab size to 78 so that larger pieces would be included. I've attached the WordPiece vocab files from those runs below. In both, there are no vocabulary items combining characters and punctuation.
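
For reference, a run like that could be reproduced with the huggingface tokenizers WordPiece trainer; this is only a sketch, not the actual pipeline, and punct.txt stands for the sample file described above:

from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import BertPreTokenizer
from tokenizers.trainers import WordPieceTrainer

tokenizer = Tokenizer(WordPiece(unk_token='[UNK]'))
tokenizer.pre_tokenizer = BertPreTokenizer()

trainer = WordPieceTrainer(
    vocab_size = 30,   # roughly the character vocabulary, leaving little room for merges
    special_tokens = '[UNK] [CLS] [SEP] [PAD] [MASK]'.split(),
)
tokenizer.train(['punct.txt'], trainer)

# any vocabulary item that combines letters and a full stop would show up here
print([t for t in tokenizer.get_vocab() if '.' in t and any(c.isalpha() for c in t)])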

jowagner commented 3 years ago

Thanks. Can you run this again with just the second half? Depending on how exactly the decision to include the longer units is made, the presence of the split versions may suppress the longer ones.

jbrry commented 3 years ago

Sure, with just the second half: