Our recent from-scratch gaBERT model does not pick any X+PUNCT tokens for the vocabulary:
$ grep -E '[.?!]' vocab.txt
.
?
!
..
...
##.
##?
##!
Neither does mBERT:
$ grep -E '[.?!]' multi_cased_L-12_H-768_A-12/vocab.txt
!
.
?
...
##!
##.
##?
It is surprising that frequent abbreviations do not make it into the vocabulary.
Do these abbreviations occur in the X+PUNCT format in the raw text files before they are converted to the vocab?
See pretraining data from the latest run here: https://drive.google.com/drive/u/0/folders/1O9XEI4osV9Zr8MzheujLebRASqaZIaoK
After untarring the above file, you should be able to see the tokenized sentences in the below directory:
cd conll17_gdrive_NCI_oscar_paracrawl_filtering_None/ga/tokenized-texts
All tokenised corpora contain plenty of X+PUNCT tokens:
$ for P in c g N o p w ; do echo == $P == ; grep -h -E -o " [A-Za-z][A-Za-z.]*[.] " ${P}* | \
LC_ALL=C sort | uniq -c | sort -n | tail -n 5 ; done
== c ==
741 F.C.
755 J.
2207 Co.
3047 b.
4590 r.
== g ==
654 etc.
778 T.D.
1221 I.R.
1766 lch.
2424 Uimh.
== N ==
1496 e.g.
1626 i.e.
3181 Co.
3761 lch.
5113 Uimh.
== o ==
161 c.s.
178 Uimh.
300 srl.
355 D.
897 Co.
== p ==
2422 lch.
2976 etc.
3621 Co.
10831 Lch.
17228 Uimh.
== w ==
244 F.C.
270 p.
359 J.
587 ll.
1328 lch.
Either these frequencies are not high enough for inclusion in the vocabulary, or the tools that build the vocabulary carry out additional tokenisation. While trying to get BERT vectors for tokenised text for dependency parsing, I noticed that huggingface's BERT interface splits at all whitespace and all punctuation characters, and there is no option to switch this off:
from transformers import AutoTokenizer

tokeniser = AutoTokenizer.from_pretrained('bert-base-uncased')
example_batch = [
    'hello world !'.split(),
    """tokenisation 's trouble""".split(),
]
tokenised_text = tokeniser(
    example_batch,
    # pre-tokenised input
    is_split_into_words = True,  # TODO: this doesn't seem to do what we expect it to do
)
print('converted back:')
for i, token_ids in enumerate(tokenised_text['input_ids']):
    print(i, tokeniser.convert_ids_to_tokens(token_ids))
producing
converted back:
0 ['[CLS]', 'hello', 'world', '!', '[SEP]']
1 ['[CLS]', 'token', '##isation', "'", 's', 'trouble', '[SEP]']
The apostrophe-s should not be split without "##" glue. Some tokenisation is going on before the tokens are split into word pieces. Looking at the source code, it seems that BERT always splits at whitespace and non-alpha-numeric characters, called "punctuation" in the code. (For the purpose of getting the vector of the first word piece of each of my tokens, I can hack around this by calling BERT's tokeniser for each of my tokens separately and tracing what comes out. I also wonder what BERT-based sequence taggers and dependency parsers do.)
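A minimal sketch of that hack, assuming the same bert-base-uncased tokeniser as above (the helper name first_wordpiece_indices is just for illustration):
from transformers import AutoTokenizer

tokeniser = AutoTokenizer.from_pretrained('bert-base-uncased')

def first_wordpiece_indices(tokens):
    # Tokenise each pre-tokenised token separately, so that we know
    # which word pieces belong to which input token, and record the
    # position of the first word piece of each token.
    pieces = ['[CLS]']
    first_indices = []
    for token in tokens:
        first_indices.append(len(pieces))
        pieces.extend(tokeniser.tokenize(token))
    pieces.append('[SEP]')
    return pieces, first_indices

pieces, first = first_wordpiece_indices("""tokenisation 's trouble""".split())
print(pieces)  # ['[CLS]', 'token', '##isation', "'", 's', 'trouble', '[SEP]']
print(first)   # [1, 3, 5] -> positions whose vectors to keep
The punctuation splitting still happens inside each token, but at least we know which pieces belong to which of our tokens; the pieces would then be converted to ids with tokeniser.convert_tokens_to_ids() before running the model.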
If this is what BERT does it only makes sense that the tools for building a BERT vocabulary do the same, i.e. aggressively split any non-alpha-numeric characters from any sequences of alpha-numeric characters, before collecting character sequence statistics and deciding what sequences to include in the vocabulary.
If all major BERT implementations split at non-alpha-numeric characters, there is no point in trying to fix this, and any improvements in tokenisation are a waste of time: all non-trivial tokenisation issues involve non-alpha-numeric characters, and whatever we do, BERT will change it.
Yes, it seems there is some additional tokenization procedure going on apart from running a typical tokenizer (UDPipe, spaCy, Moses, etc.): see BERT's Tokenizer and WikiBERT's berttokenizer.
For sequence taggers and dependency parsers, which require one input per word, there are tools that create offsets between the tokenized input and the word pieces, mapping each index in the tokenized sentence to its word pieces. Those word pieces can then be averaged for each token, or only the first one can be taken.
The implementation in AllenNLP can be found here: https://github.com/allenai/allennlp/blob/d0a07fb32811c649185ee99d71373cc7cab8791e/allennlp/data/token_indexers/pretrained_transformer_mismatched_indexer.py
It looks like each implementation is independent but does the same thing, e.g. Google's _run_split_on_punc() calls an _is_punctuation() function that uses the same definition of punctuation as huggingface. Hopefully, AllenNLP does the same. If not, there may be a small performance degradation when using our AllenNLP-trained gaBERT model with huggingface or Google libraries.
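For reference, the punctuation test in Google's tokenization.py (mirrored by huggingface) is roughly the following: every non-alphanumeric ASCII character plus everything in a Unicode P* category counts as punctuation.
import unicodedata

def _is_punctuation(char):
    # All non-letter/number ASCII is treated as punctuation, even
    # characters like "^", "$" and "`" that are not in a Unicode
    # punctuation category.
    cp = ord(char)
    if ((33 <= cp <= 47) or (58 <= cp <= 64) or
            (91 <= cp <= 96) or (123 <= cp <= 126)):
        return True
    # Pc, Pd, Ps, Pe, Pi, Pf, Po
    return unicodedata.category(char).startswith('P')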
As to reconstructing the original tokenisation, the code you linked shows that each word piece has a text_id attribute that tells users what input token it comes from. To check how it is assigned, we'd have to check the implementation of intra_word_tokenize(). In huggingface's run_ner.py example, I see transformers has the info in tokenised_text.word_ids(batch_index = i). I added it to my test code above and it looks good.
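The addition looks roughly like this (word_ids() is only available on the fast tokenisers):
print('word ids:')
for i, token_ids in enumerate(tokenised_text['input_ids']):
    pieces = tokeniser.convert_ids_to_tokens(token_ids)
    word_ids = tokenised_text.word_ids(batch_index = i)
    # None marks special tokens such as [CLS] and [SEP]; equal indices
    # mark word pieces that come from the same input token
    print(i, list(zip(pieces, word_ids)))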
The only open question for this issue is then whether the vocabulary builder also applies this aggressive tokenisation. A simple test would be to run it on a toy dataset that contains an X+PUNCT token and see whether it shows up in the vocabulary, setting the vocabulary size to the number of characters in the input so that it cannot be fully used. Can you do this?
Good idea. Yes, I can do this, but I probably can't take a look at it until tomorrow or Thursday. I'm working on non-gaBERT stuff for most of today, then I have to launch the OpusFilter run and try a RoBERTa run tomorrow, but it shouldn't take me long to implement once I get a chance to work on it.
For huggingface's BPE tokeniser, I can confirm the answer is that no X+PUNCT tokens make it into the vocabulary:
# closely following
# https://huggingface.co/docs/tokenizers/python/latest/quicktour.html
name = 'tiny'
from tokenizers import Tokenizer
from tokenizers.models import BPE
tokenizer = Tokenizer(BPE(unk_token='[UNK]'))
from tokenizers.trainers import BpeTrainer
trainer = BpeTrainer(
    special_tokens = '[UNK] [CLS] [SEP] [PAD] [MASK]'.split(),
)
from tokenizers.pre_tokenizers import Whitespace
tokenizer.pre_tokenizer = Whitespace()
files = [f'{name}-{split}.raw' for split in 'test train valid'.split()]
tokenizer.train(files, trainer)
tokenizer.save(f'tokeniser-{name}.json')
Corpus:
$ head tiny-*
==> tiny-test.raw <==
Confidential text in the test data .
This must not appear in the vocabulary.
123 456 secret text .
==> tiny-train.raw <==
Tokenisation 's trouble .
Another common abbrev. is e.g. that is used to start a list of examples .
There are many colours , e.g. red , green and blue .
==> tiny-valid.raw <==
This text is only in the validation data .
It should not appear in the vocabulary .
Vocabulary entries containing a full stop:
$ grep -o -E '["][a-zA-Z.][a-zA-Z.]*["]' tokeniser-tiny.json | fgrep .
"."
Regarding a clean experimental setup, though, I find it a bit worrying that the validation and test data also inform the vocabulary:
$ grep -o -E '["][a-zA-Z.][a-zA-Z.]*["]' tokeniser-tiny.json | grep -E "(secret|validation)"
"validation"
"secret"
I ran the pipeline with a sample of abbreviated tokens (punct.txt). UDPipe was splitting them, so I also pasted them in again as unitary tokens.
I then set the vocab size to 30, which is around the number of characters in the vocabulary in punct.txt.
I then increased the vocab size to 78 so that larger pieces would be included. I've attached the WordPiece vocab files from those runs below. In both, there are no vocabulary items that combine letters and punctuation.
Thanks. Can you run this again with just the second half (the unitary tokens)? Depending on how exactly the decision whether to include the longer units is made, the presence of the split versions may suppress the longer ones.
Since inconsistencies in the tokenisation are hard to avoid when working with corpora from different sources, it may help the final model to force tokens like "etc." to be split into two word pieces. Before we train BERT, we could remove each vocabulary entry X+PUNCT if X is already in the vocabulary, and replace X+PUNCT with X if it is not. This would help in particular if the user's tokeniser splits more aggressively than our tokenisers.
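A sketch of that post-processing, assuming a plain vocab.txt with one entry per line; the regex only covers the simple trailing-punctuation case (etc., lch., ##Co.), so entries with internal punctuation such as e.g. would need a broader definition:
import re

# crude approximation of "X+PUNCT": an optional '##' continuation
# marker, then alphanumerics, then one or more trailing non-alphanumerics
X_PUNCT = re.compile(r'^(##)?([A-Za-z0-9]+)[^A-Za-z0-9]+$')

def strip_x_punct(vocab):
    # Drop X+PUNCT entries whose X is already in the vocabulary and
    # replace the remaining ones with their X part.
    vocab_set = set(vocab)
    new_vocab, seen = [], set()
    for entry in vocab:
        match = X_PUNCT.match(entry)
        if match:
            x = (match.group(1) or '') + match.group(2)
            if x in vocab_set or x in seen:
                continue    # X is already covered, drop X+PUNCT
            entry = x       # otherwise keep the entry as plain X
        if entry not in seen:
            new_vocab.append(entry)
            seen.add(entry)
    return new_vocab

with open('vocab.txt') as infile:
    vocab = [line.rstrip('\n') for line in infile]
with open('vocab-no-x-punct.txt', 'w') as outfile:
    outfile.write('\n'.join(strip_x_punct(vocab)) + '\n')
Note that this shrinks the vocabulary, so the model's vocab_size would have to be set after this step.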