google-research / bert

TensorFlow code and pre-trained models for BERT
https://arxiv.org/abs/1810.04805
Apache License 2.0

Tokenizer for BERT Pretrained models and punctuation #1047

Open ajmssc opened 4 years ago

ajmssc commented 4 years ago

The provided tokenizers split punctuation characters into their own tokens, so they never get a chance to be picked up by the WordPiece tokenizer (the two-stage pipeline is sketched below). Yet the pre-trained BERT models have WordPiece tokens for punctuation in their vocabularies, for example ##., ##! and ##?.

Here is a colab example: https://colab.research.google.com/drive/18C8-7cZZt3N0QhQUSMCIziPzk2jlWFTF

Did Google use a different tokenizer when training the public models? Do these tokens have any actual impact on encoding text?
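For context on where the split happens: FullTokenizer in this repo's tokenization.py runs BasicTokenizer (which lowercases and splits punctuation into standalone tokens) and only then runs WordpieceTokenizer on each resulting piece. A minimal sketch of the two stages, assuming tokenization.py is importable and the uncased vocab from the colab is at the path below:

import tokenization  # tokenization.py from this repo

vocab = tokenization.load_vocab("/content/uncased_L-12_H-768_A-12/vocab.txt")

# Stage 1: BasicTokenizer splits punctuation into standalone tokens.
basic = tokenization.BasicTokenizer(do_lower_case=True)
print(basic.tokenize("Tokenizer example."))
# ['tokenizer', 'example', '.']

# Stage 2: WordpieceTokenizer only ever sees the already-split pieces,
# so '.' is looked up on its own and '##.' can never be produced.
wordpiece = tokenization.WordpieceTokenizer(vocab=vocab)
print([sub for tok in basic.tokenize("Tokenizer example.") for sub in wordpiece.tokenize(tok)])
# ['token', '##izer', 'example', '.']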

Example:

import tokenization  # tokenization.py from this repo

tokenizer = tokenization.FullTokenizer(vocab_file="/content/uncased_L-12_H-768_A-12/vocab.txt")

print(tokenizer.tokenize("Tokenizer example. The punctuation is separated before wordpiece processing."))

['token', '##izer', 'example', '.', 'the', 'pun', '##ct', '##uation', 'is', 'separated', 'before', 'word', '##piece', 'processing', '.']

Here the punctuation is split off before WordPiece tokenization, yet the vocabulary contains WordPiece tokens for punctuation:

print([token for token in tokenizer.vocab.keys() if token.startswith("##") and not any(c.isalnum() for c in token)])

... '##%', '##&', "##'", '##(', '##)', '##*', '##+', '##,', '##-', '##.', ...
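Just to confirm when those entries could ever match: if WordpieceTokenizer is called directly on a token that still has its punctuation attached (which the standard pipeline never does, since BasicTokenizer runs first), the greedy longest-match does fall back to the ## punctuation entries. A quick check, reusing the tokenizer above:

wordpiece = tokenization.WordpieceTokenizer(vocab=tokenizer.vocab)

# Bypassing BasicTokenizer, so the '.' is still attached to the word.
print(wordpiece.tokenize("example."))
# ['example', '##.']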

heatxg commented 3 years ago

@ajmssc have you found any answers to this question? I am seeing a lot of variation in my predicted distributions even with minor punctuation perturbations, often resulting in a different predicted label altogether.
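For what it's worth, a quick way to check whether the only input-level difference is the extra punctuation token (reusing the FullTokenizer from above; the example sentences are just placeholders):

for text in ["The movie was great", "The movie was great.", "The movie was great!"]:
    tokens = tokenizer.tokenize(text)
    print(text, "->", tokens, tokenizer.convert_tokens_to_ids(tokens))

# The token sequences differ only by the trailing '.' / '!' token, so a label
# flip would come from the model's sensitivity to that extra input id rather
# than from a tokenization artifact.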