google-research / bert

TensorFlow code and pre-trained models for BERT
https://arxiv.org/abs/1810.04805
Apache License 2.0

Tokenizer for BERT Pretrained models and punctuation #1047

Open ajmssc opened 4 years ago

ajmssc commented 4 years ago

The provided tokenizers split punctuation characters into their own tokens, so they never get a chance to be picked up by the WordPiece tokenizer (the two-stage pipeline is sketched below). Yet the pre-trained BERT models have WordPiece tokens for punctuation in their vocabularies, for example ##., ##! and ##?.

Here is a colab example: https://colab.research.google.com/drive/18C8-7cZZt3N0QhQUSMCIziPzk2jlWFTF

Did Google use a different tokenizer when training the public models? Do these tokens have any actual impact on encoding text?
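For context on where the split happens: FullTokenizer in this repo's tokenization.py runs BasicTokenizer (which lowercases and splits punctuation into standalone tokens) and only then runs WordpieceTokenizer on each resulting piece. A minimal sketch of the two stages, assuming tokenization.py is importable and the uncased vocab from the colab is at the path below:

import tokenization  # tokenization.py from this repo

vocab = tokenization.load_vocab("/content/uncased_L-12_H-768_A-12/vocab.txt")

# Stage 1: BasicTokenizer splits punctuation into standalone tokens.
basic = tokenization.BasicTokenizer(do_lower_case=True)
print(basic.tokenize("Tokenizer example."))
# ['tokenizer', 'example', '.']

# Stage 2: WordpieceTokenizer only ever sees the already-split pieces,
# so '.' is looked up on its own and '##.' can never be produced.
wordpiece = tokenization.WordpieceTokenizer(vocab=vocab)
print([sub for tok in basic.tokenize("Tokenizer example.") for sub in wordpiece.tokenize(tok)])
# ['token', '##izer', 'example', '.']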

Example:

import tokenization  # tokenization.py from this repo

tokenizer = tokenization.FullTokenizer(vocab_file="/content/uncased_L-12_H-768_A-12/vocab.txt")

print(tokenizer.tokenize("Tokenizer example. The punctuation is separated before wordpiece processing."))

['token', '##izer', 'example', '.', 'the', 'pun', '##ct', '##uation', 'is', 'separated', 'before', 'word', '##piece', 'processing', '.']

Here the punctuation is split off before WordPiece tokenization, yet the vocabulary contains WordPiece tokens for punctuation:

print([token for token in tokenizer.vocab.keys() if token.startswith("##") and not any(c.isalnum() for c in token)])

... '##%', '##&', "##'", '##(', '##)', '##*', '##+', '##,', '##-', '##.', ...
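Just to confirm when those entries could ever match: if WordpieceTokenizer is called directly on a token that still has its punctuation attached (which the standard pipeline never does, since BasicTokenizer runs first), the greedy longest-match does fall back to the ## punctuation entries. A quick check, reusing the tokenizer above:

wordpiece = tokenization.WordpieceTokenizer(vocab=tokenizer.vocab)

# Bypassing BasicTokenizer, so the '.' is still attached to the word.
print(wordpiece.tokenize("example."))
# ['example', '##.']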

heatxg commented 3 years ago

@ajmssc have you found any answers to this question? I am seeing a lot of variation in my predicted distributions even with minor punctuation perturbations, often resulting in a different predicted label altogether.
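For what it's worth, a quick way to check whether the only input-level difference is the extra punctuation token (reusing the FullTokenizer from above; the example sentences are just placeholders):

for text in ["The movie was great", "The movie was great.", "The movie was great!"]:
    tokens = tokenizer.tokenize(text)
    print(text, "->", tokens, tokenizer.convert_tokens_to_ids(tokens))

# The token sequences differ only by the trailing '.' / '!' token, so a label
# flip would come from the model's sensitivity to that extra input id rather
# than from a tokenization artifact.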