The provided tokenizers split punctuation into separate tokens, so punctuation never gets a chance to be picked up by the wordpiece tokenizer. Yet the pre-trained BERT models have wordpiece tokens for punctuation in their vocabularies, for example ##., ##!, and ##?.
Did Google use a different tokenizer when training the public models?
Do these tokens have any actual impact when encoding text?
Example:

import tokenization  # tokenization.py from the google-research/bert repo

tokenizer = tokenization.FullTokenizer("/content/uncased_L-12_H-768_A-12/vocab.txt", do_lower_case=True)
print(tokenizer.tokenize("Tokenizer example. The punctuation is separated before wordpiece processing."))

Output:

['token', '##izer', 'example', '.', 'the', 'pun', '##ct', '##uation', 'is', 'separated', 'before', 'word', '##piece', 'processing', '.']

Here the punctuation is split off before wordpiece tokenization, yet the vocabulary contains wordpiece tokens with punctuation:

... '##%', '##&', "##'", '##(', '##)', '##*', '##+', '##,', '##-', '##.', ...

Here is a colab example: https://colab.research.google.com/drive/18C8-7cZZt3N0QhQUSMCIziPzk2jlWFTF
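For what it's worth, the ##-prefixed punctuation pieces are reachable if the basic tokenizer's punctuation split is bypassed. A minimal sketch, assuming the tokenization.py above and the reference FullTokenizer, which exposes its inner wordpiece_tokenizer:

# Feed a word with trailing punctuation straight to the wordpiece tokenizer,
# skipping the basic tokenizer's punctuation split.
pieces = tokenizer.wordpiece_tokenizer.tokenize("processing.")
print(pieces)  # greedy longest-match-first should yield ['processing', '##.']

So the ##. token is only ever produced when punctuation is still attached to a word at wordpiece time, which the provided pipeline never allows.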
@ajmssc have you found any answers to this question? I am seeing a lot of variation in my predicted distributions with even minor punctuation perturbations, often resulting in a different predicted label altogether.
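One quick way to see where that variation enters is to compare the token sequences for lightly perturbed inputs. A hypothetical probe, reusing the tokenizer from the example above:

# Small punctuation edits produce different token sequences,
# so the model sees genuinely different inputs.
for text in ["This movie was great", "This movie was great.", "This movie was great!"]:
    print(text, "->", tokenizer.tokenize(text))

If the token sequences differ, the downstream prediction differences are at least partly a tokenization effect rather than pure model instability.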