There will be many characters that are not in the word piece vocabulary, especially if vocabulary building is limited to the cleanest sources and the model is then applied to very different domains, such as social media content with its rich and ever-expanding set of emojis.
How are OOV characters handled when they occur in the input at training time?
Will the embedding table be expanded to include the new character?
Are the rarest characters mapped to a special <UNK> word piece so the model learns how to handle new characters that appear at test time?
If not, what other strategy is used to handle new characters at test time? One possibility is to replace them with [MASK], pretending one cannot see them. (At pre-training time, it would probably be wise to exclude such tokens from the loss.)
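To make the [MASK]-replacement idea concrete, here is a minimal sketch with a hypothetical toy character vocabulary and token IDs (all names are illustrative, not from any actual tokenizer): OOV characters become [MASK] in the input, and their label positions are set to an ignore index so they contribute nothing to the pre-training loss.

```python
# Hypothetical toy vocabulary; real word-piece vocabularies are much larger.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[MASK]": 2, "h": 3, "i": 4}
MASK_ID = VOCAB["[MASK]"]
IGNORE_INDEX = -100  # common convention: labels with this value are skipped by the loss


def encode_with_mask_fallback(text):
    """Encode characters one by one; OOV characters become [MASK]
    and their label is set to IGNORE_INDEX so they are excluded
    from the loss."""
    input_ids, labels = [], []
    for ch in text:
        if ch in VOCAB:
            input_ids.append(VOCAB[ch])
            labels.append(VOCAB[ch])
        else:
            input_ids.append(MASK_ID)    # pretend we cannot see the character
            labels.append(IGNORE_INDEX)  # exclude this position from the loss
    return input_ids, labels


ids, labels = encode_with_mask_fallback("hi\N{GRINNING FACE}")
```

Here the emoji, which is outside the toy vocabulary, is encoded as the [MASK] ID while its label is the ignore index, so the model sees a masked position but is never penalized for failing to predict the unseen character.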