Open kylegao91 opened 6 years ago
If the user omits an unknown token from `specials`, they're asserting that no unseen tokens will exist in the dataset. We should probably check this and throw an error when we try to index `stoi` with such a token, but other than lacking that error the behavior reported is essentially intentional. In particular, it's important that removing an unknown token from `specials` shifts the other tokens into the zero position, because the motivation for adding the option of no unknown token was to avoid having an unnecessary extra option in the softmax.
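To make the suggested check concrete, here is a rough sketch in plain Python. It is not torchtext's actual implementation: `StrictStoi`, `build_itos`, and the `unk_token` parameter are hypothetical names used only for illustration. It assumes `itos` is built from the specials followed by tokens in descending frequency, and it raises an error on an unseen token only when no unknown token was supplied.

```python
from collections import Counter


class StrictStoi(dict):
    """Hypothetical token-to-index mapping (not torchtext's class)."""

    def __init__(self, itos, unk_token="<unk>"):
        super().__init__((tok, i) for i, tok in enumerate(itos))
        # Fall back to an unknown index only if the unknown token is present.
        self.unk_index = self.get(unk_token)

    def __missing__(self, token):
        # Called by dict lookup when `token` is not a key.
        if self.unk_index is None:
            raise KeyError(
                f"unseen token {token!r} and no unknown token in specials"
            )
        return self.unk_index


def build_itos(counter, specials):
    # Specials first, then tokens by descending frequency (ties alphabetical).
    ordered = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    return list(specials) + [tok for tok, _ in ordered if tok not in specials]


counter = Counter("the cat sat on the mat the end".split())

with_unk = StrictStoi(build_itos(counter, ["<unk>", "<pad>"]))
print(with_unk["dog"])   # unseen token falls back to the <unk> index

no_unk = StrictStoi(build_itos(counter, ["<pad>"]))
print(no_unk["the"])     # other tokens shift down by one position
try:
    no_unk["dog"]
except KeyError as err:
    print("error:", err)  # unseen token is rejected instead of mapped to 0
```

With this design, dropping the unknown token still shifts every other token down by one index, so the softmax has no unused slot, but a lookup of an unseen token fails loudly rather than silently returning index 0.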
By default, a vocabulary's `stoi` maps any unknown token to the default id 0. However, since `itos` is initialized from `specials`, it would map both PAD and UNK to 0 when `specials=['<pad>']`, and in the worst case, when `specials=[]`, it'd map the most frequent token to 0. Please find a minimal reproducible example below: