pytorch / text

Models, data loaders and abstractions for language processing, powered by PyTorch
https://pytorch.org/text
BSD 3-Clause "New" or "Revised" License

Throw error on vocab.stoi[an_unseen_token] if '<unk>' not in specials #183

Open kylegao91 opened 6 years ago

kylegao91 commented 6 years ago

By default, a vocabulary's stoi maps any unknown token to the default id 0. However, since itos is initialized from specials, this means that with specials=['<pad>'] both '<pad>' and unknown tokens map to 0, and in the worst case, with specials=[], unknown tokens share id 0 with the most frequent known token. Please find a minimal reproducible example below:

from collections import Counter
import torchtext

a = [1, 1, 1, 2, 3]
counter = Counter(a)
vocab = torchtext.vocab.Vocab(counter, specials=[])
# vocab.stoi[5] -> 0  for an unseen token (stoi is a defaultdict falling back to 0)
# vocab.stoi[1] -> 0  for the most frequent known token
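
For contrast, when '<unk>' is included in specials (a sketch against the same legacy Vocab API; the exact default specials may differ between versions), unseen tokens map to the dedicated '<unk>' index instead of colliding with a real token:

vocab_unk = torchtext.vocab.Vocab(counter, specials=['<unk>', '<pad>'])
# vocab_unk.stoi['<unk>'] -> 0, vocab_unk.stoi['<pad>'] -> 1
# vocab_unk.stoi[5] -> 0, i.e. the '<unk>' index, so there is no collision with known tokens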
jekbradbury commented 6 years ago

If the user omits an unknown token from specials, they're asserting that no unseen tokens will appear in the dataset. We should probably check this and throw an error when stoi is indexed with such a token, but aside from that missing error the reported behavior is essentially intentional. In particular, it's important that removing the unknown token from specials shifts the other tokens into the zero position, because the motivation for allowing a vocabulary without an unknown token was to avoid an unnecessary extra entry in the softmax.
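
A minimal sketch of the check being proposed, assuming the legacy torchtext.vocab.Vocab API used in this issue; the StrictVocab wrapper below is hypothetical and not part of torchtext:

from collections import Counter

import torchtext


class StrictVocab(torchtext.vocab.Vocab):
    """Hypothetical Vocab variant: if '<unk>' is not passed in specials,
    look-ups of unseen tokens raise KeyError instead of silently returning 0."""

    def __init__(self, counter, specials=('<unk>',), **kwargs):
        super().__init__(counter, specials=list(specials), **kwargs)
        if '<unk>' not in specials:
            # Replace the defaultdict (which maps every unseen token to 0)
            # with a plain dict so that unseen tokens raise KeyError.
            self.stoi = dict(self.stoi)


counter = Counter([1, 1, 1, 2, 3])
vocab = StrictVocab(counter, specials=[])
print(vocab.stoi[1])   # index of the most frequent known token
vocab.stoi[5]          # raises KeyError instead of silently returning 0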