[RFC] Special symbols in torchtext.experimental.Vocab

This issue discusses the special symbols used in the experimental Vocabulary class.

Here is a quick search among the vocabulary classes in other libraries:

pytext.torchscript.vocab.ScriptVocabulary has attributes - unk, pad, eos, bos, mask (link)
fairseq.data.dictionary.Dictonary has attributes - unk, pad, bos, eos (link)
allennlp.data.vocabulary.Vocabulary has attributes - oov, pad (link)
torchtext.vocab has attributes - pad, unk (link)

For the current experimental vocab, it has a default unknown token <unk> which is used for the fallback scenario. If <unk> doesn't show up in the corpus during the vocabulary construction, it will be inserted to the beginning by default. The current design of the experimental vocabulary class doesn't have default tokens for padding, end of sentence, beginning of the sentence, and mask tokens.

Because of the confusion of the special symbols in the old vocabulary class (a.k.a. torchtext.vocab.Vocab), we would like to simplify the handling of special symbols. Here are some proposals for torchtext.experimental.vocab.Vocab:

Remove all the default special symbols, including <unk>, such that <unk> will not be added to the vocab instance automatically.
Add function (set_fallback_index) for the fallback index which is used for the scenario that tokens are not found; If users don't call this function explicitly, the vocabulary class will throw out error when tokens are not known for the vocabulary.
```
# 'random token' is not included in the corpus of the vocab
vocab('random token')
>>> Runtime Error: Unknown token. Call set_fullback_index() func
```

vocab.set_fullback_index(index=-1) vocab('random token')

-1

- For the special symbols (e.g. `'<unk>'`, `'<pad>'`), users should insert the tokens with the existing method `self.insert_token(token: str, index: int)`. Later on, when users need the index of the special symbols, they can obtain them by calling the vocab instance. For example:

vocab.insert_token('', 1) pad_idx = vocab('') print(pad_idx) 1

Add a fallback index (-1) which is used for the scenario that tokens are not found;

Suggestion: Let the user specify this via Vocab.set_fallback and disable by default (throw error)

Creates a method called add_unk(unk_symbol, index=0) and let users to add the unknown token after the construction. The unknown symbol will be linked to the fallback index, which is used for unfound token scenario.

Why not just use Vocab.insert or Vocab.append instead of add_unk?

vocab.add_unk('<unk>', index=0)
> vocab('<unk>')
>>> 0
vocab('random token')
>>> 0

I think this is not the same thing. The second should return "-1" if a fallback is set.

Add a fallback index (-1) which is used for the scenario that tokens are not found;

Suggestion: Let the user specify this via Vocab.set_fallback and disable by default (throw error)

Creates a method called add_unk(unk_symbol, index=0) and let users to add the unknown token after the construction. The unknown symbol will be linked to the fallback index, which is used for unfound token scenario.

Why not just use Vocab.insert or Vocab.append instead of add_unk?
vocab.add_unk('<unk>', index=0)
> vocab('<unk>')
>>> 0
vocab('random token')
>>> 0
I think this is not the same thing. The second should return "-1" if a fallback is set.

Thanks for the feedback. add_unk func is actually doing two things: 1. set fallback index; 2. insert <unk> token. I will remove the second step and create a function called set_fallback_index.

cc @bentrevett here is our proposal to address your feedback about the special symbols. Let us know if you have any other feedback about this issue. I will start working on the PR once we agree with the proposal.

Sounds good - but I believe <unk> and <pad> tokens are so commonly used that the default should be to have them in the vocab, but with the option to not add them, i.e. the Vocab should take in an unk_token argument which should default to "<unk>" a pad_token argument which should default to "<pad>". Then somewhere when creating the Vocab there should be something along the lines of:

if unk_token is not None:
    vocab.insert_token(unk_token, 0)
    vocab.set_fallback_index(index=0)
if pad_token is not None:
    vocab.insert_token(pad_token, 1 if unk_token is not None else 0)

Thus, if the user provides an unk_token of None then one isn't added and an error will be throw, and if the user wants to add any special tokens, i.e. sos/eos/mask/etc. then they can manually add them afterward.

I think not including an unk/pad token by default and thus forcing the user to manually add them seems counter-intuitive - if the user is going to add the unk/pad token the vast majority of the time then why doesn't torch text just handle it? - especially since almost every other NLP library has them in the vocabulary by default too.

sos/eos/mask tokens are less commonly used and therefore it's acceptable if they have to be explicitly added by the user. I think it's reasonable for a user to not assume these will already be in the vocabulary, whereas the opposite is true for the unk/pad - a user would reasonably assume these are already part of the vocabulary.

Thank you for your comment and all your contributions and feedback, Ben!

Maybe let's try to list advantages and disadvantages of adding unk/pad by default. Here are some ideas, please add to them if I'm missing something:

Advantages:

Prevents a user from forgetting to add <unk> or <pad>. For example, <unk> then is treated just like any other token and, if it shows up frequently enough, will become part of the Vocab. However, it won't be associated a special index such as "0" by default, which might break downstream assumptions.
More in line with other libraries, which have some special treatment for the <unk>, <pad> and related tokens by default.
This usually leads to the expected behavior and could make it easier for new users to get started.

Disadvantages:

Coupling tokenization conventions with Vocab design. This also introduces the notion that the Vocab should be responsible for handling something like this due to some kind of intrinsic property.
Slight differences in expectations for how <unk> or <pad> are meant to be treated (see Example 1). These can matter a lot and cause frustrating bugs, potentially more frustrating than having to have added these tokens manually.
Libraries that might want to adopt our Vocab as a backend for theirs (for various reasons such as JIT-ability or performance or to simply shed the responsibility of having to maintain it) will need to map their current behavior onto our Vocab. More complexity can make it harder to see the work that needs to be done. Of course, on the other hand, more defaults might make it likely to save some work. I doubt though that a few extra lines is what would make it harder to use this library than unexpected behavior. For default <unk> and <pad> tokens this probably isn't an issue, but what if we continue this pattern and introduce further tokens / features, that we might not anticipate right now and will be needed because we added <unk> and <pad> support to our Vocab.

Example 1:

if unk_token is not None:
    vocab.insert_token(unk_token, 0)
    vocab.set_fallback_index(index=0)

This makes the assumption that the unk_token index is also meant to be used as a fallback when the given token was out of Vocabulary. It also introduces a fallback to begin with. Maybe the user would prefer an error to be thrown if a given token that is considered OOV wasn't marked as . This could be useful if there is an error in the tokenization and the given token is actually never meant to be produced (e.g. text encoding issues or using a new, improved implementation of an existing tokenizer). What behavior should we choose and why and how exactly do the other libraries do this and what's the best standard? Having the user define this herself should prevent this.

Thanks for the comment - good to see your insight into how you see the vocab class being used.

How about instead of the unk/pad/special_tokens being passed to the Vocab object a build_vocab function is responsible for adding the unk/pad/special_tokens? This is basically the torchtext.experimental.vocab.vocab function with a few small changes.

def build_vocab(ordered_dict, min_freq=1, unk_token='<unk>', special_tokens['<pad>']):

    tokens = []
    for token, freq in ordered_dict.items():
        if freq >= min_freq:
            tokens.append(token)

    vocab = Vocab(VocabPybind(tokens))

    # some handling to see if unk_token or any of special_tokens 
    # are already in the vocab should go somewhere in this function?
    # although it is probably better to explicitly throw an error if they are already in the vocab
    # as that is probably not expected (?)

    if unk_token is not None:
        vocab.insert_token(unk_token, 0)
        vocab.set_fallback_index(index = 0)
    for i, token in enumerate(special_tokens):
        vocab.insert_token(token, i if unk_token is None else i + 1)

    return vocab

EDIT Although now I see your point with the allowing the user to handle this as deciding whether to throw an error or not if the unk/pad/special_tokens are already in the Vocab depends on the problem. The option then is to decide if torchtext wants to choose a default, i.e. throw an error, and if the user wants the alternative - not throw an error - then they have to write build_vocab themselves OR not choose a default and always leave it up to the user.

If dealing with your own raw data then it shouldn't have any unks/pads/special_tokens in it and thus should throw an error, but if you're dealing with some preprocessed data, i.e. WikiText which already has been unk'd then you shouldn't throw an error but should still set the fallback index. So this implies we should probably be leaving it to the user? END EDIT

This keeps the advantages you mentioned as it prevents the user from forgetting to add unk/pad, keeps it in line with other libraries, and leads to expected behavior. It also helps with the disadvantages:

less coupling of tokenization and vocab as now the coupling is between tokenization and build_vocab function
unk/pad are still treated as normal tokens in the default Vocab object and the extra functionality is added by the build_vocab function
allows other libraries to adopt torchtext's Vocab object and they only have to write their own build_vocab-esque function
other unseen functionality that arises can be added to the build_vocab functions and the Vocab object can remain unchanged

You can then have other functions, such as build_vocab_from_iterator take kwargs which it can pass to build_vocab.

def build_vocab_from_iterator(iterator, **kwargs):
    counter = Counter()
    for tokens in iterator:
        counter.update(tokens)
    sorted_by_freq_tuples = sorted(counter.items(), key=lambda x: x[1], reverse=True)
    ordered_dict = OrderedDict(sorted_by_freq_tuples)
    word_vocab = build_vocab(ordered_dict, **kwargs)
    return word_vocab

Again, this keeps the sensible defaults and moves all of the configurations out of the Vocab object.

Downsides? This might be just shifting the disadvantages from Vocab to other functions. Not everyone is a fan of adding kwargs to everything which just get passed down through layers of abstractions. Not really that much code for the user to write their own build_vocab function.

The other option, as you've mentioned, would be to have every user write their own function that wraps around the torchtext functions, e.g. build_vocab_from_iterator. A quick example of how I'd probably end up using it (in case that's useful for the discussion) is:

def my_build_vocab_from_iterator(iterator, min_freq, unk_token, pad_token, special_tokens):
    # NOTE: i'd also like some way to set the max_size of the vocab somewhere
    vocab = torchtext.experimental.vocab.build_vocab_from_iterator(iterator, min_freq) # assuming unk_token is removed from here
    # ignoring checks to see if unk/pad/special_tokens are already in the vocab
    if unk_token is not None:
        vocab.insert(unk_token, 0)
        vocab.set_fallback_index(index=0)
    if pad_token is not None:    
        vocab.insert(pad_token, 0 if unk_token is None else 1)
    special_token_offset = sum([x is not None for x in [unk_token, pad_token])
    for i, token in enumerate(special_tokens):
        vocab.insert(token, i + special_token_offset)
    return vocab

But, if I'm going to write code along these lines, i.e. to add unk/pad/special_tokens, for every single NLP project, then should it not be part of torchtext? Obviously, I am not saying just because I'd write it this way it means everyone else will and thus it should be a default. I'm just saying how I would use it and so am biased towards that.

But then again, it's not much code so its fine if its left to the user?

Let's see this problem from another view (a.k.a SentencePiece) as SentencePiece supports encoding as a "Vocabulary". By default

SentencePiece has <unk> token indexed as "0" for the fallback scenario.
SentencePiece has reserved control symbols: unknown, BOS, EOS, PAD tokens have the default index of 0, 1, 2, 3, respectively. Users has the options to reset UNK, BOS, EOS, PAD index. Except for UNK token, the other reserved symbols can be reset to "-1" such that they are removed from the SP model.
If users have additional special symbols, for example '<mask>', '<sep>', '<cls>', those user-defined symbols are handled as one piece in any context symbols, and must be inserted to the SP model manually.

In a word, for SentencePiece, there are two categories of special symbols: reserved vs user-defined

Are there technical reasons that SentencePiece chooses this behavior? I think that's a good question for the other libraries as well. Outside of user convenience, maybe this is something the library developers chose to maybe address performance concerns or to maintain a cleaner implementation.

Thanks for the comment - good to see your insight into how you see the vocab class being used. [...] But then again, it's not much code so its fine if its left to the user?

I think we're pretty much on the same page in terms of being conflicted about having a helper function (e.g. build_vocab) with some defaults vs. requiring he user to write extra code to add in the unk token etc., which she'll probably have to do during each vocab construction.

I do wonder though if, in the end, the complexity that arises from adding all these extra keyword arguments are going to match or out-match the complexity from requiring the user to e.g. add an unk token and fallback index. Further, I'd expect that setting up a vocab is mostly a once-per-dataset decision which will then be parameterized and varied across experiments etc. This is again where transparency and configurability will come in handy, which might be easier to convey via a set of orthogonal functions / methods instead of flags.

On a bit of contrived note: I think it's a very good idea to add the 'max_size' flag to the vocab factory function. Without having to spent much thought about special characters it's obvious how this post-condition is defined and what the algorithm might need to do to achieve this. But what if we have special characters? Will they be included in this count or excluded? It's a minor point and surely both choices are reasonable in their own right, but they'll require documentation and explanation. 'max_size' is a basic condition, but what if we'll come up with more complex means of Vocab construction?

Thanks for the comment - good to see your insight into how you see the vocab class being used. [...] But then again, it's not much code so its fine if its left to the user?

I think we're pretty much on the same page in terms of being conflicted about having a helper function (e.g. build_vocab) with some defaults vs. requiring he user to write extra code to add in the unk token etc., which she'll probably have to do during each vocab construction.

I do wonder though if, in the end, the complexity that arises from adding all these extra keyword arguments are going to match or out-match the complexity from requiring the user to e.g. add an unk token and fallback index. Further, I'd expect that setting up a vocab is mostly a once-per-dataset decision which will then be parameterized and varied across experiments etc. This is again where transparency and configurability will come in handy, which might be easier to convey via a set of orthogonal functions / methods instead of flags.

I agree with you here, and I am now definitely now leaning more towards letting the user explicitly configure this all for themselves, i.e. setting their own fallback index, etc. Might be difficult for users coming from other libraries to "get" this at first but I'm sure with decent documentation and examples it will be fine.

On a bit of contrived note: I think it's a very good idea to add the 'max_size' flag to the vocab factory function. Without having to spent much thought about special characters it's obvious how this post-condition is defined and what the algorithm might need to do to achieve this. But what if we have special characters? Will they be included in this count or excluded? It's a minor point and surely both choices are reasonable in their own right, but they'll require documentation and explanation. 'max_size' is a basic condition, but what if we'll come up with more complex means of Vocab construction?

My default way of thinking is that special tokens should be excluded from this count - so if you set a max_size to 30k and have 10 special tokens then your vocabulary size will then have 30,010 tokens. This seems intuitive if the user has to explicitly configure their vocabulary, "well I've set the maximum size to X and I do add_token ten times then my vocabulary size must be X+10". I don't think doing something like making add_token be "aware" of some sort of max_size parameter and then popping tokens off the end to keep the vocabulary to max_size is very intuitive - the method is called add_token so it doesn't make sense if it also removes tokens.

The problem with only having a min_freq argument is that if you increase the min_freq from e.g. 1 to 2 then your vocabulary size will usually change significantly, in the region of thousands of tokens - whereas having a max_size at least ensures your vocabulary will always be in the ballpark of max_size unless you're adding thousands of special tokens which I don't think is common(?)

Then again, you could also make the argument that the user could manually create the OrderedDict to obey their desired max_size and then pass that to the experimental.vocab.vocab function themselves. But if we're taking that argument then you could also point out that they could do the same thing with min_freq and that experimental.vocab.vocab should just take an OrderedDict only - no other arguments - and create a Vocab based on that, ~~in which case it doesn't even need to be an OrderedDict anymore, it could just be an iterable filled with strings (tokens used to populate the vocabulary)~~ nevermind, it does need to be an OrderedDict as it needs to have no duplicates and maintain order.

pytorch / text

[RFC] Special symbols in torchtext.experimental.Vocab #1016