pytorch / text

Models, data loaders and abstractions for language processing, powered by PyTorch
https://pytorch.org/text
BSD 3-Clause "New" or "Revised" License

Special token index is verbose. #74

Closed PetrochukM closed 7 years ago

PetrochukM commented 7 years ago

Context: Special tokens are frequently used for masking, padding, or interpreting the model. It's important in an Encoder/Decoder context that the decoder and encoder share the same indexes for EOS, SOS, and PAD.

Problem: When creating two fields, one for French and one for English, there are no class constants for the index of eos_token. The only way to find the index of eos_token is per instance of the class (e.g. self.stoi[eos_token]).

By default, the code does not guarantee that the French dictionary has the same EOS index as the English dictionary (see the sketch below).

Possible Solution A: When setting the optional parameter eos_token, would it be possible to also set eos_token_index?

Possible Solution B: A Vocab or Field constant for the index of special tokens.
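
To illustrate, here's roughly what the per-instance access looks like today (a sketch using the legacy torchtext.vocab.Vocab API; the toy counters are just placeholders):

    from collections import Counter
    from torchtext.vocab import Vocab

    specials = ['<unk>', '<pad>', '<sos>', '<eos>']
    en_vocab = Vocab(Counter('the cat sat'.split()), specials=specials)
    fr_vocab = Vocab(Counter('le chat assis'.split()), specials=specials)

    # the only way to get the EOS index is through a particular instance,
    # and nothing in the library itself guarantees that the two agree
    en_eos_idx = en_vocab.stoi['<eos>']
    fr_eos_idx = fr_vocab.stoi['<eos>']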

nelson-liu commented 7 years ago

It's important in an Encoder/Decoder context that the decoder and encoder share the same indexes for EOS, SOS, and PAD.

Hmm, in seq2seq I don't think you're supposed to append the EOS and BOS tokens to the source side (see the implementation in opennmt)? I don't think having the same pad tokens really matters, since you should be masking them out anyway / you should stop decoding when you see a <PAD>?
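
With torchtext that just means configuring the two Fields differently, roughly (a sketch; the tokenizer and token strings are only examples):

    from torchtext.data import Field

    # source side: nothing prepended or appended
    src_field = Field(tokenize=str.split)
    # target side: <sos> prepended and <eos> appended when batches are built
    trg_field = Field(tokenize=str.split, init_token='<sos>', eos_token='<eos>')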

PetrochukM commented 7 years ago

@nelson-liu I think you are right. Going to close this issue.

PetrochukM commented 7 years ago

@nelson-liu Should the decoder predict <PAD> tokens in batching, where you are running the decoder on a batch with variable-sized target lengths?

PetrochukM commented 7 years ago

Accessing the index of padding is very verbose:

PAD_TOKEN='<pad>'
padding_idx = iter.dataset.fields['tgt'].vocab.stoi[PAD_TOKEN]

During training, it's convenient to pass the iterator to a trainer. Along with the iterator, it's critical to know the index of the pad token for calculating the loss function.

Since knowing this index is required as part of seq2seq, I think access should be less verbose, for example Vocab.PAD_TOKEN_IDX.
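
Concretely, that one index then has to be handed to every consumer separately; a sketch (nn.NLLLoss's ignore_index is just one way to mask padding, and vocab_size / hidden_size are placeholders):

    import torch.nn as nn

    criterion = nn.NLLLoss(ignore_index=padding_idx)    # padded target positions contribute no loss
    embedding = nn.Embedding(vocab_size, hidden_size,
                             padding_idx=padding_idx)   # the <pad> embedding is never updated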

nelson-liu commented 7 years ago

Should the decoder predict <PAD> tokens in batching, where you are running the decoder on a batch with variable-sized target lengths?

yes, batch decoding is a lot faster but it's also not trivial to implement beam search in that context (obviously moot if you just want to use greedy search).
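
Roughly, the greedy batched case looks like this (just a sketch; decoder, hidden, and the index arguments are hypothetical):

    import torch

    def greedy_decode(decoder, hidden, batch_size, sos_idx, eos_idx, pad_idx, max_len=50):
        inputs = torch.full((batch_size,), sos_idx, dtype=torch.long)
        finished = torch.zeros(batch_size, dtype=torch.bool)
        outputs = []
        for _ in range(max_len):
            logits, hidden = decoder(inputs, hidden)        # (batch, vocab_size)
            inputs = logits.argmax(dim=-1)
            inputs = inputs.masked_fill(finished, pad_idx)  # finished rows keep emitting <pad>
            outputs.append(inputs)
            finished |= inputs.eq(eos_idx)
            if finished.all():
                break
        return torch.stack(outputs, dim=1)                  # (batch, steps taken)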

I think this should be less verbose.

Fair enough, I think there's a case to be made for constants for {PAD/EOS/INIT}_TOKEN / {PAD/EOS/INIT}_TOKEN_IDX.

jekbradbury commented 7 years ago

They're not really global constants, though; they're specific to a vocabulary. They could be added as attributes to the vocab, or even to the field, but that wouldn't make the access more than a few characters shorter. I guess the fact that <unk> is 0 and <pad> -- if it exists -- is 1 is guaranteed, so those could be global constants. Or you could just use the literals zero and one or set those as constants in your own code.

nelson-liu commented 7 years ago

They're not really global constants, though; they're specific to a vocabulary. They could be added as attributes to the vocab

Sorry, the all-caps was misleading. I was suggesting them as attributes to the vocabulary.

But yes, I generally save the field object anyway, so the access isn't that long (e.g. self.field.vocab.stoi["<pad>/<eos>/<bos>"]), and it's just a one-time call, so maybe not so important.

PetrochukM commented 7 years ago

Or you could just use the literals zero and one or set those as constants in your own code.

This would be a bit iffy, code-design-wise, because then the code has to trust that those will not change in TorchText.

For a production system, you'd need to wrap those in asserts to make sure everything is kosher.
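
Something like (a sketch; field is whichever Field instance the vocab was built on):

    # rely on the documented layout, but verify it up front
    UNK_IDX, PAD_IDX = 0, 1
    assert field.vocab.stoi['<unk>'] == UNK_IDX
    assert field.vocab.stoi['<pad>'] == PAD_IDX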

attributes to the vocab

I'm looking for them to be constant for every vocabulary, like Vocab.PAD_IDX, instead of needing to access an instance of Vocab. Class constants are, I think, the best bet.

one-time call, so maybe not so important

Yes, it's a one-time call to save them. Afterward, it's verbose to pass them as parameters to the model, trainer, loss function, etc. Part of reducing that verbosity is not having to pass them around at all.

jekbradbury commented 7 years ago

I usually pass the vocab instance to my model constructor, then reference it like this:

# fill a (B, T) buffer of decoder outputs with the padding index from the vocab
outs = Variable(encoding.data.new(B, T).long().fill_(
    self.vocab.stoi['<pad>']))

PetrochukM commented 7 years ago

Tl;dr Let's reduce the code complexity required to use this library. Let's take inspiration from OpenNMT and use constants for PAD, EOS, SOS, and UNK.

NOTE: I'd be happy to submit a pull request, but I am having a difficult time getting my company to approve it.


I make heavy use of seq2seq frameworks (mine & OpenNMT).

I usually pass the vocab instance to my model constructor, then reference it like this:

Same here. For me, it's still going to be a tad more verbose: I need to pass in a constant for <pad> that is saved for me in a SequentialField object.

This is a code snippet from me:

        self.padding_idx = self.vocab.stoi[SequentialField.PAD_TOKEN]
        self.eos_idx = self.vocab.stoi[SequentialField.EOS_TOKEN]
        self.sos_idx = self.vocab.stoi[SequentialField.SOS_TOKEN]

For the decoder embeddings:

        self.embedding = nn.Embedding(
            self.output_size, self.hidden_size, padding_idx=self.padding_idx)

For the encoder embeddings:

        self.embedding = nn.Embedding(len(self.vocab), hidden_size, padding_idx=self.padding_idx)

For accuracy:

        # count correct tokens, ignoring positions where the target is <pad>
        num_correct_tokens = predictions.data.eq(targets.data).masked_select(
            targets.ne(padding_idx).data).sum()
        total_tokens = targets.data.ne(padding_idx).sum()

For loss:

    # `mask` is just the pad index, used to exclude padded positions from the perplexity
    mask = output_field.vocab.stoi[SequentialField.PAD_TOKEN]
    loss = Perplexity(weight, mask)

The overall idea is that PAD_IDX is heavily used in my code. OpenNMT also has constants for PAD, SOS, EOS, and UNK; they reference PAD 10 times, UNK 5 times, EOS 5 times, and SOS 4 times across different steps.

Ref: https://github.com/OpenNMT/OpenNMT-py/search?utf8=%E2%9C%93&q=onmt.Constants.PAD&type=

Without a constant, it needs to be passed around as an argument to different parts of the code base (loss, model, etc.).
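
Something like this is all I'm asking for (a sketch of the proposal; it is not part of the torchtext API, and the index values assume the default specials ordering):

    import torch.nn as nn

    class Constants:
        UNK_TOKEN, PAD_TOKEN, SOS_TOKEN, EOS_TOKEN = '<unk>', '<pad>', '<sos>', '<eos>'
        UNK_IDX, PAD_IDX, SOS_IDX, EOS_IDX = 0, 1, 2, 3

    # any consumer can then reference the index directly instead of having it
    # threaded through as an argument:
    criterion = nn.NLLLoss(ignore_index=Constants.PAD_IDX)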

nelson-liu commented 7 years ago

I feel like it'd be a mistake to design this library to mimic OpenNMT's data-handling utils / focus on the seq2seq application. I personally don't really see the need to have it as a constant (it's not too hard to reference self.field.vocab anyway). Frankly, I don't think the verbosity is an issue, as long as it keeps things clear.

PetrochukM commented 7 years ago

Okay! Thanks for your input!