It's important in an Encoder/Decoder context that the decoder and encoder share the same indexes for EOS, SOS, and PAD.
Hmm, in seq2seq I don't think you're supposed to append the EOS and BOS tokens to the source side (see the implementation in OpenNMT)? I don't think having the same pad tokens really matters, since you should be masking them out anyway / you should stop decoding when you see a <PAD>?
@nelson-liu I think you are right. Going to close this issue.
@nelson-liu Should the decoder predict <PAD> tokens in batching, where you are running the decoder on a batch with variable-sized target lengths?
Accessing the index of padding is very verbose:
PAD_TOKEN='<pad>'
padding_idx = iter.dataset.fields['tgt'].vocab.stoi[PAD_TOKEN]
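Even wrapped in a small helper, every project ends up re-deriving these indices itself. A sketch of such a helper (assuming pad_token, init_token, and eos_token are set on the Field; this is not an existing TorchText API):
def special_indices(field):
    # Look up the special-token indices once from the field's vocab.
    stoi = field.vocab.stoi
    return stoi[field.pad_token], stoi[field.init_token], stoi[field.eos_token]

padding_idx, sos_idx, eos_idx = special_indices(iter.dataset.fields['tgt'])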
During training, it's convenient to pass the iterator to a trainer. Along with the iterator, it's critical to know the index of the pad token for calculating the loss function.
Knowing this is required as part of seq2seq, I think this should be less verbose. For example, Vocab.PAD_TOKEN_IDX.
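For instance, the pad index is what lets the loss skip padded target positions. A minimal sketch using PyTorch's ignore_index (logits and vocab_size are stand-ins, not names from the snippet above):
import torch.nn as nn

# Pad positions in the target contribute nothing to the loss.
criterion = nn.CrossEntropyLoss(ignore_index=padding_idx)
loss = criterion(logits.view(-1, vocab_size), targets.view(-1))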
Should the decoder predict <PAD> tokens in batching, where you are running the decoder on a batch with variable-sized target lengths?
yes, batch decoding is a lot faster but it's also not trivial to implement beam search in that context (obviously moot if you just want to use greedy search).
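For context, a minimal greedy batch-decoding loop might look like this sketch (the decoder(prev_tokens, hidden) -> (logits, hidden) interface and the index arguments are assumptions, not code from this thread):
import torch

def greedy_decode(decoder, hidden, batch_size, max_len, sos_idx, eos_idx, pad_idx):
    # Start every sequence in the batch from <sos>.
    tokens = torch.full((batch_size,), sos_idx, dtype=torch.long)
    finished = torch.zeros(batch_size, dtype=torch.bool)
    outputs = []
    for _ in range(max_len):
        logits, hidden = decoder(tokens, hidden)
        tokens = logits.argmax(dim=-1)
        # Sequences that already produced <eos> emit <pad> from here on.
        tokens = tokens.masked_fill(finished, pad_idx)
        outputs.append(tokens)
        finished = finished | tokens.eq(eos_idx)
        if finished.all():
            break
    return torch.stack(outputs, dim=1)  # (batch, decoded_len)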
I think this should be less verbose.
Fair enough, i think there's a case to be made for constants for {PAD/EOS/INIT}_TOKEN / {PAD/EOS/INIT}_TOKEN_IDX.
They're not really global constants, though; they're specific to a vocabulary. They could be added as attributes to the vocab, or even to the field, but that wouldn't make the access more than a few characters shorter.
I guess the fact that <unk> is 0 and <pad> -- if it exists -- is 1 is guaranteed, so those could be global constants. Or you could just use the literals zero and one, or set those as constants in your own code.
They're not really global constants, though; they're specific to a vocabulary. They could be added as attributes to the vocab
Sorry, the all-caps was misleading. I was suggesting them as attributes to the vocabulary.
But yes, i generally save the field object anyway so the access isn't that long (e.g. self.field.vocab.stoi["<pad>/<eos>/<bos>"]) and it's just a one-time call, so maybe not so important.
Or you could just use the literals zero and one or set those as constants in your own code.
This would be a bit iffy, code-design-wise, because then the code has to trust that those will not change in TorchText.
For a production system, you'd need to wrap those in asserts to make sure everything is kosher.
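For example, something along these lines (assuming the default layout above, where <unk> is 0 and <pad> is 1):
UNK_IDX, PAD_IDX = 0, 1  # assumed TorchText defaults, per the discussion above

# Fail fast if a future version changes the special-token layout.
assert field.vocab.stoi['<unk>'] == UNK_IDX
assert field.vocab.stoi['<pad>'] == PAD_IDX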
attributes to the vocab
Looking for them to be constant for every vocabulary, like Vocab.PAD_IDX, instead of needing to access an instance of the Vocab. Class constants are what I think is the best bet.
one-time call, so maybe not so important
Yes, it's a one-time call to save them, but afterward it's verbose to pass them as parameters to the model, trainer, loss function, etc. Part of reducing the verbosity is not having to pass them as parameters everywhere.
I usually pass the vocab instance to my model constructor, then reference it like this:
# (B, T) buffer of <pad> indices, same tensor type/device as `encoding`
outs = Variable(encoding.data.new(B, T).long().fill_(self.vocab.stoi['<pad>']))
Tl;dr Let's reduce the code complexity required to use this library. Let's take inspiration from OpenNMT on using constants for PAD, EOS, SOS, and UNK.
NOTE: I'd be happy to submit a pull request but I am having a difficult time getting my company to approve it.
I heavily use seq2seq frameworks (mine & OpenNMT).
I usually pass the vocab instance to my model constructor, then reference it like this:
Same with me. For me, it's still going to be a tad more verbose: I need to pass in a constant for <pad> that is saved for me in a SequentialField object.
This is a code snippet from me:
self.padding_idx = self.vocab.stoi[SequentialField.PAD_TOKEN]
self.eos_idx = self.vocab.stoi[SequentialField.EOS_TOKEN]
self.sos_idx = self.vocab.stoi[SequentialField.SOS_TOKEN]
For embeddings decoder:
self.embedding = nn.Embedding(
    self.output_size, self.hidden_size, padding_idx=self.padding_idx)
For embeddings encoder:
self.embedding = nn.Embedding(len(self.vocab), hidden_size, padding_idx=self.padding_idx)
For accuracy:
num_correct_tokens = predictions.data.eq(targets.data).masked_select(
    targets.ne(padding_idx).data).sum()
total_tokens = targets.data.ne(padding_idx).sum()
For loss:
mask = output_field.vocab.stoi[SequentialField.PAD_TOKEN]
loss = Perplexity(weight, mask)
The overall idea is that PAD_IDX is heavily used in my code. OpenNMT also has constants for PAD, SOS, EOS, and UNK; it references PAD 10 times, UNK 5 times, EOS 5 times, and SOS 4 times across different steps.
Ref: https://github.com/OpenNMT/OpenNMT-py/search?utf8=%E2%9C%93&q=onmt.Constants.PAD&type=
Without a constant, it needs to be passed around as an argument to different parts of the code base: loss, model, etc.
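For reference, a constants module in the spirit of OpenNMT-py's onmt.Constants might look like this (the exact token strings and values here are illustrative, not TorchText's):
# constants.py -- modeled loosely after OpenNMT-py's onmt.Constants
PAD = 0
UNK = 1
BOS = 2
EOS = 3

PAD_WORD = '<blank>'
UNK_WORD = '<unk>'
BOS_WORD = '<s>'
EOS_WORD = '</s>'
Every part of the pipeline (model, trainer, loss) can then import these instead of receiving them as arguments.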
i feel like it'd be a mistake to design this library to mimic opennmt's data-handling utils / focus on the seq2seq application. I personally don't really see the need to have it as a constant (it's not too hard to reference self.field.vocab anyway). frankly i don't think the verbosity is an issue, as long as it keeps things clear.
Okay! Thanks for your input!
Context Special tokens are frequently used for masking, padding, or interpreting the model. It's important in an Encoder/Decoder context that the decoder and encoder share the same indexes for EOS, SOS, and PAD.
Problem When creating two fields, one for French and one for English, there are no class constants for the index of eos_token. The only way to find out the index of eos_token is per instance of the class (e.g. self.stoi[eos_token]). By default, the code is not designed to guarantee that the French dictionary has the same EOS index as the English dictionary.
Possible Solution A When setting the optional parameter eos_token, would it be possible to also set eos_token_index?
Possible Solution B Vocab or Field constant for the index of special tokens.
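A rough sketch of Solution B as class-level constants on Vocab (hypothetical; not an existing TorchText API):
class Vocab:
    # Hypothetical class constants: every Vocab would be constructed so that
    # these indices hold regardless of the language/field it was built from.
    UNK_IDX = 0
    PAD_IDX = 1
    SOS_IDX = 2
    EOS_IDX = 3
With that guarantee, a French field and an English field would share the same special-token indices by construction, and model, trainer, and loss code could reference Vocab.PAD_IDX without needing a vocabulary instance.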