Open rajarsheem opened 6 years ago
Need attention, please!
@rajarsheem Yes, the outputs of dynamic_decode in the NMT codebase are the vocab logits.
If you don't give BasicDecoder an output_layer and use GreedyEmbeddingHelper, I think it will use RNN hidden states as the logits, and argmax on the hidden state to try to get a word id.
Not passing the output_layer is useful when using other helpers, such as ScheduledOutputTrainingHelper.
You may implement a custom helper that takes the output layer, generates the vocab logits within the helper, and returns both the vocab logits and the hidden states.
It's kind of strange that the hidden states are, by default, not exposed :/
"I think it will use RNN hidden states as the logits, and argmax on the hidden state to try to get a word id." That looks very undesirable, but then I don't know why output_layer is not provided in the decoder here. Is it okay to leave it like this -- using the hidden-state argmax to pick the next time step's input?
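A tiny NumPy sketch (all sizes and names here are illustrative, not from the NMT codebase) of why the hidden-state argmax is undesirable: the hidden state has far fewer dimensions than the vocabulary, so argmaxing it directly can only ever produce "word ids" below the hidden size, while argmaxing the projected logits ranges over the whole vocab:

```python
import numpy as np

# Hypothetical sizes: hidden state dim 512, vocab 10000.
hidden_size, vocab_size = 512, 10000
rng = np.random.default_rng(0)

cell_output = rng.standard_normal(hidden_size)       # raw RNN hidden state
W = rng.standard_normal((hidden_size, vocab_size))   # output_layer weights

# Without an output_layer, argmax runs over the hidden state itself,
# so the resulting "word id" can never exceed hidden_size - 1:
bad_id = int(np.argmax(cell_output))
assert bad_id < hidden_size                          # ids 512..9999 unreachable

# With the projection, argmax runs over vocab-sized logits:
logits = cell_output @ W
good_id = int(np.argmax(logits))
assert 0 <= good_id < vocab_size
```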
@rajarsheem
During training, we can apply the output_layer after all time steps have finished here, because we already have the word ids in the target language. So the outputs here contain the rnn outputs (which is the h state when using an LSTM).
During inference, we have to pass the rnn outputs through the output layer at each time step to get the next word id.
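That timing difference can be sketched in NumPy (illustrative shapes, not the actual NMT tensors): projecting all steps at once after the loop and projecting inside the loop give the same logits, but only the in-loop version makes the logits available in time to choose the next input:

```python
import numpy as np

T, H, V = 5, 8, 20                           # illustrative sizes
rng = np.random.default_rng(1)
rnn_outputs = rng.standard_normal((T, H))    # h states for all time steps
W = rng.standard_normal((H, V))              # output_layer weights

# Training: the target ids drive the inputs, so the projection can be
# applied once, to every time step, after decoding finishes:
logits_all = rnn_outputs @ W                 # shape (T, V)

# Inference: the projection must run inside the loop, because the
# argmax of step t's logits becomes step t+1's input:
logits_loop = np.stack([rnn_outputs[t] @ W for t in range(T)])

assert np.allclose(logits_all, logits_loop)
```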
@oahziur I don't get your first point. How can we compute the hidden states of all steps in the first place without using the output layer and taking the argmax to feed as the next step's input?
In other words, how are we computing the outputs here (which are actually rnn states) without needing to feed the output layer's argmax? (We cannot feed the hidden-state argmax, can we?)
@rajarsheem We don't feed the hidden-state argmax because we have the target ids. See how the TrainingHelper is created: https://github.com/tensorflow/nmt/blob/master/nmt/model.py#L373.
Yeah, I get your point. But if I am not using teacher forcing (i.e., using GreedyEmbeddingHelper), I would want my predicted ids to be used. And for that to happen, I would need the output layer to be used as part of the decoder.
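A minimal NumPy sketch of the two feeding rules; the toy cell, embedding table, and start-token id 0 are assumptions, not the real TrainingHelper/GreedyEmbeddingHelper internals:

```python
import numpy as np

V, H = 6, 4
rng = np.random.default_rng(2)
embed = rng.standard_normal((V, H))        # toy embedding table
W_out = rng.standard_normal((H, V))        # output projection

def cell(inp, state):                      # stand-in for an RNN cell
    state = np.tanh(inp + state)
    return state, state                    # (output, new state)

target_ids = [3, 1, 4]                     # ground-truth ids

# Teacher-forcing loop: the next input comes from target_ids,
# so no projection/argmax is needed inside the loop.
state, inp = np.zeros(H), embed[0]         # id 0 = start token (assumed)
for t in range(len(target_ids)):
    out, state = cell(inp, state)
    inp = embed[target_ids[t]]

# Greedy loop: the next input is the embedding of the argmax over
# projected logits, so the projection must live inside the loop.
state, inp = np.zeros(H), embed[0]
pred_ids = []
for t in range(len(target_ids)):
    out, state = cell(inp, state)
    pred = int(np.argmax(out @ W_out))
    pred_ids.append(pred)
    inp = embed[pred]

assert all(0 <= p < V for p in pred_ids)
```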
@rajarsheem
Yes, the code you referenced in your last comment is only for teacher forcing during training; that's why the output_layer is not being used.
So I need to hack my way in to use output_layer as part of the decoder and also make dynamic_decode return the hidden states. Any suggestions about what the flow should be?
@rajarsheem
Yes, I think you can implement a custom GreedyEmbeddingHelper (which accepts an output layer), so you don't need to pass the output layer to the BasicDecoder.
For example, you can insert code before here to convert the rnn_outputs to logits.
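A rough sketch, in plain NumPy, of what such a helper could look like; the class name, method names, and shapes are hypothetical, not the actual tf.contrib.seq2seq Helper interface:

```python
import numpy as np

class GreedyHelperWithProjection:
    """Hypothetical sketch of the suggestion above: the helper owns the
    output layer, so the decoder can be built without output_layer and
    the decoder's rnn_output stays the raw hidden state."""

    def __init__(self, embedding, output_layer):
        self.embedding = embedding          # (V, H) lookup table
        self.output_layer = output_layer    # callable: (H,) -> (V,) logits

    def sample(self, cell_outputs):
        # Convert rnn_outputs to logits *inside* the helper, then argmax.
        logits = self.output_layer(cell_outputs)
        return int(np.argmax(logits))

    def next_inputs(self, sample_id):
        # Feed the embedding of the sampled id at the next time step.
        return self.embedding[sample_id]

rng = np.random.default_rng(3)
V, H = 10, 4
embed = rng.standard_normal((V, H))
W = rng.standard_normal((H, V))
helper = GreedyHelperWithProjection(embed, lambda h: h @ W)

cell_out = rng.standard_normal(H)           # raw hidden state from the cell
word_id = helper.sample(cell_out)           # vocab-sized argmax
assert 0 <= word_id < V
assert helper.next_inputs(word_id).shape == (H,)
```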
This is what I did: added a new attribute final_output in the BasicDecoderOutput namedtuple that stores the projected outputs whenever there is an output_layer in BasicDecoder. In the step() of BasicDecoder, final_outputs (which is the linearly transformed cell_outputs) is what goes into sample and is also sent as a parameter to outputs (which is essentially a BasicDecoderOutput) and is returned. A few other changes were needed. Consequently, when dynamic_decode returns a BasicDecoderOutput, it already has an attribute final_output that holds the unnormalized logits, with rnn_output being the cell output.
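A minimal sketch of that change in plain Python/NumPy; the field names mirror the description above, while the shapes and the toy step function are illustrative:

```python
import collections
import numpy as np

# Hypothetical mirror of the change described above: the per-step output
# tuple carries both the raw cell output and the projected logits.
BasicDecoderOutput = collections.namedtuple(
    "BasicDecoderOutput", ["rnn_output", "sample_id", "final_output"])

def step(cell_output, W_out):
    final_output = cell_output @ W_out           # unnormalized vocab logits
    sample_id = int(np.argmax(final_output))     # sample() sees the logits
    return BasicDecoderOutput(rnn_output=cell_output,
                              sample_id=sample_id,
                              final_output=final_output)

rng = np.random.default_rng(4)
H, V = 4, 9
out = step(rng.standard_normal(H), rng.standard_normal((H, V)))
assert out.rnn_output.shape == (H,) and out.final_output.shape == (V,)
assert 0 <= out.sample_id < V
```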
@oahziur @rajarsheem Could you help with a similar issue #298 ? Thanks.
+1 because users may have use cases where the decoder's outputs are needed without being passed through the final feedforward layer.
Example: the decoder uses scheduled sampling, so the dense layer is needed, but the user wants to use sampled softmax and hence needs the rnn outputs without being passed through the dense layer.
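A small NumPy sketch of why sampled softmax wants the pre-projection rnn outputs (the vocab size, sample count, and weight layout are illustrative): the loss projects the hidden state onto only a sampled subset of softmax weight rows, so running the full dense layer first would do the work the sampling is meant to avoid:

```python
import numpy as np

rng = np.random.default_rng(5)
H, V, num_sampled = 4, 1000, 10
h = rng.standard_normal(H)                 # rnn output (not logits!)
W = rng.standard_normal((V, H))            # softmax weights, one row per word

true_id = 42
negatives = rng.choice(V, size=num_sampled, replace=False)
cand = np.concatenate(([true_id], negatives))

# Project onto 11 sampled rows instead of all 1000:
cand_logits = W[cand] @ h
assert cand_logits.shape == (num_sampled + 1,)
```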
@rajarsheem I had to create an adapted version of BasicDecoder for similar reasons. Often one wants an RNN to output not only logits but also some additional loss terms or summary metrics. The role of the output_layer is to extract logits from a cell output, which is necessary for the mechanics in the step function. However, if an output_layer is present, step currently still returns only the logits (unfortunately also called cell_outputs) rather than the original outputs.
After I pass an explicit output layer like here, I see that the decoder outputs after
dynamic_decode
are the output distribution of size |V|, where V is the vocab. How can I recover the decoder hidden states? A follow-up question: in tf.contrib.seq2seq.BasicDecoder, the output_layer parameter is optional. Then, while doing greedy decoding, if I don't pass any value for that parameter, will it perform argmax on the RNN hidden states and pass the output to the next time step's decoder input (which is actually unintended)?