tensorflow / nmt

TensorFlow Neural Machine Translation Tutorial
Apache License 2.0

get the decoder hidden states after decoding #170

Open rajarsheem opened 6 years ago

rajarsheem commented 6 years ago

After I pass an explicit output layer like here, I see that the decoder outputs after dynamic_decode are the output distribution of size |V|, where V is the vocab. How can I recover the decoder hidden states?

A follow-up question: in tf.contrib.seq2seq.BasicDecoder, the output_layer parameter is optional. Then, while doing greedy decoding, if I don't pass any value for that parameter, will it perform argmax on the RNN hidden states and pass the result to the next time step's decoder input (which is actually unintended)?

rajarsheem commented 6 years ago

Need attention, please!

oahziur commented 6 years ago

@rajarsheem Yes, the outputs of dynamic_decode in the NMT codebase are the vocab logits.

If you don't give BasicDecoder an output_layer and use GreedyEmbeddingHelper, I think it will use the RNN hidden states as the logits and argmax over the hidden state to try to get a word id.

Not passing the output_layer is useful when using other helpers, such as ScheduledOutputTrainingHelper.

You may implement a custom helper that takes the output layer, generates the vocab logits within the helper, and returns both the vocab logits and the hidden states.

rajarsheem commented 6 years ago

It's kind of strange that the hidden states are, by default, not exposed :/

rajarsheem commented 6 years ago

"I think it will use RNN hidden states as the logits, and argmax on the hidden state to try to get a word id." It looks very undesirable but then I don't know why output_layer is not provided in the decoded here. Is it okay to leave like this -- using the hidden state argmax to pick next time step's input ?

oahziur commented 6 years ago

@rajarsheem

During training, we can apply the output_layer after all time steps have finished here because we already have the word ids in the target language. So the outputs here contain the rnn outputs (which is the h state when using an LSTM).

During inference, we have to pass the rnn outputs through the output layer at each time step to get the next word id.
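To make the two paths concrete, here is a minimal sketch using TF 1.x tf.contrib.seq2seq. The names cell, encoder_state, decoder_emb_inp, target_lengths, embedding, start_tokens, end_token, vocab_size and max_len are placeholders for illustration, not identifiers from the NMT code:

```python
import tensorflow as tf

projection_layer = tf.layers.Dense(vocab_size, use_bias=False,
                                   name="output_projection")

# Training (teacher forcing): the next input is the ground-truth target
# embedding, so no projection is needed inside the decoder. dynamic_decode
# returns raw rnn outputs, and the projection is applied to all time steps
# at once afterwards.
train_helper = tf.contrib.seq2seq.TrainingHelper(decoder_emb_inp, target_lengths)
train_decoder = tf.contrib.seq2seq.BasicDecoder(cell, train_helper, encoder_state)
train_outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(train_decoder)
logits = projection_layer(train_outputs.rnn_output)  # [batch, time, |V|]

# Inference (greedy decoding): the helper must argmax over vocab logits to
# pick the next input token, so the projection has to run inside every step.
# With output_layer set, outputs.rnn_output is already the projected logits,
# which is why the raw hidden states are not exposed.
infer_helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(
    embedding, start_tokens, end_token)
infer_decoder = tf.contrib.seq2seq.BasicDecoder(
    cell, infer_helper, encoder_state, output_layer=projection_layer)
infer_outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(
    infer_decoder, maximum_iterations=max_len)
```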

rajarsheem commented 6 years ago

@oahziur I don't get your first point. How can we compute the hidden states of all the steps in the first place without using the output layer and taking the argmax to feed the next step's input?

In other words, how are we computing the outputs here (which are actually rnn states) without needing to feed the output layer argmax (we cannot feed the hidden state argmax, can we?)

oahziur commented 6 years ago

@rajarsheem we don't feed the hidden state argmax because we have the target ids. See how the TrainingHelper is created: https://github.com/tensorflow/nmt/blob/master/nmt/model.py#L373.

rajarsheem commented 6 years ago

Yeah, I get your point. But if I am not using teacher forcing (i.e. I am using GreedyEmbeddingHelper), I would want my predicted ids to be used. And for that to happen, I would need the output layer to be used as part of the decoder.

oahziur commented 6 years ago

@rajarsheem

Yes, the code you referenced in the last comment is only for teacher forcing during training, so that's why the output_layer is not being used.

rajarsheem commented 6 years ago

So I need to hack my way in to use the output_layer as part of the decoder and also make dynamic_decode return the hidden states. Any suggestions on what the flow should be?

oahziur commented 6 years ago

@rajarsheem

Yes, I think you can implement a custom GreedyEmbeddingHelper (which accepts an output layer), so you don't need to pass the output layer to the BasicDecoder.

For example, you can insert code before here to convert the rnn_outputs to logits.
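In case it helps, a rough sketch of that idea under TF 1.x tf.contrib.seq2seq. The class name ProjectingGreedyEmbeddingHelper and its output_layer argument are illustrative, not part of the library:

```python
import tensorflow as tf
from tensorflow.contrib.seq2seq import GreedyEmbeddingHelper


class ProjectingGreedyEmbeddingHelper(GreedyEmbeddingHelper):
  """Greedy helper that applies the output projection itself.

  Because the projection happens inside the helper, BasicDecoder can be
  built without an output_layer, and dynamic_decode then returns the raw
  cell outputs (hidden states) in rnn_output.
  """

  def __init__(self, embedding, start_tokens, end_token, output_layer):
    super(ProjectingGreedyEmbeddingHelper, self).__init__(
        embedding, start_tokens, end_token)
    self._output_layer = output_layer

  def sample(self, time, outputs, state, name=None):
    # Convert raw rnn outputs to vocab logits before the argmax,
    # so the sampled ids are real word ids rather than hidden-unit indices.
    logits = self._output_layer(outputs)
    return super(ProjectingGreedyEmbeddingHelper, self).sample(
        time=time, outputs=logits, state=state, name=name)


# Usage sketch: no output_layer on the decoder, so rnn_output stays raw.
# helper = ProjectingGreedyEmbeddingHelper(embedding, start_tokens,
#                                          end_token, projection_layer)
# decoder = tf.contrib.seq2seq.BasicDecoder(cell, helper, encoder_state)
```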

rajarsheem commented 6 years ago

This is what I did: I added a new attribute, final_output, to the BasicDecoderOutput namedtuple that stores the projected outputs whenever BasicDecoder has an output_layer. In BasicDecoder's step(), final_outputs (the linearly transformed cell_outputs) is what goes into sample() and is also passed into outputs, which is essentially a BasicDecoderOutput and is returned. There were a few other changes as well.

Consequently, when dynamic_decode returns a BasicDecoderOutput, it already has an attribute final_output holding the unnormalized logits, with rnn_output being the raw cell output.
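For reference, a rough sketch of what that change might look like as a subclass rather than an in-place edit of contrib code. The names OutputProjectionDecoderOutput / OutputProjectionDecoder are illustrative, and the properties lean on BasicDecoder internals such as _rnn_output_size, so treat this as a sketch, not a drop-in:

```python
import collections
import tensorflow as tf
from tensorflow.contrib.seq2seq import BasicDecoder
from tensorflow.python.util import nest


class OutputProjectionDecoderOutput(
    collections.namedtuple("OutputProjectionDecoderOutput",
                           ("rnn_output", "final_output", "sample_id"))):
  pass


class OutputProjectionDecoder(BasicDecoder):
  """BasicDecoder that returns both raw cell outputs and projected logits."""

  @property
  def output_size(self):
    return OutputProjectionDecoderOutput(
        rnn_output=self._cell.output_size,      # raw hidden state size
        final_output=self._rnn_output_size(),   # size after output_layer
        sample_id=self._helper.sample_ids_shape)

  @property
  def output_dtype(self):
    dtype = nest.flatten(self._initial_state)[0].dtype
    return OutputProjectionDecoderOutput(
        rnn_output=nest.map_structure(lambda _: dtype, self._cell.output_size),
        final_output=nest.map_structure(lambda _: dtype,
                                        self._rnn_output_size()),
        sample_id=self._helper.sample_ids_dtype)

  def step(self, time, inputs, state, name=None):
    cell_outputs, cell_state = self._cell(inputs, state)
    final_outputs = cell_outputs
    if self._output_layer is not None:
      # Projected (vocab-sized) logits; these drive sampling.
      final_outputs = self._output_layer(cell_outputs)
    sample_ids = self._helper.sample(
        time=time, outputs=final_outputs, state=cell_state)
    finished, next_inputs, next_state = self._helper.next_inputs(
        time=time, outputs=final_outputs, state=cell_state,
        sample_ids=sample_ids)
    outputs = OutputProjectionDecoderOutput(cell_outputs, final_outputs,
                                            sample_ids)
    return outputs, next_state, next_inputs, finished
```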

peterpan2018 commented 6 years ago

@oahziur @rajarsheem Could you help with a similar issue #298 ? Thanks.

danielwatson6 commented 6 years ago

+1 because users may have use cases where the decoder's outputs are needed without being passed through the final feedforward layer.

Example: the decoder uses scheduled sampling, so the dense layer is needed inside the decoder, but the user wants to use sampled softmax and hence needs the rnn outputs before they are passed through the dense layer.
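To illustrate that use case (names and shapes below are assumptions): tf.nn.sampled_softmax_loss takes the pre-projection decoder outputs as `inputs` together with the projection weights and biases, so the full |V|-sized logits are never materialized, which is exactly why the raw rnn outputs need to be exposed.

```python
import tensorflow as tf

# rnn_outputs: raw decoder outputs, [batch * time, num_units] (NOT projected)
# proj_w: [vocab_size, num_units], proj_b: [vocab_size]
# target_ids: [batch * time] ground-truth word ids
loss = tf.nn.sampled_softmax_loss(
    weights=proj_w,
    biases=proj_b,
    labels=tf.reshape(target_ids, [-1, 1]),  # [batch * time, 1]
    inputs=rnn_outputs,
    num_sampled=512,
    num_classes=vocab_size)
```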

schmiflo commented 5 years ago

@rajarsheem I had to create an adapted version of BasicDecoder for similar reasons.

Often one wants an RNN to output not only logits but also some additional loss terms or summary metrics. The purpose of the output_layer is to extract logits from a cell output, which is necessary for the mechanics in the step function. However, if an output_layer is present, step currently still returns only the logits (unfortunately also called cell_outputs) rather than the original outputs.