spro / practical-pytorch

This repo is deprecated and no longer maintained; go to https://github.com/pytorch/tutorials instead.

question of 'context vector' in seq2seq-translation/seq2seq-translation-batched.ipynb #81

Open seabay opened 6 years ago

seabay commented 6 years ago

Hi all, I'm confused about this line:

decoder_hidden = encoder_hidden[:decoder_test.n_layers] # Use last (forward) hidden state from encoder,

shouldn't this be

decoder_hidden = encoder_hidden[decoder_test.n_layers:] ? That slice gives the hidden state of the second layer.
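For context, a minimal sketch of the tensor shapes in question, assuming a 2-layer bidirectional GRU encoder as in the notebook (the sizes and variable names below are illustrative, not taken from the tutorial):

```python
import torch
import torch.nn as nn

# Sketch: a 2-layer bidirectional GRU, like the notebook's encoder.
n_layers, hidden_size, input_size, batch_size, seq_len = 2, 8, 8, 3, 5
gru = nn.GRU(input_size, hidden_size, num_layers=n_layers, bidirectional=True)

output, encoder_hidden = gru(torch.randn(seq_len, batch_size, input_size))
print(encoder_hidden.shape)  # (n_layers * 2, batch, hidden) = (4, 3, 8)

# The two slices being discussed, each of shape (2, 3, 8):
first_half = encoder_hidden[:n_layers]    # rows 0 and 1 of the 4 states
second_half = encoder_hidden[n_layers:]   # rows 2 and 3 of the 4 states
```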

iamkissg commented 6 years ago

Hi @seabay

You might be misunderstanding what the hidden states of the 1st and the 2nd layer are.

With encoder_hidden[:decoder_test.n_layers] we extract the normal time-order hidden state --->, while encoder_hidden[decoder_test.n_layers:] gives us the reverse time-order hidden state <---.

In my opinion it may not really matter which one you use, though it is more common to use the normal time-order hidden state of a Bi-RNN.

Hope it helps.

seabay commented 6 years ago

Hi @Engine-Treasure, I think the number of layers has nothing to do with bidirectionality. For example, the encoder here is a 2-layer Bi-RNN, so the hidden state has 2 * 2 = 4 parts: the first two are the forward and backward states of layer 1, and the last two are for layer 2.

So the question is: do we use the hidden state of layer 1 or layer 2?

iamkissg commented 6 years ago

Hi @seabay

So sorry, I mixed up num_layers and hidden_size.

Then another question comes up: do the forward and backward hidden states alternate within layers, or do all the forward hidden states come first?

[
layer0_forward
layer0_backward
layer1_forward
layer1_backward
]

or

[
layer0_forward
layer1_forward
layer0_backward
layer1_backward
]

You can find some answers here

@spro's answer there is that they alternate within layers. However, the code we're talking about does not seem to match that answer.

I just get more confused, :(
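One way to settle the ordering question empirically is to compare h_n against the output tensor of a small bidirectional GRU (output only contains the top layer's states, so only the top layer can be matched this way). A sketch, with made-up sizes:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_layers, hidden_size, batch_size, seq_len = 2, 6, 2, 7
gru = nn.GRU(4, hidden_size, num_layers=n_layers, bidirectional=True)
output, h_n = gru(torch.randn(seq_len, batch_size, 4))

# If the layout is [l0_fwd, l0_bwd, l1_fwd, l1_bwd], then h_n[2] is the
# top layer's forward state and must equal the forward half of the output
# at the last time step, while h_n[3] (top layer, backward) must equal
# the backward half of the output at the first time step.
print(torch.allclose(h_n[2], output[-1, :, :hidden_size]))  # True
print(torch.allclose(h_n[3], output[0, :, hidden_size:]))   # True
```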

seabay commented 6 years ago

Hi @Engine-Treasure, based on my experiments the code matches the first layout, the one that alternates within layers. But why did @spro choose the first layer as the context vector for the decoder?
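This finding matches the PyTorch docs, which say h_n can be viewed as (num_layers, num_directions, batch, hidden_size). That view makes it explicit that the notebook's slice picks the first (bottom) layer's forward and backward states rather than the top layer's. A small sketch, with illustrative sizes:

```python
import torch
import torch.nn as nn

n_layers, hidden_size, batch_size = 2, 8, 3
gru = nn.GRU(8, hidden_size, num_layers=n_layers, bidirectional=True)
_, encoder_hidden = gru(torch.randn(5, batch_size, 8))

# View the (n_layers * 2, batch, hidden) tensor as (n_layers, 2, batch, hidden)
# to separate layers from directions, as described in the PyTorch docs.
by_layer = encoder_hidden.view(n_layers, 2, batch_size, hidden_size)
assert torch.equal(encoder_hidden[:n_layers], by_layer[0])  # layer 0: (fwd, bwd)
assert torch.equal(encoder_hidden[n_layers:], by_layer[1])  # layer 1: (fwd, bwd)
```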

spro commented 6 years ago

This won't be a very satisfying answer, but I believe the reason is just that this is left over from a non-bidirectional encoder, and this slicing was a workaround to make it fit the decoder. The batched version is still very much a work in progress (despite the lack of recent progress).

Two better solutions would be:

zhongpeixiang commented 6 years ago

I have the same question, and intuitively I agree with pattern 1. But through my experiments I found the following:

https://discuss.pytorch.org/t/gru-output-and-h-n-relationship/12720

zhongpeixiang commented 6 years ago

@spro I think summing the forward and backward hidden states at the last position of the encoder would not be a good idea, because the backward hidden state at the last position contains little information about the sentence.
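For what it's worth, one common way to build the decoder's initial hidden state from a bidirectional encoder is to take the top layer's final forward and backward states from h_n (each of which has processed the whole sentence, unlike the backward component of the output at the last position) and either sum them or concatenate and project them. This is only an illustrative sketch, not necessarily one of the solutions @spro had in mind; the bridge layer and all names below are made up:

```python
import torch
import torch.nn as nn

n_layers, hidden_size, batch_size = 2, 8, 3
encoder = nn.GRU(8, hidden_size, num_layers=n_layers, bidirectional=True)
bridge = nn.Linear(2 * hidden_size, hidden_size)  # hypothetical projection layer

_, h_n = encoder(torch.randn(5, batch_size, 8))
h_n = h_n.view(n_layers, 2, batch_size, hidden_size)
fwd, bwd = h_n[-1, 0], h_n[-1, 1]  # top layer, forward and backward final states

# Option A: concatenate both directions and project down to hidden_size.
decoder_h0 = torch.tanh(bridge(torch.cat([fwd, bwd], dim=1)))
# Option B: simply sum the two directions (no extra parameters).
decoder_h0_sum = fwd + bwd
print(decoder_h0.shape, decoder_h0_sum.shape)  # both (batch_size, hidden_size)
```

If the decoder itself has several layers, this vector would still need to be repeated or otherwise expanded to the (n_layers, batch, hidden_size) shape the decoder expects as its initial hidden state.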