spro / practical-pytorch

This repo is deprecated and no longer maintained; go to https://github.com/pytorch/tutorials instead.

question of 'context vector' in seq2seq-translation/seq2seq-translation-batched.ipynb #81

Open seabay opened 6 years ago

seabay commented 6 years ago

Hi all, I'm confused about this line:

decoder_hidden = encoder_hidden[:decoder_test.n_layers] # Use last (forward) hidden state from encoder,

shouldn't this be

decoder_hidden = encoder_hidden[decoder_test.n_layers:] ? That slice gives the hidden state of the second layer.
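For context, a minimal sketch of the tensor shapes in question, assuming a 2-layer bidirectional GRU encoder as in the notebook (the sizes and variable names below are illustrative, not taken from the tutorial):

```python
import torch
import torch.nn as nn

# Sketch: a 2-layer bidirectional GRU, like the notebook's encoder.
n_layers, hidden_size, input_size, batch_size, seq_len = 2, 8, 8, 3, 5
gru = nn.GRU(input_size, hidden_size, num_layers=n_layers, bidirectional=True)

output, encoder_hidden = gru(torch.randn(seq_len, batch_size, input_size))
print(encoder_hidden.shape)  # (n_layers * 2, batch, hidden) = (4, 3, 8)

# The two slices being discussed, each of shape (2, 3, 8):
first_half = encoder_hidden[:n_layers]    # rows 0 and 1 of the 4 states
second_half = encoder_hidden[n_layers:]   # rows 2 and 3 of the 4 states
```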

iamkissg commented 6 years ago

Hi @seabay

You might be misunderstanding what the hidden states of the 1st and the 2nd layer are.

With encoder_hidden[:decoder_test.n_layers] we extract the normal time-order hidden state --->, while encoder_hidden[decoder_test.n_layers:] gives us the reverse time-order hidden state <---.

In my opinion it may not really matter which one you use, though it is more common to use the normal time-order hidden state of a Bi-RNN.

Hope it helps.

seabay commented 6 years ago

Hi @Engine-Treasure, I think the number of layers has nothing to do with bidirectionality. For example, the encoder here is a 2-layer Bi-RNN, so the hidden state has 2 * 2 = 4 parts: the first two are the forward and backward states of layer 1, and the last two are for layer 2.

So the question is: do we use the hidden state of layer 1 or layer 2?

iamkissg commented 6 years ago

Hi @seabay

So sorry, I mixed up num_layers and hidden_size.

Then another question comes up: do the forward and backward hidden states alternate within layers, or do all the forward hidden states come first?

[
layer0_forward
layer0_backward
layer1_forward
layer1_backward
]

or

[
layer0_forward
layer1_forward
layer0_backward
layer1_backward
]

You can find some answers here

@spro's answer there is that they alternate within layers. However, the code we're talking about does not seem to match that answer.

I just get more confused, :(
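One way to settle the ordering question empirically is to compare h_n against the output tensor of a small bidirectional GRU (output only contains the top layer's states, so only the top layer can be matched this way). A sketch, with made-up sizes:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_layers, hidden_size, batch_size, seq_len = 2, 6, 2, 7
gru = nn.GRU(4, hidden_size, num_layers=n_layers, bidirectional=True)
output, h_n = gru(torch.randn(seq_len, batch_size, 4))

# If the layout is [l0_fwd, l0_bwd, l1_fwd, l1_bwd], then h_n[2] is the
# top layer's forward state and must equal the forward half of the output
# at the last time step, while h_n[3] (top layer, backward) must equal
# the backward half of the output at the first time step.
print(torch.allclose(h_n[2], output[-1, :, :hidden_size]))  # True
print(torch.allclose(h_n[3], output[0, :, hidden_size:]))   # True
```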

seabay commented 6 years ago

Hi @Engine-Treasure, based on my experiments the code matches the first layout, the one that alternates within layers. But why did @spro choose the first layer as the context vector for the decoder?
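This finding matches the PyTorch docs, which say h_n can be viewed as (num_layers, num_directions, batch, hidden_size). That view makes it explicit that the notebook's slice picks the first (bottom) layer's forward and backward states rather than the top layer's. A small sketch, with illustrative sizes:

```python
import torch
import torch.nn as nn

n_layers, hidden_size, batch_size = 2, 8, 3
gru = nn.GRU(8, hidden_size, num_layers=n_layers, bidirectional=True)
_, encoder_hidden = gru(torch.randn(5, batch_size, 8))

# View the (n_layers * 2, batch, hidden) tensor as (n_layers, 2, batch, hidden)
# to separate layers from directions, as described in the PyTorch docs.
by_layer = encoder_hidden.view(n_layers, 2, batch_size, hidden_size)
assert torch.equal(encoder_hidden[:n_layers], by_layer[0])  # layer 0: (fwd, bwd)
assert torch.equal(encoder_hidden[n_layers:], by_layer[1])  # layer 1: (fwd, bwd)
```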

spro commented 6 years ago

This won't be a very satisfying answer, but I believe the reason is just that this is left over from a non-bidirectional encoder, and this slicing was a workaround to make it fit the decoder. The batched version is still very much a work in progress (despite the lack of recent progress).

Two better solutions would be:

zhongpeixiang commented 6 years ago

I have the same question, and intuitively I agree with pattern 1. But through my experiments I found the following:

https://discuss.pytorch.org/t/gru-output-and-h-n-relationship/12720

zhongpeixiang commented 6 years ago

@spro I think summing the forward and backward hidden states at the last position of the encoder would not be a good idea, because the backward hidden state at the last position contains little information about the sentence.
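For what it's worth, one common way to build the decoder's initial hidden state from a bidirectional encoder is to take the top layer's final forward and backward states from h_n (each of which has processed the whole sentence, unlike the backward component of the output at the last position) and either sum them or concatenate and project them. This is only an illustrative sketch, not necessarily one of the solutions @spro had in mind; the bridge layer and all names below are made up:

```python
import torch
import torch.nn as nn

n_layers, hidden_size, batch_size = 2, 8, 3
encoder = nn.GRU(8, hidden_size, num_layers=n_layers, bidirectional=True)
bridge = nn.Linear(2 * hidden_size, hidden_size)  # hypothetical projection layer

_, h_n = encoder(torch.randn(5, batch_size, 8))
h_n = h_n.view(n_layers, 2, batch_size, hidden_size)
fwd, bwd = h_n[-1, 0], h_n[-1, 1]  # top layer, forward and backward final states

# Option A: concatenate both directions and project down to hidden_size.
decoder_h0 = torch.tanh(bridge(torch.cat([fwd, bwd], dim=1)))
# Option B: simply sum the two directions (no extra parameters).
decoder_h0_sum = fwd + bwd
print(decoder_h0.shape, decoder_h0_sum.shape)  # both (batch_size, hidden_size)
```

If the decoder itself has several layers, this vector would still need to be repeated or otherwise expanded to the (n_layers, batch, hidden_size) shape the decoder expects as its initial hidden state.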