Open · rk2900 opened this issue 6 years ago
Are you comparing to the latest implementation, at https://github.com/spro/practical-pytorch/blob/master/seq2seq-translation/seq2seq-translation.ipynb ?
The implementation in the above link seems to be correct. It would be nice if this tutorial could be updated as it is still using a non-standard implementation of attention.
This caused me significant confusion when I first started playing with encoder-decoders.
@adamklec, do you know where that non-standard attention is from? Thanks.
Seconded. I spent hours trying to make sense of the [tutorial](https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html).
Moreover, I think I still need help understanding the code over in https://github.com/spro/practical-pytorch/blob/master/seq2seq-translation/seq2seq-translation.ipynb. You use `self.attn = GeneralAttn(hidden_size)` in the `BahdanauAttnDecoderRNN` class, but `GeneralAttn` is not defined anywhere. In [issue 23](https://github.com/spro/practical-pytorch/issues/23), you mention that this is supposed to be the `Attn` class. However, as far as I understand, none of `dot`, `general`, or `concat` corresponds to the scoring function in Bahdanau et al.; instead, they are the score functions from Luong et al. I'm not completely sure about this claim, though, so I would appreciate it if someone could clarify.
Hi @geraltofrivia, I ran into the same issues as you. I don't understand why @spro has not fixed this problem, as it is so obviously confusing.
This issue also mentions the same problem. @spro please fix it soon; this mistake has been in the tutorial for two years.
@soloice it's been four years now... I just spent a whole afternoon learning the wrong version of the code...
In the tutorial, I find that the "attention" mechanism is a fake attention, since the calculated attention weights have no relationship with the encoder output vectors. In the original attention paper, "Neural Machine Translation by Jointly Learning to Align and Translate," each attention energy `e_{ij}` is calculated by incorporating both the decoder hidden vector `s_{i-1}` and the encoder output `h_j`, and each weight `alpha_{ij}` is obtained by a softmax over the energies for all the encoder outputs `h_j`. All the discussion here is based on the equations on page 14 of that paper. But in this tutorial's implementation, the attention weights are calculated from the decoder information only, as expressed in the code below:
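The relevant lines from the tutorial's `AttnDecoderRNN.forward` look roughly like this (paraphrased from the tutorial, so treat it as a sketch rather than an exact copy); note that `attn_weights` is computed from the decoder's embedded input and hidden state only, and `encoder_outputs` is used only after the weights are already fixed:

```python
import torch
import torch.nn.functional as F

# Inside AttnDecoderRNN.forward(self, input, hidden, encoder_outputs):
embedded = self.embedding(input).view(1, 1, -1)
embedded = self.dropout(embedded)

# Attention weights depend only on the decoder input and hidden state;
# they do not look at encoder_outputs at all.
attn_weights = F.softmax(
    self.attn(torch.cat((embedded[0], hidden[0]), 1)), dim=1)

# encoder_outputs only enters here, after the weights are computed.
attn_applied = torch.bmm(attn_weights.unsqueeze(0),
                         encoder_outputs.unsqueeze(0))
```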
I think `attn_weights` should be calculated as in the paper, so I propose another implementation, sketched below. The original implementation in this tutorial performs even worse than a simple encoder-decoder without attention, while my implementation performs much better than either of the other two.
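A minimal sketch of what I mean (the class name and shapes are my own, assuming a single-layer GRU decoder with `hidden_size`-dimensional states): score each encoder output `h_j` against the previous decoder hidden state `s_{i-1}`, then softmax over those scores.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BahdanauAttention(nn.Module):
    """Additive attention: e_ij = v^T tanh(W_s s_{i-1} + W_h h_j)."""
    def __init__(self, hidden_size):
        super().__init__()
        self.W_s = nn.Linear(hidden_size, hidden_size)
        self.W_h = nn.Linear(hidden_size, hidden_size)
        self.v = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, decoder_hidden, encoder_outputs):
        # decoder_hidden: (1, hidden_size)        -- previous decoder state s_{i-1}
        # encoder_outputs: (max_length, hidden_size) -- all encoder outputs h_j
        energies = self.v(torch.tanh(
            self.W_s(decoder_hidden) + self.W_h(encoder_outputs)))  # (max_length, 1)
        attn_weights = F.softmax(energies.transpose(0, 1), dim=1)   # (1, max_length)
        context = attn_weights @ encoder_outputs                    # (1, hidden_size)
        return context, attn_weights
```

In the decoder's forward pass, the returned context vector would then be concatenated with the input embedding before the GRU step, so the attention weights genuinely depend on the encoder outputs, as in the paper.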
Your implementation:
My implementation:
The results above were obtained on a single 1080 GPU.