tensorflow / tensor2tensor

Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.

In 2d attention, do the dimensions of the encoder output and the decoder input have to agree? #434

Open anglil opened 6 years ago

anglil commented 6 years ago

In 1d attention, the dimensions do not have to agree, because the decoder input just consults the encoder output as memory, and the memory length can differ from the query length. However, 2d attention seems to require that dim(encoder_output) == dim(decoder_input). Why is that? Can't a decoder input block consult a chunk of memory of a different size when doing dot-product attention?
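For concreteness, here is a minimal NumPy sketch (shapes are made up for illustration, not taken from t2t) of 1d dot-product attention, where the query length and memory length are independent and only the depth has to match:

```python
import numpy as np

# Illustrative shapes: decoder (query) and encoder (memory) lengths differ;
# only the depth d must agree for the dot products to be defined.
len_q, len_m, d = 5, 9, 16
q = np.random.randn(len_q, d)   # decoder queries
k = np.random.randn(len_m, d)   # encoder keys
v = np.random.randn(len_m, d)   # encoder values

logits = q @ k.T / np.sqrt(d)                   # [len_q, len_m]
weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the memory axis
out = weights @ v                               # [len_q, d]
print(out.shape)  # (5, 16): one output per query, regardless of len_m
```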

Specifically, when doing enc-dec 2d attention with memory != None, the following line https://github.com/tensorflow/tensor2tensor/blob/5233253f2e5af8d342305e98a89697f20e2b354a/tensor2tensor/layers/common_attention.py#L2215 seems to do a 1d conv, which I think should be a 2d conv, when preparing the query, key, and value.

rsepassi commented 6 years ago

Conv1d with a stride of 1 is equivalent to a linear projection (XW), and most of the calls there have been replaced with tf.layers.dense. Does that address your question?
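For what it's worth, a quick NumPy check of that equivalence (my own sketch, assuming a kernel width of 1, which is what makes the conv a per-position projection):

```python
import numpy as np

batch, length, d_in, d_out = 2, 7, 8, 4
x = np.random.randn(batch, length, d_in)
w = np.random.randn(d_in, d_out)          # shared projection matrix

# A width-1 conv applies the same kernel independently at every position,
# which is exactly the per-position linear projection XW that
# tf.layers.dense(x, d_out) computes (up to the bias term, omitted here).
kernel = w.reshape(1, d_in, d_out)        # [kernel_width=1, d_in, d_out]
conv_out = np.einsum('blc,kcd->bld', x, kernel)
dense_out = x @ w
print(np.allclose(conv_out, dense_out))   # True
```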

anglil commented 6 years ago

Yes, I understand what conv1d is doing, but I'm less sure about what multihead conv2d is doing in the Transformer (it doesn't seem to be used by any module). From my understanding, conv2d decomposes the 4d input into blocks along dimensions 1 and 2, flattens each block, and does conv1d on it. Is that correct? If so, is there a way to do multihead conv2d without the encoder output and decoder input having equal sizes?
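To make my reading explicit, here is a rough NumPy sketch (shapes are illustrative, and this is my reconstruction, not the actual t2t code) of decomposing a 4d input into spatial blocks and flattening each one so 1d attention can run inside it:

```python
import numpy as np

batch, H, W, d = 1, 8, 8, 16
bh, bw = 4, 4                        # block sizes along dimensions 1 and 2
x = np.random.randn(batch, H, W, d)

# Cut the HxW grid into (H/bh)*(W/bw) blocks and flatten each block into a
# 1d sequence of length bh*bw, which 1d attention can then consume.
blocks = x.reshape(batch, H // bh, bh, W // bw, bw, d)
blocks = blocks.transpose(0, 1, 3, 2, 4, 5)             # group block indices
blocks = blocks.reshape(batch, (H // bh) * (W // bw), bh * bw, d)
print(blocks.shape)  # (1, 4, 16, 16): 4 blocks, each a length-16 sequence
```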

colmantse commented 6 years ago

Perhaps modify the transformer body and connect the encoder output and decoder input with a dense layer?
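Something along these lines, perhaps (a sketch with made-up shapes, using the TF1-era tf.layers.dense API that this codebase uses; the variable names are hypothetical):

```python
import tensorflow as tf

# Hypothetical shapes: the encoder output depth (512) differs from the
# decoder input depth (256).
encoder_output = tf.placeholder(tf.float32, [None, 64, 512])  # [batch, len_m, d_enc]
decoder_input = tf.placeholder(tf.float32, [None, 32, 256])   # [batch, len_q, d_dec]

# Project the encoder output to the decoder depth before enc-dec attention,
# so the dot products between queries and keys are well defined.
bridged_memory = tf.layers.dense(encoder_output, 256, name="enc_dec_bridge")
# bridged_memory: [batch, 64, 256]
```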