rizar / attention-lvcsr

End-to-End Attention-Based Large Vocabulary Speech Recognition

Convolutional Attention #10

Closed · ArneNx closed this issue 8 years ago

ArneNx commented 8 years ago

Hi,

In conv1d, you dimshuffle the input sequences like this:

sequences.dimshuffle('x', 'x', 0, 1)

This means that batch_size is now shifted to the third dimension. But theano.tensor.nnet.conv2d expects the batches in the first dimension and the image height in the third dimension. Is this done on purpose? If so, for what reason?
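For reference, a minimal shape-tracing sketch of the situation (my annotation, not code from the repo), assuming, as the question does, that dimension 0 of sequences is the batch; rizar questions that assumption below:

```python
import theano.tensor as tt

# theano.tensor.nnet.conv2d expects input of shape
# (batch, channels, rows, cols).
sequences = tt.matrix('sequences')             # assumed shape: (B, T)
images = sequences.dimshuffle('x', 'x', 0, 1)  # resulting shape: (1, 1, B, T)
# conv2d now sees a single one-channel "image" whose height axis is the
# batch index, which is what the question is pointing out.
```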

dmitriy-serdyuk commented 8 years ago

I don't know this part of the code very well, but it seems that convolving a 1d filter over B images of shape (1, img_size) is the same as convolving that filter over a single (B, img_size) "concatenated" image, where B is the batch size.

ArneNx commented 8 years ago

I don't believe this is true here. If I am not mistaken, conv1d hands the data to theano.tensor.nnet.conv2d, which then applies a 2D convolution over the images we created from the 1D data. We would therefore also convolve across the different elements of the batch, which I don't think is what we want here (it is not mentioned in the paper).

rizar commented 8 years ago

@ArneNx , did you check in the code that dimension 0 of the inputs to conv1d stands for the index in the batch, and not for time? I have already forgotten the implementation details, but this would explain your confusion.

dmitriy-serdyuk commented 8 years ago

@ArneNx , the filter always has size 1 along the batch dimension, so the convolution never mixes the elements of the batch.
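A quick numpy check of this point (my own sketch, not code from the repo; scipy's convolve2d stands in for Theano's conv2d): a kernel of height 1 applied to the stacked (B, T) "image" reproduces B independent 1D convolutions exactly, so the rows never mix.

```python
import numpy as np
from scipy.signal import convolve2d

B, T, K = 4, 20, 5
rng = np.random.default_rng(0)
batch = rng.standard_normal((B, T))   # B sequences of length T, stacked as rows
kernel_1d = rng.standard_normal(K)
kernel_2d = kernel_1d[None, :]        # shape (1, K): height 1 over the batch axis

# One 2D convolution over the whole stacked "image" ...
stacked = convolve2d(batch, kernel_2d, mode='valid')
# ... versus B separate 1D convolutions, one per sequence.
per_row = np.stack([np.convolve(row, kernel_1d, mode='valid') for row in batch])

assert np.allclose(stacked, per_row)  # identical: the batch rows never mix
```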

ArneNx commented 8 years ago

@dmitriy-serdyuk Yes, I think you are right about that. Thanks for the clarification!