sooftware / conformer

[Unofficial] PyTorch implementation of "Conformer: Convolution-augmented Transformer for Speech Recognition" (INTERSPEECH 2020)

question about the relative shift function #30

Closed ChanganVR closed 3 years ago

ChanganVR commented 3 years ago

Hi @sooftware, thank you for this repo. I have a question about the relative shift function: https://github.com/sooftware/conformer/blob/c76ff16d01b149ae518f3fe66a3dd89c9ecff2fc/conformer/attention.py#L105 I don't quite understand how this function works. Could you elaborate on it?

An example input and output of size 4 are shown below; the result does not really make sense to me.

Input:

tensor([[[[-0.9623, -0.3168, -1.1478, -1.3076],
          [ 0.5907, -0.0391, -0.1849, -0.6368],
          [-0.3956,  0.2142, -0.6415,  0.2196],
          [-0.8194, -0.2601,  1.1337, -0.3478]]]])

Output:

tensor([[[[-1.3076,  0.0000,  0.5907, -0.0391],
          [-0.1849, -0.6368,  0.0000, -0.3956],
          [ 0.2142, -0.6415,  0.2196,  0.0000],
          [-0.8194, -0.2601,  1.1337, -0.3478]]]])

Thank you!

sooftware commented 3 years ago

It is easiest to explain with an example. To perform relative attention, we want to relative-shift the attention score matrix as follows:

a00 a01 a02      a02  0  a10    
a10 a11 a12  =>  a11 a12  0
a20 a21 a22      a20 a21 a22

What _relative_shift does is simply achieve the transformation above in a vectorized way:

a00 a01 a02      0 a00 a01 a02       0  a00 a01      a02  0  a10    
a10 a11 a12  =>  0 a10 a11 a12  =>  a02  0  a10  =>  a11 a12  0  
a20 a21 a22      0 a20 a21 a22      a11 a12  0       a20 a21 a22   
                                    a20 a21 a22
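
In code, the diagram above is just a pad, a reshape, and a slice. Here is a minimal standalone sketch of that trick (names are illustrative; see the linked attention.py for the repo's actual implementation):

    import torch

    def relative_shift(pos_score: torch.Tensor) -> torch.Tensor:
        # pos_score: (batch, heads, seq_len1, seq_len2)
        batch_size, num_heads, seq_len1, seq_len2 = pos_score.size()
        # 1. Prepend a column of zeros: (..., seq_len1, seq_len2 + 1).
        zeros = pos_score.new_zeros(batch_size, num_heads, seq_len1, 1)
        padded = torch.cat([zeros, pos_score], dim=-1)
        # 2. Re-read the flat buffer with rows of length seq_len1. Because the
        #    padded rows are one element longer, each re-read row starts one
        #    step further left, producing the skew shown above.
        padded = padded.view(batch_size, num_heads, seq_len2 + 1, seq_len1)
        # 3. Drop the first row (the leading zero plus the start of the
        #    original first row) and restore the original shape.
        return padded[:, :, 1:].view_as(pos_score)

    # Reproduces the 4x4 example from the question above.
    x = torch.tensor([[[[-0.9623, -0.3168, -1.1478, -1.3076],
                        [ 0.5907, -0.0391, -0.1849, -0.6368],
                        [-0.3956,  0.2142, -0.6415,  0.2196],
                        [-0.8194, -0.2601,  1.1337, -0.3478]]]])
    print(relative_shift(x))

This also shows where the a10 in the first row of the diagram comes from: it is wrap-around from the flattened buffer. In the causal Transformer-XL setting, those wrapped entries land above the diagonal and are discarded by the attention mask.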
ChanganVR commented 3 years ago

@sooftware Thank you for your reply! I'm a bit more confused now. In the example you gave, why would a10 appear in the first row? That corresponds to how much the 1st element attends to the 0th element, right?

Also, do you mind explaining a bit more about why the attention is shifted this way? For example, a00 becomes a02 and a01 becomes 0; what is the intuition behind this transformation? If we denote the new matrix as B, then b00 should encode the relative position information between the first element and itself, right? Why would it be a02?

sooftware commented 3 years ago

I recommend reading this post

enhuiz commented 2 years ago

Hi, I found that the upper triangle of pos_score does not seem to be masked. Will this matter for performance?

Also, this relative positional encoding seems to work only with causal sequences. However, according to appendix B of the original paper,

i − j can only be integer from 0 to M + L − 1

which includes both directions.
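
For the causal reading, a hedged sketch of what masking those wrapped entries could look like (mask_wrapped is a hypothetical helper, not code from this repo or the paper):

    import torch

    def mask_wrapped(pos_score: torch.Tensor) -> torch.Tensor:
        # Assumption: pos_score is square and attention is causal, so every
        # entry above the main diagonal of the shifted matrix is a wrap-around
        # artifact of the reshape and can be zeroed out.
        seq_len = pos_score.size(-1)
        keep = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool,
                                     device=pos_score.device))
        return pos_score.masked_fill(~keep, 0.0)

For bidirectional attention, a mask alone would not be enough; the model would need relative position embeddings covering offsets in both directions.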

windysonic commented 2 years ago

@ChanganVR I got exactly the same confusion here. The post sooftware linked looks correct to me only when the upper-left triangle of QE_r is set to zero before the "skewing" process. Do you have any clue about it?

ChanganVR commented 2 years ago

@windysonic I'm sorry that I don't remember any details about this issue since it's been so long.