The positional encoding in the code is:
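Paraphrasing from my reading of `mem_transformer.py` (a self-contained sketch, not the verbatim source; `qlen` is L, `mlen` is M, and `klen = mlen + qlen`):

```python
import torch

qlen, mlen, d_model = 4, 6, 8    # example sizes: L, M, model width
klen = mlen + qlen               # key length M + L

# pos_seq only covers the klen non-negative distances [klen-1, ..., 1, 0]
pos_seq = torch.arange(klen - 1, -1, -1.0)

# standard sinusoidal encoding of those distances -> [klen, d_model]
inv_freq = 1.0 / (10000 ** (torch.arange(0.0, d_model, 2.0) / d_model))
sinusoid = torch.outer(pos_seq, inv_freq)
pos_emb = torch.cat([sinusoid.sin(), sinusoid.cos()], dim=-1)
```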
Then used to build the `r_head_k` tensor and finally used in the `BD` term:
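Continuing the sketch (the `rel_shift` below paraphrases the repo's `_rel_shift`; the query tensor and biases are random stand-ins):

```python
n_head, d_head, bsz = 2, 4, 3

# r_net: the linear projection applied to the positional encodings
r_net = torch.nn.Linear(d_model, n_head * d_head, bias=False)
r_head_k = r_net(pos_emb).view(klen, n_head, d_head)       # [klen, n_head, d_head]

# queries plus the global positional bias (v in the paper, r_r_bias in the code)
w_head_q = torch.randn(qlen, bsz, n_head, d_head)
r_r_bias = torch.zeros(n_head, d_head)
rr_head_q = w_head_q + r_r_bias

def rel_shift(x):
    # the left-shift trick: pad a zero column, reshape, drop the first row
    zero_pad = torch.zeros((x.size(0), 1, *x.size()[2:]), dtype=x.dtype)
    x_padded = torch.cat([zero_pad, x], dim=1)
    x_padded = x_padded.view(x.size(1) + 1, x.size(0), *x.size()[2:])
    return x_padded[1:].view_as(x)

# terms (b) + (d): every query against every relative encoding, then the shift
BD = torch.einsum('ibnd,jnd->ijbn', rr_head_q, r_head_k)   # [qlen, klen, bsz, n_head]
BD = rel_shift(BD)                                         # shift applied to a [qlen, klen, ...] tensor
```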
In the paper, the left-shift is done on an Lx(M+L) (i.e. `[qlen, qlen+klen]`) matrix, but here it is done on an LxM one, if I'm not mistaken. The upper-right relative positional encodings are thus erroneous, no?

I understand that this is not a problem here, since those entries are masked afterwards. But if we were not using the mask, shouldn't `pos_seq` be computed with `klen + qlen` positions, and the result truncated after the left-shift, before being added to the `AC` term?

Or did I miss something?
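Concretely, what I have in mind for the unmasked case is something like this (hypothetical and untested, building on the sketch above): extend `pos_seq` by `qlen` extra positions so that future keys get genuine negative-distance encodings, shift, then truncate:

```python
# hypothetical variant: klen + qlen distances [klen-1, ..., 0, -1, ..., -qlen],
# so the upper-right entries get real encodings instead of wrapped-around ones
pos_seq = torch.arange(klen - 1, -qlen - 1, -1.0)          # length klen + qlen
sinusoid = torch.outer(pos_seq, inv_freq)
pos_emb = torch.cat([sinusoid.sin(), sinusoid.cos()], dim=-1)

r_head_k = r_net(pos_emb).view(klen + qlen, n_head, d_head)
BD = torch.einsum('ibnd,jnd->ijbn', rr_head_q, r_head_k)   # [qlen, klen+qlen, bsz, n_head]
BD = rel_shift(BD)
BD = BD[:, :klen]   # truncate back to [qlen, klen, ...] before adding to AC
```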