kimiyoung / transformer-xl


Question: why is relative positional encoding computed with length M vs. L+M in the paper? #132

Open gdoras opened 3 years ago

gdoras commented 3 years ago

The positional encoding in the code is:

pos_seq = tf.range(klen - 1, -1, -1.0)
inv_freq = 1 / (10000 ** (tf.range(0, d_model, 2.0) / d_model))
pos_emb = positional_embedding(pos_seq, inv_freq)
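
For reference, positional_embedding builds the usual sinusoidal table from pos_seq and inv_freq. This is my rough reading of that helper in tf/model.py (body and signature inferred from the call above, so take it as a sketch):

import tensorflow as tf

def positional_embedding(pos_seq, inv_freq, bsz=None):
    # Outer product: one row per entry of pos_seq, one column per frequency.
    sinusoid_inp = tf.einsum('i,j->ij', pos_seq, inv_freq)
    # Concatenate the sin and cos halves -> [len(pos_seq), d_model].
    pos_emb = tf.concat([tf.sin(sinusoid_inp), tf.cos(sinusoid_inp)], -1)
    if bsz is not None:
        return tf.tile(pos_emb[None, :, :], [bsz, 1, 1])
    return pos_emb[None, :, :]

So row j of the table encodes relative distance pos_seq[j] = klen - 1 - j.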

This is then used to build the r_head_k tensor, which finally appears in the BD term:

BD = tf.einsum('ibnd,jnd->ijbn', rr_head_q, r_head_k)
BD = rel_shift(BD)
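
For reference, my understanding is that rel_shift here is the pad-reshape-slice trick from tf/model.py, roughly as follows (shapes assumed to be [qlen, klen, bsz, n_head]):

import tensorflow as tf

def rel_shift(x):
    # x: [qlen, klen, bsz, n_head], the BD term before shifting.
    x_size = tf.shape(x)
    # Prepend a zero column along the klen axis...
    x = tf.pad(x, [[0, 0], [1, 0], [0, 0], [0, 0]])
    # ...then reinterpret the padded buffer with the first two dimensions
    # swapped, drop the first row, and reshape back: row i ends up shifted
    # left by (qlen - 1 - i).
    x = tf.reshape(x, [x_size[1] + 1, x_size[0], x_size[2], x_size[3]])
    x = tf.slice(x, [1, 0, 0, 0], [-1, -1, -1, -1])
    x = tf.reshape(x, x_size)
    return x

The entries that wrap around into the upper-right corner during this shift are exactly the ones I am asking about.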

In the paper, the left-shift is done on an L x (M+L) matrix (i.e. [qlen, qlen + mlen]), but here it is done on an L x M matrix (i.e. [qlen, klen]) if I'm not mistaken. The upper-right relative positional encodings are thus erroneous, no?

I understand this is not a problem, as those entries are masked afterwards. But if we were not using the mask, shouldn't pos_seq be computed over klen + qlen positions, and then truncated after the left-shift before being added to the AC term?
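
To make the point concrete, here is a toy check with a 2-D NumPy re-implementation of the shift (my own simplification, with hypothetical names rel_shift_2d / before / after / wanted, qlen = 3, mlen = 2), filling BD with the relative distance each cell encodes:

import numpy as np

qlen, mlen = 3, 2
klen = qlen + mlen

def rel_shift_2d(x):
    # Same pad-reshape-slice trick as above, on a single [qlen, klen] matrix.
    q, k = x.shape
    x = np.pad(x, [(0, 0), (1, 0)])               # prepend a zero column
    return x.reshape(k + 1, q)[1:].reshape(q, k)  # drop first row, reshape back

# Before the shift, column j corresponds to pos_seq[j] = klen - 1 - j.
before = np.tile(np.arange(klen - 1, -1, -1), (qlen, 1))
after = rel_shift_2d(before)

# The distance each cell should encode: query i sits at absolute position
# mlen + i, key j sits at absolute position j.
wanted = (mlen + np.arange(qlen))[:, None] - np.arange(klen)[None, :]

print(after)   # [[2 1 0 0 4] [3 2 1 0 0] [4 3 2 1 0]]
print(wanted)  # [[2 1 0 -1 -2] [3 2 1 0 -1] [4 3 2 1 0]]

The two agree wherever wanted >= 0, i.e. under the causal mask; the upper-right cells hold wrapped or zero values instead of the negative distances one would need without the mask.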

Or did I miss something?