Open hugofloresgarcia opened 1 year ago
Hopefully you already found your answer, but for anyone else who lands on this issue: look at the equations for Q and K in Algorithm 1 of the xPos paper. The factor T is used to shrink the queries and grow the keys, a clever trick that makes the Q-K dot products smaller the farther apart Q and K are. It only works that way, though, if the keys come before the queries, as in a decoder.
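Here's a minimal sketch of where that asymmetry comes from. This is not the library's API: the function name `xpos_scale` and the single scalar decay base `zeta` are made up for illustration (the paper uses a per-dimension vector of decay factors plus a scale base), but the exponential-in-relative-position behaviour is the same.

```python
import torch

def xpos_scale(q: torch.Tensor, k: torch.Tensor, zeta: float = 0.98):
    """Scale queries/keys so their dot products decay with relative distance.

    q, k: tensors of shape (seq_len, dim), with the rotary rotation already applied.
    The query at position n is multiplied by zeta**n and the key at position m
    by zeta**(-m), so the product q_n . k_m picks up a factor zeta**(n - m).
    """
    pos = torch.arange(q.shape[0], dtype=q.dtype).unsqueeze(-1)  # (seq_len, 1)
    return q * zeta ** pos, k * zeta ** (-pos)
```

With `zeta < 1`, the factor `zeta**(n - m)` shrinks as the query position n moves further past the key position m (n > m), which is exactly the causal/decoder case. In a bidirectional encoder a key can also sit *after* the query (m > n), and the same factor grows exponentially with distance instead of decaying, which is presumably why the README restricts this to autoregressive transformers.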
@tfglynn, can xPos be used for a regular encoder-decoder model? If so, based on your answer above, I assume it should be added only to the decoder side and not to the encoder?
Hi! I'm interested in using the rotary embeddings with `x_pos=True` so my transformer is length-extrapolatable. However, I noticed the README mentions this technique only works with autoregressive transformers. Is there a reason why it wouldn't work with an encoder-only bidirectional transformer? Thanks!