Open hugofloresgarcia opened 1 year ago
Hopefully you already found your answer, but for anyone else who lands on this issue: look at the equations for Q and K in Algorithm 1 of the xPos paper. The factor T is used to shrink the queries and grow the keys, a clever trick that makes the Q-K dot products smaller the farther apart Q and K are. It only works that way, though, if the keys come before the queries, as in a decoder.
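Here's a minimal sketch of where that asymmetry comes from. This is not the library's API: the function name `xpos_scale` and the single scalar decay base `zeta` are made up for illustration (the paper uses a per-dimension vector of decay factors plus a scale base), but the exponential-in-relative-position behaviour is the same.

```python
import torch

def xpos_scale(q: torch.Tensor, k: torch.Tensor, zeta: float = 0.98):
    """Scale queries/keys so their dot products decay with relative distance.

    q, k: tensors of shape (seq_len, dim), with the rotary rotation already applied.
    The query at position n is multiplied by zeta**n and the key at position m
    by zeta**(-m), so the product q_n . k_m picks up a factor zeta**(n - m).
    """
    pos = torch.arange(q.shape[0], dtype=q.dtype).unsqueeze(-1)  # (seq_len, 1)
    return q * zeta ** pos, k * zeta ** (-pos)
```

With `zeta < 1`, the factor `zeta**(n - m)` shrinks as the query position n moves further past the key position m (n > m), which is exactly the causal/decoder case. In a bidirectional encoder a key can also sit *after* the query (m > n), and the same factor grows exponentially with distance instead of decaying, which is presumably why the README restricts this to autoregressive transformers.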
@tfglynn, can xPos be used for a regular encoder-decoder model? If so, based on your answer above, I assume it should be added only to the decoder side and not to the encoder?
Hi! I'm interested in using the rotary embeddings with `x_pos=True` so my transformer is length-extrapolatable. However, I noticed the README mentions this technique only works with autoregressive transformers. Is there a reason why it wouldn't work with an encoder-only bidirectional transformer? Thanks!