lucidrains / rotary-embedding-torch

Implementation of Rotary Embeddings, from the RoFormer paper, in PyTorch

RoPE embeddings #30

Open PRamoneda opened 2 months ago

PRamoneda commented 2 months ago

My conclusions about changing the positional encoding are that NoPE and ALiBi do not work well for encoder-only models because, unlike in decoder-only models, they do not capture position at all (they are effectively permutation equivariant). However, RoPE (Rotary Position Embedding) seems promising: although it cannot extrapolate directly, it can be adapted to longer sequences with only about 1000 training steps. Even if it doesn't work perfectly, it gives us relative positional encoding (we can see it as an improvement over sinusoidal positional encoding), which I believe makes a lot of sense for music. This is likely why the authors of Transformer++ used it. Additionally, RoPE seems to accelerate convergence and improve model stability, which is why even well-known decoder-only LLMs (e.g. LLaMA) use it; ALiBi can extrapolate, but it is very unstable during training.
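For reference, here is a minimal sketch of the core rotation following the standard RoFormer formulation (not necessarily this repo's exact code): each pair of feature dimensions of a query/key is rotated by an angle proportional to the token position, so the q·k dot product ends up depending only on the relative position.

```python
# Minimal RoPE rotation sketch (RoFormer-style); assumes an even feature dimension.
import torch

def rope_rotate(x, base=10000.0):
    # x: (..., seq_len, dim) with dim even
    *_, seq_len, dim = x.shape
    # per-pair rotation frequencies, same geometric schedule as sinusoidal encodings
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    t = torch.arange(seq_len).float()
    angles = torch.einsum('i,j->ij', t, inv_freq)        # (seq_len, dim // 2)
    cos = angles.cos().repeat_interleave(2, dim=-1)      # (seq_len, dim)
    sin = angles.sin().repeat_interleave(2, dim=-1)
    # rotate each (even, odd) feature pair: (x1, x2) -> (x1*cos - x2*sin, x2*cos + x1*sin)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    x_rot = torch.stack((-x2, x1), dim=-1).reshape_as(x)
    return x * cos + x_rot * sin
```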

We can borrow the code from here: https://github.com/lucidrains/rotary-embedding-torch/blob/main/rotary_embedding_torch/rotary_embedding_torch.py
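A minimal usage sketch of the linked module, based on its README (`RotaryEmbedding` and `rotate_queries_or_keys`); exact arguments may differ across versions:

```python
import torch
from rotary_embedding_torch import RotaryEmbedding

# dim here is the number of rotary feature dims, typically head_dim or head_dim // 2
rotary_emb = RotaryEmbedding(dim=32)

# mock queries / keys shaped (batch, heads, seq_len, head_dim)
q = torch.randn(1, 8, 1024, 64)
k = torch.randn(1, 8, 1024, 64)

# rotate after splitting heads, before the attention dot product / softmax
q = rotary_emb.rotate_queries_or_keys(q)
k = rotary_emb.rotate_queries_or_keys(k)
# ...then compute attention with the rotated q and k as usual
```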

VarunGumma commented 2 weeks ago

Here is a relevant paper we wrote recently on the same topic: https://arxiv.org/abs/2408.11382