lucidrains / rotary-embedding-torch

Implementation of Rotary Embeddings, from the RoFormer paper, in PyTorch

RoPE embeddings #30

Open PRamoneda opened 2 months ago

PRamoneda commented 2 months ago

My conclusions about changing the positional encoding are that NoPE and ALiBi do not work well for encoder-only models because, unlike in decoder-only models, they do not capture position at all (they are effectively permutation equivariant). However, RoPE (Rotary Position Embedding) seems promising: although it cannot extrapolate directly, it can be adapted to longer sequences with only about 1000 training steps. Even if it doesn't work perfectly, it gives us relative positional encoding (we can see it as an improvement over sinusoidal positional encoding), which I believe makes a lot of sense for music. This is likely why the authors of Transformer++ used it. Additionally, RoPE seems to accelerate convergence and improve model stability, which is why even well-known decoder-only LLMs (e.g. LLaMA) use it; ALiBi can extrapolate, but it is very unstable during training.
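For reference, here is a minimal sketch of the core rotation following the standard RoFormer formulation (not necessarily this repo's exact code): each pair of feature dimensions of a query/key is rotated by an angle proportional to the token position, so the q·k dot product ends up depending only on the relative position.

```python
# Minimal RoPE rotation sketch (RoFormer-style); assumes an even feature dimension.
import torch

def rope_rotate(x, base=10000.0):
    # x: (..., seq_len, dim) with dim even
    *_, seq_len, dim = x.shape
    # per-pair rotation frequencies, same geometric schedule as sinusoidal encodings
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    t = torch.arange(seq_len).float()
    angles = torch.einsum('i,j->ij', t, inv_freq)        # (seq_len, dim // 2)
    cos = angles.cos().repeat_interleave(2, dim=-1)      # (seq_len, dim)
    sin = angles.sin().repeat_interleave(2, dim=-1)
    # rotate each (even, odd) feature pair: (x1, x2) -> (x1*cos - x2*sin, x2*cos + x1*sin)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    x_rot = torch.stack((-x2, x1), dim=-1).reshape_as(x)
    return x * cos + x_rot * sin
```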

We can borrow the code from here: https://github.com/lucidrains/rotary-embedding-torch/blob/main/rotary_embedding_torch/rotary_embedding_torch.py
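A minimal usage sketch of the linked module, based on its README (`RotaryEmbedding` and `rotate_queries_or_keys`); exact arguments may differ across versions:

```python
import torch
from rotary_embedding_torch import RotaryEmbedding

# dim here is the number of rotary feature dims, typically head_dim or head_dim // 2
rotary_emb = RotaryEmbedding(dim=32)

# mock queries / keys shaped (batch, heads, seq_len, head_dim)
q = torch.randn(1, 8, 1024, 64)
k = torch.randn(1, 8, 1024, 64)

# rotate after splitting heads, before the attention dot product / softmax
q = rotary_emb.rotate_queries_or_keys(q)
k = rotary_emb.rotate_queries_or_keys(k)
# ...then compute attention with the rotated q and k as usual
```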

VarunGumma commented 2 weeks ago

Here is a relevant paper we wrote recently on the same topic: https://arxiv.org/abs/2408.11382