lucidrains / x-transformers

A simple but complete full-attention transformer with a set of promising experimental features from various papers

Question: rotary embeddings and bad length extrapolation #241

Closed pfeatherstone closed 2 months ago

pfeatherstone commented 5 months ago

In my tests, I've found that rotary embeddings don't length extrapolate well. To be fair, you do mention this in your README. You suggest setting `rotary_xpos = True`, which should fix this, but then attention effectively becomes local.

Is this the best way to get good length extrapolation in a transformer, or is there a better positional embedding that doesn't suffer from this, yet still works with flash attention and key-value mems?

I'll try rotary_xpos, but I don't like the idea of shrinking the effective context length from something potentially very large down to something small.

Thank you
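
To make the comparison concrete, here is a hedged sketch of the two configurations being discussed, using the kwargs documented in the x-transformers README (`rotary_pos_emb`, `rotary_xpos`, `rotary_xpos_scale_base`, `use_abs_pos_emb`). The model sizes, vocab size, and sequence lengths below are illustrative only, and defaults may differ between versions.

```python
import torch
from x_transformers import TransformerWrapper, Decoder

# plain rotary embeddings: works well inside the training length, degrades beyond it
rotary_model = TransformerWrapper(
    num_tokens = 256,
    max_seq_len = 1024,
    use_abs_pos_emb = False,        # rely on rotary only, so longer inputs are accepted
    attn_layers = Decoder(
        dim = 512,
        depth = 6,
        heads = 8,
        rotary_pos_emb = True
    )
)

# rotary + xpos: decays attention with distance, so it extrapolates but trends local
xpos_model = TransformerWrapper(
    num_tokens = 256,
    max_seq_len = 1024,
    use_abs_pos_emb = False,
    attn_layers = Decoder(
        dim = 512,
        depth = 6,
        heads = 8,
        rotary_xpos = True,
        rotary_xpos_scale_base = 512   # larger base -> slower decay, i.e. less local
    )
)

x = torch.randint(0, 256, (1, 2048))   # twice the training length, to probe extrapolation
print(rotary_model(x).shape, xpos_model(x).shape)   # both (1, 2048, 256)
```

With xpos, `rotary_xpos_scale_base` is the knob that trades locality against extrapolation: a larger base decays attention more slowly with distance, keeping the receptive field wider, while a smaller base makes attention more local but more robust to unseen lengths.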

pfeatherstone commented 5 months ago

Other candidates are ALiBi or no positional embeddings at all. For the latter, does it only work if you train on a range of sequence lengths so the model learns to length extrapolate?
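
For reference, a minimal sketch of those two alternatives, again following kwargs from the x-transformers README (`alibi_pos_bias`, `alibi_num_heads`, `use_abs_pos_emb`); the hyperparameters are made up for illustration and may not match the author's setup.

```python
import torch
from x_transformers import TransformerWrapper, Decoder

# ALiBi: a per-head linear distance penalty on the attention logits, known to extrapolate
alibi_model = TransformerWrapper(
    num_tokens = 256,
    max_seq_len = 1024,
    use_abs_pos_emb = False,
    attn_layers = Decoder(
        dim = 512,
        depth = 6,
        heads = 8,
        alibi_pos_bias = True,
        alibi_num_heads = 4      # bias only half the heads, so the rest can still attend far
    )
)

# no positional embeddings at all: ordering comes only from the causal mask
nope_model = TransformerWrapper(
    num_tokens = 256,
    max_seq_len = 1024,
    use_abs_pos_emb = False,
    attn_layers = Decoder(
        dim = 512,
        depth = 6,
        heads = 8
    )
)

x = torch.randint(0, 256, (1, 2048))   # evaluate beyond the training length
print(alibi_model(x).shape, nope_model(x).shape)   # both (1, 2048, 256)
```

Note that ALiBi is an additive bias on the attention logits, so whether it composes with flash attention depends on the attention backend accepting such a bias; the no-embedding variant gets ordering purely from the causal mask, which is why training over varied sequence lengths comes up as the way to make it extrapolate.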