lucidrains / x-transformers

A simple but complete full-attention transformer with a set of promising experimental features from various papers

Question: rotary embeddings and bad length extrapolation #241

Closed pfeatherstone closed 2 months ago

pfeatherstone commented 5 months ago

In my tests, I've found that rotary embeddings don't length extrapolate well. To be fair, you do mention this in your README. You suggest setting `rotary_xpos = True`, which should fix this, but then attention effectively becomes local.

Is this the best way to get good length extrapolation in a transformer, or is there a better positional embedding that doesn't suffer from this, yet still works with flash attention and key-value mems?

I'll try rotary_xpos, but I don't like the idea of shrinking the effective context length from something potentially very large down to something small.

Thank you
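
To make the comparison concrete, here is a hedged sketch of the two configurations being discussed, using the kwargs documented in the x-transformers README (`rotary_pos_emb`, `rotary_xpos`, `rotary_xpos_scale_base`, `use_abs_pos_emb`). The model sizes, vocab size, and sequence lengths below are illustrative only, and defaults may differ between versions.

```python
import torch
from x_transformers import TransformerWrapper, Decoder

# plain rotary embeddings: works well inside the training length, degrades beyond it
rotary_model = TransformerWrapper(
    num_tokens = 256,
    max_seq_len = 1024,
    use_abs_pos_emb = False,        # rely on rotary only, so longer inputs are accepted
    attn_layers = Decoder(
        dim = 512,
        depth = 6,
        heads = 8,
        rotary_pos_emb = True
    )
)

# rotary + xpos: decays attention with distance, so it extrapolates but trends local
xpos_model = TransformerWrapper(
    num_tokens = 256,
    max_seq_len = 1024,
    use_abs_pos_emb = False,
    attn_layers = Decoder(
        dim = 512,
        depth = 6,
        heads = 8,
        rotary_xpos = True,
        rotary_xpos_scale_base = 512   # larger base -> slower decay, i.e. less local
    )
)

x = torch.randint(0, 256, (1, 2048))   # twice the training length, to probe extrapolation
print(rotary_model(x).shape, xpos_model(x).shape)   # both (1, 2048, 256)
```

With xpos, `rotary_xpos_scale_base` is the knob that trades locality against extrapolation: a larger base decays attention more slowly with distance, keeping the receptive field wider, while a smaller base makes attention more local but more robust to unseen lengths.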

pfeatherstone commented 5 months ago

Other candidates are ALiBi or no positional embeddings at all. For the latter, does it only work if you train on a range of sequence lengths so the model learns to length extrapolate?
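
For reference, a minimal sketch of those two alternatives, again following kwargs from the x-transformers README (`alibi_pos_bias`, `alibi_num_heads`, `use_abs_pos_emb`); the hyperparameters are made up for illustration and may not match the author's setup.

```python
import torch
from x_transformers import TransformerWrapper, Decoder

# ALiBi: a per-head linear distance penalty on the attention logits, known to extrapolate
alibi_model = TransformerWrapper(
    num_tokens = 256,
    max_seq_len = 1024,
    use_abs_pos_emb = False,
    attn_layers = Decoder(
        dim = 512,
        depth = 6,
        heads = 8,
        alibi_pos_bias = True,
        alibi_num_heads = 4      # bias only half the heads, so the rest can still attend far
    )
)

# no positional embeddings at all: ordering comes only from the causal mask
nope_model = TransformerWrapper(
    num_tokens = 256,
    max_seq_len = 1024,
    use_abs_pos_emb = False,
    attn_layers = Decoder(
        dim = 512,
        depth = 6,
        heads = 8
    )
)

x = torch.randint(0, 256, (1, 2048))   # evaluate beyond the training length
print(alibi_model(x).shape, nope_model(x).shape)   # both (1, 2048, 256)
```

Note that ALiBi is an additive bias on the attention logits, so whether it composes with flash attention depends on the attention backend accepting such a bias; the no-embedding variant gets ordering purely from the causal mask, which is why training over varied sequence lengths comes up as the way to make it extrapolate.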