lucidrains / x-transformers

A simple but complete full-attention transformer with a set of promising experimental features from various papers

RoPE inconsistency (2-dim subspaces choice) #250

Closed · gordicaleksa closed this issue 2 months ago

gordicaleksa commented 2 months ago

Hi Phil!

I noticed your x-transformers RoPE implementation is different from your standalone rotary-embedding-torch implementation.

Example: assume the vector we're rotating has coordinates [x1, x2, ..., x16, x17, ..., x32].

- x-transformers rotates the pairs (x1, x17), (x2, x18), ..., i.e. it pairs each coordinate with the one half the dimension away.
- rotary-embedding-torch rotates consecutive pairs (x1, x2), (x3, x4), ..., which is how it's defined in the RoFormer paper as well.
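
For concreteness, here's a minimal self-contained sketch of the two conventions as I understand them (not the actual library code; `angles` is assumed to hold the d/2 per-pair rotation angles for a given position):

```python
import torch

def rope_half_split(x, angles):
    # x-transformers style: pair x_i with x_{i + d/2} ("rotate half")
    cos = torch.cat((angles.cos(), angles.cos()), dim=-1)
    sin = torch.cat((angles.sin(), angles.sin()), dim=-1)
    x1, x2 = x.chunk(2, dim=-1)                    # [x1..x16], [x17..x32]
    rotated = torch.cat((-x2, x1), dim=-1)
    return x * cos + rotated * sin

def rope_interleaved(x, angles):
    # rotary-embedding-torch / RoFormer style: pair consecutive coordinates (x1, x2), (x3, x4), ...
    cos = angles.cos().repeat_interleave(2, dim=-1)
    sin = angles.sin().repeat_interleave(2, dim=-1)
    x1, x2 = x[..., 0::2], x[..., 1::2]            # even / odd coordinates
    rotated = torch.stack((-x2, x1), dim=-1).flatten(-2)
    return x * cos + rotated * sin
```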

It looks to me like these two are equivalent, since the choice of 2-dim subspaces can be made either way, but I was puzzled because the latter is much more intuitive, and wanted to check whether you have any additional insight? :)

Additionally, I noticed that you only rotate half of the vector while the other half is left unchanged. Any references for why you do that?
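
To make sure I'm reading that correctly, here's a rough sketch of what I mean by partial rotation (reusing `rope_half_split` from the sketch above; `rot_dim` is just a hypothetical name for the number of rotated coordinates, e.g. half the head dimension):

```python
import torch

def partial_rope(x, angles, rot_dim):
    # only the first rot_dim coordinates get rotated,
    # the remaining ones pass through unchanged
    # (angles is expected to hold rot_dim / 2 per-pair rotation angles)
    x_rot, x_pass = x[..., :rot_dim], x[..., rot_dim:]
    x_rot = rope_half_split(x_rot, angles)
    return torch.cat((x_rot, x_pass), dim=-1)
```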