ofirpress / attention_with_linear_biases

Code for the ALiBi method for transformer language models (ICLR 2022)
MIT License

Could we apply ALiBi with rotary position embedding? #13

Closed by xiaoxiawu-microsoft 1 year ago

xiaoxiawu-microsoft commented 1 year ago

Hi Ofir (@ofirpress),

Thanks for the great work. ALiBi solves a lot of problems for us :) I'm curious whether you've tried combining ALiBi with any other position embedding techniques. It seems the two wouldn't conflict with each other: a position embedding is applied before the input is fed to the layers, while ALiBi is added just before the softmax attention scores are computed.
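For concreteness, here is a minimal PyTorch sketch (not code from this repo; the function names are illustrative and the slope formula assumes a power-of-two head count) of where ALiBi intervenes: a fixed, head-specific linear bias is added to the query-key scores right before the softmax, with no change to the token embeddings at the input.

```python
import math
import torch

def alibi_slopes(n_heads: int) -> torch.Tensor:
    # Head-specific slopes: a geometric sequence 2^(-8/n), 2^(-16/n), ...
    # (simplified here to the case where n_heads is a power of two).
    start = 2 ** (-8.0 / n_heads)
    return torch.tensor([start ** (i + 1) for i in range(n_heads)])

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    # bias[h, i, j] = -slope[h] * (i - j): zero on the diagonal, increasingly
    # negative the further key j lies behind query i.
    slopes = alibi_slopes(n_heads)                       # (H,)
    pos = torch.arange(seq_len)
    distance = pos[None, :] - pos[:, None]               # (L, L), equals j - i
    return slopes[:, None, None] * distance[None, :, :]  # (H, L, L)

def attention_with_alibi(q, k, v):
    # q, k, v: (batch, heads, seq, head_dim)
    n_heads, seq_len = q.size(1), q.size(2)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))   # (B, H, L, L)
    scores = scores + alibi_bias(n_heads, seq_len)             # ALiBi: added right before the softmax
    causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```

Any input-level position embedding would instead act on the token representations before q, k, v are formed, which is the sense in which the two interventions happen at different points in the forward pass.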

Thank you again for the great work. Best, Xiaoxia

ofirpress commented 1 year ago

You can try this, but I think they would probably conflict with each other. Models trained with absolute position embeddings overfit to those positions, and RoPE uses absolute position embeddings. You can read more about my thoughts on this here: https://ofir.io/The-Use-Case-for-Relative-Position-Embeddings/
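For reference, a minimal sketch of the standard RoPE rotation (the rope_rotate helper is hypothetical, uses the interleaved-pair convention, and is not code from this repo). It shows the sense in which RoPE is built on absolute positions: the rotation applied to each query/key channel pair has angle m * theta_d, where m is the token's absolute position index.

```python
import torch

def rope_rotate(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    # x: (batch, heads, seq, head_dim), head_dim even. Each channel pair is
    # rotated by angle m * theta_d, where m is the token's *absolute* position
    # index, so the vectors fed into the dot product depend on where the token
    # sits in the sequence.
    *_, seq_len, dim = x.shape
    theta = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)  # (dim/2,)
    m = torch.arange(seq_len, dtype=torch.float32)                         # absolute positions 0..L-1
    angles = m[:, None] * theta[None, :]                                   # (L, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]                                    # split channels into pairs
    out = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return out.flatten(-2)                                                 # back to (..., seq, head_dim)
```

The resulting attention scores depend on relative offsets, but during training the model only ever sees rotations for the absolute indices up to the training length, which is one way to read the overfitting concern described above.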