lucidrains / performer-pytorch

An implementation of Performer, a linear attention-based transformer, in Pytorch

Relative Positional Encoding for Linear Attention Models. #72

Closed Vbansal21 closed 2 years ago

Vbansal21 commented 3 years ago

There is a stochastic positional encoding method introduced for linear attention models like Performer: https://github.com/aliutkus/spe/tree/main/src/pytorch https://arxiv.org/pdf/2105.08399.pdf, and another method was introduced recently: https://arxiv.org/pdf/2102.07680.pdf https://github.com/ExpectationMax/Translational-Equivariant-Performers

Maybe consider adding those to this Performer architecture for encoder-decoder models.
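
For context, the core idea behind stochastic positional encoding is to replace a deterministic relative positional attention pattern with random position features whose expected dot product reproduces it, which keeps the attention linear. The snippet below is only a toy illustration of that idea using random Fourier features; it is not the paper's SineSPE/ConvSPE construction, and the function names, kernel choice, and hyperparameters here are illustrative assumptions.

```python
import math
import torch

# Toy illustration: build random position features q_bar, k_bar so that
# E[ q_bar[i] . k_bar[j] ] depends only on (i - j), i.e. a relative
# (translation-invariant) positional kernel is realized in expectation.
# This mirrors the spirit of stochastic positional encoding (SPE),
# but it is NOT the SineSPE/ConvSPE construction from the paper.

def random_fourier_position_features(seq_len, num_feats, lengthscale=16.0):
    positions = torch.arange(seq_len, dtype=torch.float32)            # (L,)
    # frequencies drawn from the spectral density of an RBF kernel
    omegas = torch.randn(num_feats) / lengthscale                     # (R,)
    phases = 2 * math.pi * torch.rand(num_feats)                      # (R,)
    # classic random Fourier features: sqrt(2/R) * cos(omega * pos + phase)
    feats = torch.cos(positions[:, None] * omegas[None, :] + phases[None, :])
    return feats * math.sqrt(2.0 / num_feats)                         # (L, R)

seq_len, num_feats = 128, 4096
q_bar = random_fourier_position_features(seq_len, num_feats)
k_bar = q_bar  # the same noise realization is shared by queries and keys

# Monte Carlo estimate of the positional kernel; rows are approximately
# shifted copies of one another, i.e. the pattern depends only on i - j.
pos_kernel = q_bar @ k_bar.T
print(pos_kernel[0, :5])
print(pos_kernel[10, 10:15])  # should roughly match the row above
```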

lucidrains commented 3 years ago

@Vbansal21 Hi! I've looked at both of those papers, but I do not think they will come out ahead of rotary embeddings https://arxiv.org/abs/2104.09864 https://blog.eleuther.ai/rotary-embeddings/ (already built into the repository and turned on by default)

I welcome being proven wrong: if you can do some separate runs across the three and show that one convincingly wins against rotary, I will be happy to spend time integrating it
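
For anyone wanting to run that comparison, a minimal rotary baseline might look like the sketch below. It assumes the `rotary_position_emb` keyword on `PerformerLM` as described in the repository's README; verify the exact argument names against the version you have installed, and note that the hyperparameters here are arbitrary.

```python
import torch
from performer_pytorch import PerformerLM

# Rotary baseline sketch; hyperparameters are arbitrary, and the
# `rotary_position_emb` keyword is assumed from the repo's README --
# check it against your installed version.
model = PerformerLM(
    num_tokens = 20000,
    max_seq_len = 2048,
    dim = 512,
    depth = 6,
    heads = 8,
    causal = True,
    rotary_position_emb = True   # rotary relative positions for linear attention
)

x = torch.randint(0, 20000, (1, 2048))
logits = model(x)                # expected shape: (1, 2048, 20000)
print(logits.shape)
```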

Vbansal21 commented 3 years ago

I ran training on the three separately, but my models failed to converge on any of them 😂. Will try again soon.

Vbansal21 commented 2 years ago

Closing the issue. Results with rotary were slightly better than with these positional encodings for linear attention.