Closed Vbansal21 closed 2 years ago
@Vbansal21 Hi! I've looked at both of those papers, but I do not think they will come out ahead against rotary embeddings https://arxiv.org/abs/2104.09864 https://blog.eleuther.ai/rotary-embeddings/ (already built into the repository and turned on by default)
You're welcome to prove me wrong: if you can do some separate runs across the three and show me that one convincingly beats rotary, I'll be happy to spend the time integrating it
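For context, the core of rotary embeddings is just a position-dependent 2D rotation applied to each channel pair of the queries and keys before attention, which makes the q·k dot product depend only on relative position. A minimal NumPy sketch (illustrative only, not the repo's actual implementation; shapes and the `base` parameter follow the RoFormer convention):

```python
import numpy as np

def rotary_embedding(x, base=10000):
    # x: (seq_len, dim) with dim even. Rotate each pair of channels
    # (x[2i], x[2i+1]) by angle pos * base**(-2i/dim).
    seq_len, dim = x.shape
    pos = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)   # (dim/2,)
    angles = pos * inv_freq                            # (seq_len, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Because each rotation is orthogonal, norms are preserved, and (R(m)q)·(R(n)k) = q·R(n−m)k, i.e. the attention score between positions m and n only sees m−n.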
I ran training on all three separately, but my models failed to converge on any of them 😂. Will try again soon.
Closing the issue. Rotary embeddings came out slightly better than these linear positional encodings.
There is a stochastic positional encoding (SPE) method introduced for linear-attention models like Performer: https://github.com/aliutkus/spe/tree/main/src/pytorch https://arxiv.org/pdf/2105.08399.pdf and one more method was introduced recently: https://arxiv.org/pdf/2102.07680.pdf https://github.com/ExpectationMax/Translational-Equivariant-Performers
Maybe consider adding those to this Performer architecture for encoder-decoder models.
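The basic idea behind SPE is to replace a deterministic relative positional encoding with random codes whose *expected* cross-correlation depends only on m − n, which keeps the kernelized (linear) attention structure intact. A toy Monte Carlo sketch of that principle, using a single fixed frequency and a random phase (heavily simplified; the paper's actual sinSPE uses gated, learned multi-frequency codes):

```python
import numpy as np

def stochastic_pe(seq_len, num_realizations, freq=0.3, rng=None):
    # Draw R random phases; each realization is a cosine wave sampled
    # at every position with that phase. Averaging over realizations,
    # E[qbar[m] * kbar[n]] = cos(freq * (m - n)): the expected
    # cross-correlation depends only on the relative position m - n.
    rng = rng or np.random.default_rng()
    phases = rng.uniform(0, 2 * np.pi, size=num_realizations)  # (R,)
    pos = np.arange(seq_len)[:, None]                          # (L, 1)
    qbar = np.sqrt(2) * np.cos(freq * pos + phases)            # (L, R)
    kbar = np.sqrt(2) * np.cos(freq * pos + phases)            # (L, R)
    return qbar, kbar
```

The point of the stochastic formulation is that `qbar`/`kbar` can modulate the queries and keys of a Performer directly, so the relative-position bias is realized in expectation without ever materializing the L×L attention matrix.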