lucidrains / performer-pytorch

An implementation of Performer, a linear attention-based transformer, in Pytorch
MIT License

Causal performer slower than causal regular attention #66

Open JamesDeAntonis opened 3 years ago

JamesDeAntonis commented 3 years ago

For some reason, our causal performer runs slower than causal regular attention. You observe that performer is faster even in the causal case, right? Curious how to troubleshoot this (we don't use the full PerformerLM, just CrossAttention and SelfAttention, not sure if that's relevant).
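Here's a minimal timing sketch of the kind of comparison involved (the dims, sequence length, and the `nn.MultiheadAttention` baseline are assumptions standing in for "regular attention", not the exact setup from this issue):

```python
import time
import torch
from torch import nn
from performer_pytorch import SelfAttention

# assumed benchmark settings, purely illustrative
dim, heads, seq_len, batch = 512, 8, 1024, 4
device = 'cuda' if torch.cuda.is_available() else 'cpu'

performer_attn = SelfAttention(dim = dim, heads = heads, causal = True).to(device)
regular_attn   = nn.MultiheadAttention(dim, heads, batch_first = True).to(device)

x = torch.randn(batch, seq_len, dim, device = device)
# boolean mask: True marks positions that may NOT be attended to (future tokens)
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype = torch.bool, device = device), diagonal = 1)

def bench(fn, iters = 20):
    # warmup, then average wall-clock over repeated forward passes
    for _ in range(3):
        fn()
    if device == 'cuda':
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        fn()
    if device == 'cuda':
        torch.cuda.synchronize()
    return (time.time() - start) / iters

t_performer = bench(lambda: performer_attn(x))
t_regular   = bench(lambda: regular_attn(x, x, x, attn_mask = causal_mask)[0])
print(f'causal performer: {t_performer:.4f}s  causal regular: {t_regular:.4f}s')
```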

lucidrains commented 3 years ago

@JamesDeAntonis do you mean on training or eval?

JamesDeAntonis commented 3 years ago

We observed it in both. I heard from here that the reason is a lack of caching at inference time? Are you still planning to implement it?

lucidrains commented 3 years ago

@JamesDeAntonis training is as fast as it can be - basically, if you are training at less than 2048 context length, you should expect it to be the same or slower
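A rough back-of-the-envelope on why the crossover sits around a couple thousand tokens (the per-head dim and feature count below are assumptions, roughly in line with the library's defaults): in pure FLOPs the linear form already wins at short lengths, but the causal prefix-sum in FAVOR+ has much worse constants than one fused softmax-attention matmul, so wall-clock only catches up at longer contexts.

```python
import math

# crude per-head FLOP comparison; constants and kernel overheads ignored
d = 64                            # assumed per-head dimension
m = int(d * math.log(d))          # roughly the default number of random features

for n in (512, 1024, 2048, 4096, 8192):
    softmax_flops   = n * n * d   # QK^T plus attention-weighted sum, O(n^2 d)
    performer_flops = n * m * d   # feature projection plus prefix sums, O(n m d)
    print(f'n = {n:5d}  softmax/performer FLOP ratio ~ {softmax_flops / performer_flops:.2f}')
```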

eval should be really fast though, and that's something i could work on. it should be as fast as an RNN in the end. i'll take a look at it later this week!
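For context on the "as fast as an RNN" point: causal linear attention can summarize the entire prefix with two running sums, so a cached decoding step costs O(1) in sequence length. A toy sketch of that state update (not performer-pytorch's API, just the idea):

```python
import torch

# q_f, k_f are the (positive) feature maps of the new token's query/key,
# v is the new token's value. hypothetical helper names, not the library's API.

def init_state(dim_features, dim_value):
    # running sums over the prefix: sum_j k_f_j v_j^T and sum_j k_f_j
    s = torch.zeros(dim_features, dim_value)
    z = torch.zeros(dim_features)
    return s, z

def step(state, q_f, k_f, v):
    s, z = state
    s = s + torch.einsum('f,d->fd', k_f, v)    # accumulate k ⊗ v
    z = z + k_f                                # accumulate k for the normalizer
    out = torch.einsum('f,fd->d', q_f, s) / (torch.dot(q_f, z) + 1e-6)
    return out, (s, z)

# one decoding step costs O(m * d) no matter how long the prefix already is
m, d = 256, 64
state = init_state(m, d)
q_f, k_f, v = torch.rand(m), torch.rand(m), torch.randn(d)
out, state = step(state, q_f, k_f, v)
print(out.shape)  # torch.Size([64])
```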