[Open] JamesDeAntonis opened 3 years ago
@JamesDeAntonis do you mean on training or eval?
We observed it in both. I heard from here that the reason is the lack of caching? Are you still planning to implement it?
@JamesDeAntonis training is as fast as it can be - basically, if you are training at less than 2048 context length, you should expect it to be the same speed or slower
Eval should be really fast though, and that's something I could work on. It should be as fast as an RNN in the end. I'll take a look at it later this week!
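For context on why eval can be RNN-fast: causal linear attention admits a constant-size running state, so each generated token costs O(d_k · d_v) regardless of how long the sequence is. A minimal numpy sketch of that cached decoding step, checked against the quadratic masked formulation (note `phi` here is a simple ELU+1 stand-in feature map, not Performer's actual FAVOR+ random features):

```python
import numpy as np

def phi(x):
    # stand-in positive feature map (ELU + 1); Performer uses random features
    return np.where(x > 0, x + 1.0, np.exp(x))

def causal_linear_attention_step(state, z, q, k, v):
    """One decoding step with cached state: cost is O(d_k * d_v),
    independent of sequence length -- this is the RNN-like eval path."""
    fk = phi(k)
    state = state + np.outer(fk, v)   # running sum of phi(k_j) v_j^T
    z = z + fk                        # running normalizer sum of phi(k_j)
    fq = phi(q)
    out = fq @ state / (fq @ z + 1e-6)
    return out, state, z

def causal_linear_attention_cached(Q, K, V):
    """Decode a whole sequence through the step function above."""
    n, dk = Q.shape
    dv = V.shape[1]
    outs = np.zeros((n, dv))
    state, z = np.zeros((dk, dv)), np.zeros(dk)
    for i in range(n):
        outs[i], state, z = causal_linear_attention_step(state, z, Q[i], K[i], V[i])
    return outs

def causal_linear_attention_quadratic(Q, K, V):
    """Reference: same quantity computed with an explicit causal mask."""
    A = (phi(Q) @ phi(K).T) * np.tril(np.ones((len(Q), len(K))))
    return (A @ V) / (A.sum(axis=1, keepdims=True) + 1e-6)
```

The cached version produces the same outputs as the masked quadratic one, but the per-token work never grows with context length.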
For some reason, our causal Performer runs slower than causal regular attention. You observe that Performer is faster even in the causal case, right? Curious how to troubleshoot this (we don't use the full PerformerLM, just CrossAttention and SelfAttention; not sure if that's relevant)
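One plausible reason for that gap, as a hedged guess: non-causal linear attention is just two dense matmuls, but the causal case needs a prefix sum over per-position outer products. Without a fused kernel that prefix sum either loops over time steps in Python or materializes an (n, d_k, d_v) tensor, while masked softmax attention stays a single batched matmul that GPUs handle extremely well. A numpy sketch of the two computational patterns (again with a stand-in ELU+1 `phi`, not FAVOR+):

```python
import numpy as np

def phi(x):
    # stand-in positive feature map (ELU + 1); Performer uses random features
    return np.where(x > 0, x + 1.0, np.exp(x))

def causal_linear_attention_cumsum(Q, K, V):
    """Causal linear attention via an explicit prefix sum.
    Materializes an (n, d_k, d_v) tensor of running outer products --
    this cumsum is the part that is slow without a fused kernel."""
    fQ, fK = phi(Q), phi(K)
    S = np.cumsum(fK[:, :, None] * V[:, None, :], axis=0)  # (n, dk, dv)
    z = np.cumsum(fK, axis=0)                              # (n, dk)
    num = np.einsum('nd,ndv->nv', fQ, S)
    den = np.einsum('nd,nd->n', fQ, z) + 1e-6
    return num / den[:, None]

def causal_softmax_attention(Q, K, V):
    """Regular causal attention: one masked matmul + softmax + matmul,
    a pattern GPU libraries optimize heavily."""
    n, d = Q.shape
    A = Q @ K.T / np.sqrt(d)
    A = np.where(np.tril(np.ones((n, n), dtype=bool)), A, -np.inf)
    A = np.exp(A - A.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)
    return A @ V
```

So at moderate sequence lengths the quadratic-but-fused softmax path can easily beat the linear-but-scan-bound Performer path, which would match what you're seeing.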