lucidrains / performer-pytorch

An implementation of Performer, a linear attention-based transformer, in Pytorch
MIT License

Recover attention scores #70

Open carlomarxdk opened 3 years ago

carlomarxdk commented 3 years ago

Is it possible to recover the attention scores from the Fast Attention module?

gaganbahga commented 3 years ago

I don't believe that's possible, because the order of computation is (Q' (K'^T V)): the full n x n attention matrix is never materialized. Would be interesting to know if someone has a different idea/workaround.
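For intuition, here is a toy sketch in plain PyTorch (not the actual FAVOR+ kernel; it omits the random feature maps and normalization) contrasting the two orderings. Standard attention materializes the n x n score matrix, while the linear ordering only ever builds a d x d context matrix, so there is no score matrix to read out:

```python
import torch

b, h, n, d = 1, 8, 1024, 64  # batch, heads, sequence length, head dim
q = torch.randn(b, h, n, d)
k = torch.randn(b, h, n, d)
v = torch.randn(b, h, n, d)

# standard attention: the (n x n) score matrix exists explicitly
scores = torch.softmax(q @ k.transpose(-1, -2) / d ** 0.5, dim=-1)  # (b, h, n, n)
out_standard = scores @ v

# Performer-style ordering: contract k with v first, so the largest
# intermediate is the (d x d) context matrix; scores are never formed
context = k.transpose(-1, -2) @ v  # (b, h, d, d)
out_linear = q @ context           # (b, h, n, d)
```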

WintrumWang commented 3 years ago

In the Performer paper, the authors use a special "V" that is a diagonal matrix (one-hot indicators); the attention outputs then equal the attention scores. I suggest reading the paragraphs around Figure 10 in the paper. However, I'm having trouble implementing it, because it is awkward to pass both the attention scores and the attention outputs to other functions/classes at the same time.
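For reference, a minimal, untested sketch of that trick against this repo's FastAttention module. It assumes the non-causal path, whose einsum lets V's last dimension differ from the head dimension. Since the output is A_hat @ V, feeding V = I returns the approximate attention matrix A_hat itself; note this reintroduces O(n^2) memory, so it is only suitable for inspection/visualization, not training at full length:

```python
import torch
from performer_pytorch import FastAttention

b, h, n, d = 1, 8, 256, 64
attn = FastAttention(dim_heads = d, nb_features = 256, causal = False)

q = torch.randn(b, h, n, d)
k = torch.randn(b, h, n, d)

# pass an identity matrix in place of V: out = A_hat @ I = A_hat,
# the approximate (n x n) attention matrix, row-normalized as usual
eye = torch.eye(n).expand(b, h, n, n)
approx_scores = attn(q, k, eye)  # (b, h, n, n)
```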

WintrumWang commented 3 years ago

@lucidrains Could you please help us with the implementation for obtaining the attention weights?