Open carlomarxdk opened 3 years ago
I don't believe that's possible, because the order of computation is (Q' (K'^T V)): the n x n attention matrix is never formed explicitly. Would be interesting to know if someone has a different idea/workaround.
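To make the point above concrete, here is a minimal NumPy sketch (not the library's actual code) of linear attention in the Performer's computation order. The names `q_p`/`k_p` are illustrative stand-ins for the random feature maps phi(Q) and phi(K):

```python
import numpy as np

n, m, d = 6, 4, 3            # sequence length, feature dim, value dim
rng = np.random.default_rng(0)
q_p = rng.random((n, m))     # phi(Q), non-negative random features (illustrative)
k_p = rng.random((n, m))     # phi(K)
v = rng.random((n, d))

# Fast path: compute K'^T V first (an m x d matrix), then multiply by Q'.
# The n x n attention matrix is never materialized, which is why the
# scores cannot simply be read out of the module.
norm = q_p @ k_p.sum(axis=0)               # row-wise normalizer, shape (n,)
out_fast = (q_p @ (k_p.T @ v)) / norm[:, None]

# Equivalent quadratic path that does build the attention matrix:
attn = (q_p @ k_p.T) / norm[:, None]       # (n, n) approximate attention scores
out_slow = attn @ v
assert np.allclose(out_fast, out_slow)     # same output, different association order
```

The two paths give the same output by associativity of matrix multiplication; only the fast path avoids the O(n^2) intermediate.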
In the Performer paper, the authors use a special "V": an identity matrix (one-hot indicators), so that the attention output equals the attention scores themselves. I suggest reading the paragraphs around Figure 10 in the paper. However, I'm having trouble implementing it, because it is awkward to pass both the attention scores and the attention results to other functions/classes at the same time.
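A minimal sketch of that identity-V trick, assuming the non-causal variant; `q_p`/`k_p` again stand in for the random feature maps. Note the second pass with V = I costs O(n^2 m), so it forfeits the linear-time advantage and is only practical for inspection/visualization:

```python
import numpy as np

n, m = 6, 4
rng = np.random.default_rng(0)
q_p = rng.random((n, m))   # phi(Q) random features (illustrative names)
k_p = rng.random((n, m))   # phi(K)

# Run the same fast-attention computation, but with one-hot indicator
# "values": V = I. The output is then the n x n attention matrix itself.
v_eye = np.eye(n)
norm = q_p @ k_p.sum(axis=0)
scores = (q_p @ (k_p.T @ v_eye)) / norm[:, None]

# Sanity check: like softmax attention, each row sums to 1.
assert np.allclose(scores.sum(axis=1), 1.0)
```

In practice this could be exposed as a second forward pass (or an optional flag) that returns `scores` alongside the regular output, rather than threading both through every intermediate call.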
@lucidrains Could you please help us with implementing a way to obtain the attention weights?
Is it possible to recover the attention scores from the Fast Attention module?