Open patrickvonplaten opened 2 years ago
Hi, @patrickvonplaten! Sorry for the late reply, and thank you very much for pointing that out!
This is actually a compute- and memory-efficient form of attention called Efficient Attention. Mathematically, it is claimed to be approximately equivalent to classical dot-product attention.
That said, we noticed that we missed taking the softmax of the query vectors, our bad. Since the softmax is just another form of normalization, it is perhaps no surprise that the model worked out of the box regardless.
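For reference, the full form (with the softmax applied to the queries as well) can be sketched in NumPy as follows; `efficient_attention` is an illustrative name and layout, not the code in the repo:

```python
import numpy as np

def softmax(x, axis):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def efficient_attention(q, k, v):
    # Efficient Attention: normalize Q over the feature axis and K over the
    # position axis, then associate (K^T V) first, avoiding the n x n map.
    # q, k: (n, d_k); v: (n, d_v)
    q = softmax(q, axis=-1)   # each query row sums to 1
    k = softmax(k, axis=0)    # each key feature column sums to 1 over positions
    context = k.T @ v         # (d_k, d_v) global context
    return q @ context        # (n, d_v)
```

With n positions and d features this costs O(n * d^2) time and memory, instead of the O(n^2) attention map of dot-product attention.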
I see that makes sense! Thanks for replying so quickly!
Hey @ivanvovk et al.
Thanks a lot for open-sourcing the model - it's working really well! I've been looking a bit through the code base, and I was surprised to see that the attention layer here: https://github.com/huawei-noah/Speech-Backbones/blob/b82fdd546d9d977573c8557f242b06a0770ece8e/Grad-TTS/model/diffusion.py#L95
computes the softmax over the projected keys instead of over the product of queries and keys.
Usually, I know self-attention as:
Value x Softmax(Query x Key^T / sqrt(d_k))
but it seems like here it is
(Value x Softmax(Key)) x Query
=> How does this relate to standard self-attention? Where does it come from?
Best, Patrick
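The two forms above can be sketched side by side in NumPy (illustrative code and function names, not taken from the repo; the second function reflects my reading of the linked layer's einsum layout):

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dot_product_attention(q, k, v):
    # classical form: Softmax(Q K^T / sqrt(d_k)) V, builds an (n, n) map
    d_k = q.shape[-1]
    weights = softmax(q @ k.T / np.sqrt(d_k), axis=-1)
    return weights @ v

def key_softmax_attention(q, k, v):
    # the variant in question: softmax over the key positions only, then
    # (Softmax(K)^T V) is contracted with Q; no (n, n) map is ever formed
    context = softmax(k, axis=0).T @ v   # (d_k, d_v)
    return q @ context                   # (n, d_v)
```

Both return an (n, d_v) output, but the second never materializes the quadratic attention map.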