Open patrickvonplaten opened 2 years ago
Hi, @patrickvonplaten! Sorry for the late reply, and thank you very much for pointing that out!
This is actually a compute- and memory-efficient form of attention called Efficient Attention. Mathematically, it is claimed to be approximately equivalent to classical dot-product attention.
That said, we noticed that we missed taking the softmax of the query vectors, our bad. Since the softmax is just another form of normalization, it is perhaps no surprise that the model worked out of the box regardless.
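For reference, the full form (with the softmax applied to the queries as well) can be sketched in NumPy as follows; `efficient_attention` is an illustrative name and layout, not the code in the repo:

```python
import numpy as np

def softmax(x, axis):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def efficient_attention(q, k, v):
    # Efficient Attention: normalize Q over the feature axis and K over the
    # position axis, then associate (K^T V) first, avoiding the n x n map.
    # q, k: (n, d_k); v: (n, d_v)
    q = softmax(q, axis=-1)   # each query row sums to 1
    k = softmax(k, axis=0)    # each key feature column sums to 1 over positions
    context = k.T @ v         # (d_k, d_v) global context
    return q @ context        # (n, d_v)
```

With n positions and d features this costs O(n * d^2) time and memory, instead of the O(n^2) attention map of dot-product attention.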
I see that makes sense! Thanks for replying so quickly!
Hey @ivanvovk et al.
Thanks a lot for open-sourcing the model - it's working really well! I've been looking a bit through the code base, and I was surprised to see that the attention layer here: https://github.com/huawei-noah/Speech-Backbones/blob/b82fdd546d9d977573c8557f242b06a0770ece8e/Grad-TTS/model/diffusion.py#L95
computes the softmax over the projected keys instead of over the product of queries and keys.
Usually, I know self-attention as:
Value x Softmax(Query x Key^T / sqrt(d_k))
but it seems like here it is
(Value x Softmax(Key)) x Query
=> How does this relate to standard self-attention? Where does it come from?
Best, Patrick
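The two forms above can be sketched side by side in NumPy (illustrative code and function names, not taken from the repo; the second function reflects my reading of the linked layer's einsum layout):

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dot_product_attention(q, k, v):
    # classical form: Softmax(Q K^T / sqrt(d_k)) V, builds an (n, n) map
    d_k = q.shape[-1]
    weights = softmax(q @ k.T / np.sqrt(d_k), axis=-1)
    return weights @ v

def key_softmax_attention(q, k, v):
    # the variant in question: softmax over the key positions only, then
    # (Softmax(K)^T V) is contracted with Q; no (n, n) map is ever formed
    context = softmax(k, axis=0).T @ v   # (d_k, d_v)
    return q @ context                   # (n, d_v)
```

Both return an (n, d_v) output, but the second never materializes the quadratic attention map.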