Information
The question or comment is about chapter:
Question or comment
Hi
First of all, thank you for writing this amazing book. I have a question about the attention mechanism in Transformers (referring to page 61). I am trying to compare the meaning and mechanism of what is called self-attention in Transformers with what I previously knew as self-attention from this paper: https://aclanthology.org/N16-1174.pdf, and with the local and general attention from this one: https://arxiv.org/pdf/1508.04025.pdf. In those papers, the HAN model used self, local, or global attention on top of RNN, GRU, LSTM, or CNN layers. Since the Transformer is a new architecture, I am wondering whether the mathematics behind its attention is the same as in these two papers or not.
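To make the comparison concrete, here is a minimal NumPy sketch of how I currently understand the three formulations. The function names, shapes, and random weights are just my own illustration (not from the book or the papers' code); only the formulas follow the cited papers and the original Transformer paper:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def transformer_self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention (Vaswani et al., 2017).
    X: (seq_len, d_model); Wq/Wk/Wv: (d_model, d_k)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # every token scores every token
    return softmax(scores, axis=-1) @ V          # (seq_len, d_k)

def han_word_attention(H, W, b, u):
    """Additive attention from Yang et al. (2016, HAN).
    H: (seq_len, d_h) hidden states from a GRU/LSTM; u: learned context vector."""
    U = np.tanh(H @ W + b)                       # (seq_len, d_a)
    alpha = softmax(U @ u, axis=-1)              # one weight per time step
    return alpha @ H                             # (d_h,) sentence vector

def luong_general_score(h_t, H_s, Wa):
    """Luong et al. (2015) 'general' score: h_t^T W_a h_s for each source state h_s."""
    return H_s @ (Wa @ h_t)                      # (src_len,)

# Hypothetical shapes, just to check that the pieces fit together
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                      # 5 tokens, d_model = 8
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
print(transformer_self_attention(X, Wq, Wk, Wv).shape)   # (5, 4)
```

My (possibly wrong) takeaway is that the weighted-sum idea is shared, but the Transformer scores learned query/key projections of the same sequence against each other, whereas the HAN and Luong mechanisms score RNN hidden states against a learned context vector or a decoder state. Please correct me if this reading is off.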
Please forgive me if the question seems very basic to you.
Regards,
Shabnam