koustuvsinha / dl_ling_papers

Some summaries of papers in the confluence of Deep Learning and Linguistics / NLP

Discussion on "Attention is All You Need" #3

Open zafarali opened 6 years ago

zafarali commented 6 years ago

Let's write down some of our takeaways from "Attention is All You Need", and then one of us can collate them into a single document to put into this repo so that we can remind ourselves when we forget. I'll start by saying a few words about what I understood about the actual computation of attention in this model when I get some time tonight.

NicolasAG commented 6 years ago

Alright, I'll try to explain what I understood from our discussion on Thursday. I may be mistaken about a few things though, this is just a first attempt :) My confusion came from switching between step-by-step thinking and all-in-one-step thinking. In the code and in the paper they write matrix multiplications to be more efficient, but to start understanding it's easier to consider the step-by-step case.

In the encoding phase:

Now, I think:

So how do we compute the alphas (the attention weights)? This is what I understood, may be wrong lol:

And so now I think:
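To make the step-by-step case concrete, here is a rough numpy sketch of how I picture the computation of the alphas for a single token i. This is my own toy notation, not the paper's code: the projection matrices `W_q`, `W_k`, `W_v` and the sizes are made up, and I use a single head with `hs` in the scaling factor where the paper uses d_k per head.

```python
import numpy as np

n, hs = 5, 8                      # toy sequence length and hidden size
X = np.random.randn(n, hs)        # representations of the n input tokens

# projection matrices (square here for simplicity; in the paper d_k = hs / num_heads)
W_q = np.random.randn(hs, hs)
W_k = np.random.randn(hs, hs)
W_v = np.random.randn(hs, hs)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

i = 2                             # the token we are attending *from*
q_i = X[i] @ W_q                  # query for token i
alphas = np.empty(n)
for j in range(n):                # compare token i's query with every token j's key
    k_j = X[j] @ W_k
    alphas[j] = q_i @ k_j / np.sqrt(hs)
alphas = softmax(alphas)          # attention weights of token i over all tokens

# new representation of token i = attention-weighted sum of the values
z_i = sum(alphas[j] * (X[j] @ W_v) for j in range(n))
```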

Note: during encoding the Keys and the Values are actually the exact same thing, but they are used for different purposes: the values are multiplied by the attention weights and get updated, while the keys are only used to compute the attention weights. The Query is also more or less the same thing as the other two: it is used to compute the attention weights of one specific token, i.e. it is the representation of a single token (token i). However, if we want to be efficient and compute the attention weights for all tokens at the same time, then the Query becomes the representation of all tokens, so a matrix of size (n, hs): it is exactly the same as the Values and the Keys.
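In the all-in-one-step (matrix) view the loop above collapses into a couple of matmuls. Again just a sketch, reusing `X`, `W_q`, `W_k`, `W_v` from the previous snippet; it shows that in encoder self-attention Q, K and V are all projections of the same (n, hs) matrix.

```python
Q = X @ W_q                       # (n, hs) queries for all tokens at once
K = X @ W_k                       # (n, hs)
V = X @ W_v                       # (n, hs)

scores = Q @ K.T / np.sqrt(hs)    # (n, n): row i holds token i's scores over all tokens
A = np.exp(scores - scores.max(axis=-1, keepdims=True))
A = A / A.sum(axis=-1, keepdims=True)   # row-wise softmax -> attention weights
Z = A @ V                         # (n, hs): updated representation of every token
```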

During decoding, there are two things happening, both very similar to the above I think... Decoding is where the Query and the Keys become different at some point. This explanation of the encoding step is already quite long, so I'll stop here for now; the decoding step is for someone else to explain, or for a future post from me lol.
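Just to hint at where the Query and the Keys stop being the same thing (my guess at the shape story, not a full explanation of decoding): besides the decoder's masked self-attention over its own tokens, there is an encoder-decoder attention where the queries come from the decoder states while the keys and values come from the encoder output. A sketch, reusing the matrices from above and letting `X` stand in for the encoder output:

```python
m = 3                               # number of decoder positions so far
Y = np.random.randn(m, hs)          # decoder-side representations

Q_dec = Y @ W_q                     # (m, hs): queries from the *decoder*
K_enc = X @ W_k                     # (n, hs): keys from the *encoder* output
V_enc = X @ W_v                     # (n, hs): values from the *encoder* output

scores = Q_dec @ K_enc.T / np.sqrt(hs)   # (m, n): each decoder token attends over the source tokens
A = np.exp(scores - scores.max(axis=-1, keepdims=True))
A = A / A.sum(axis=-1, keepdims=True)
Z_dec = A @ V_enc                   # (m, hs): source-informed decoder representations
```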

zafarali commented 6 years ago

Might be interesting: http://nlp.seas.harvard.edu/2018/04/03/attention.html