koustuvsinha / dl_ling_papers

Some summaries of papers in the confluence of Deep Learning and Linguistics / NLP

Discussion on "Attention is All You Need" #3

Open zafarali opened 6 years ago

zafarali commented 6 years ago

Let's write down some of our takeaways from "Attention is All You Need", and then one of us can collate them into a single document to put into this repo so that we can remind ourselves when we forget. I'll start by saying a few words about what I understood about the actual computation of attention in this model when I get some time tonight.

NicolasAG commented 6 years ago

Alright, I'll try to explain what I understood from our discussion on Thursday. I may be mistaken about a few things though, this is just a first attempt :) My confusion came from switching between step-by-step thinking and all-in-one-step thinking. In the code and in the paper they write matrix multiplications to be more efficient, but to start understanding it's easier to consider the step-by-step case.

In the encoding phase:

Now, I think:

So how do we compute the alphas (the attention weights)? This is what I understood, may be wrong lol:

And so now I think:
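To make the step-by-step case concrete, here is a rough numpy sketch of how I picture the computation of the alphas for a single token i. This is my own toy notation, not the paper's code: the projection matrices `W_q`, `W_k`, `W_v` and the sizes are made up, and I use a single head with `hs` in the scaling factor where the paper uses d_k per head.

```python
import numpy as np

n, hs = 5, 8                      # toy sequence length and hidden size
X = np.random.randn(n, hs)        # representations of the n input tokens

# projection matrices (square here for simplicity; in the paper d_k = hs / num_heads)
W_q = np.random.randn(hs, hs)
W_k = np.random.randn(hs, hs)
W_v = np.random.randn(hs, hs)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

i = 2                             # the token we are attending *from*
q_i = X[i] @ W_q                  # query for token i
alphas = np.empty(n)
for j in range(n):                # compare token i's query with every token j's key
    k_j = X[j] @ W_k
    alphas[j] = q_i @ k_j / np.sqrt(hs)
alphas = softmax(alphas)          # attention weights of token i over all tokens

# new representation of token i = attention-weighted sum of the values
z_i = sum(alphas[j] * (X[j] @ W_v) for j in range(n))
```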

Note: during encoding the Keys and the Values are actually the exact same thing, but they are used for different purposes: the values are multiplied by the attention weights and get updated, while the keys are only used to compute the attention weights. The Query is also more or less the same thing as the other two: it is used to compute the attention weights of one specific token, i.e. it is the representation of a single token (token i). However, if we want to be efficient and compute the attention weights for all tokens at the same time, then the Query becomes the representation of all tokens, so a matrix of size (n, hs): it is exactly the same as the Values and the Keys.
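In the all-in-one-step (matrix) view the loop above collapses into a couple of matmuls. Again just a sketch, reusing `X`, `W_q`, `W_k`, `W_v` from the previous snippet; it shows that in encoder self-attention Q, K and V are all projections of the same (n, hs) matrix.

```python
Q = X @ W_q                       # (n, hs) queries for all tokens at once
K = X @ W_k                       # (n, hs)
V = X @ W_v                       # (n, hs)

scores = Q @ K.T / np.sqrt(hs)    # (n, n): row i holds token i's scores over all tokens
A = np.exp(scores - scores.max(axis=-1, keepdims=True))
A = A / A.sum(axis=-1, keepdims=True)   # row-wise softmax -> attention weights
Z = A @ V                         # (n, hs): updated representation of every token
```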

During decoding, there are two things happening, both very similar to the above I think... Decoding is where the Query and the Keys become different at some point. This explanation of the encoding step is already quite long, so I'll stop here for now; the decoding step is for someone else to explain, or for a future post from me lol.
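Just to hint at where the Query and the Keys stop being the same thing (my guess at the shape story, not a full explanation of decoding): besides the decoder's masked self-attention over its own tokens, there is an encoder-decoder attention where the queries come from the decoder states while the keys and values come from the encoder output. A sketch, reusing the matrices from above and letting `X` stand in for the encoder output:

```python
m = 3                               # number of decoder positions so far
Y = np.random.randn(m, hs)          # decoder-side representations

Q_dec = Y @ W_q                     # (m, hs): queries from the *decoder*
K_enc = X @ W_k                     # (n, hs): keys from the *encoder* output
V_enc = X @ W_v                     # (n, hs): values from the *encoder* output

scores = Q_dec @ K_enc.T / np.sqrt(hs)   # (m, n): each decoder token attends over the source tokens
A = np.exp(scores - scores.max(axis=-1, keepdims=True))
A = A / A.sum(axis=-1, keepdims=True)
Z_dec = A @ V_enc                   # (m, hs): source-informed decoder representations
```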

zafarali commented 6 years ago

Might be interesting: http://nlp.seas.harvard.edu/2018/04/03/attention.html