whchan05 opened this issue 1 year ago
This does have W^O; it's here: https://github.com/karpathy/minGPT/blob/37baab71b9abea1b76ab957409a1cc2fbfba8a26/mingpt/model.py#L42
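For reference, here is a condensed, self-contained sketch of that block's structure (names follow the repo, but dropout and the config plumbing are left out, and the sizes in the usage line at the bottom are made up). The point is that `c_attn` produces q, k, v for every head in one matmul, and `c_proj` at the end plays the role of the paper's W^O:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttentionSketch(nn.Module):
    """Condensed sketch of a minGPT-style attention block (simplified, not the repo's exact code)."""
    def __init__(self, n_embd, n_head, block_size):
        super().__init__()
        assert n_embd % n_head == 0
        # fused query/key/value projection for all heads at once (the repo calls this c_attn)
        self.c_attn = nn.Linear(n_embd, 3 * n_embd)
        # output projection applied after the heads are concatenated -- this is the paper's W^O
        self.c_proj = nn.Linear(n_embd, n_embd)
        self.n_head = n_head
        self.n_embd = n_embd
        # causal mask: each position may only attend to itself and earlier positions
        mask = torch.tril(torch.ones(block_size, block_size))
        self.register_buffer("mask", mask.view(1, 1, block_size, block_size))

    def forward(self, x):
        B, T, C = x.size()
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        # slice the C channels into n_head heads of size C // n_head
        hs = C // self.n_head
        q = q.view(B, T, self.n_head, hs).transpose(1, 2)  # (B, nh, T, hs)
        k = k.view(B, T, self.n_head, hs).transpose(1, 2)
        v = v.view(B, T, self.n_head, hs).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(hs)
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        att = F.softmax(att, dim=-1)
        y = att @ v                                        # (B, nh, T, hs)
        y = y.transpose(1, 2).contiguous().view(B, T, C)   # "Concat(head_1, ..., head_h)"
        return self.c_proj(y)                              # apply W^O

x = torch.randn(2, 5, 32)
print(CausalSelfAttentionSketch(n_embd=32, n_head=4, block_size=16)(x).shape)  # torch.Size([2, 5, 32])
```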
I had a similar question. It turns out that this is equivalent to what the Transformer paper does, but it's a bit tricky to see why.
The paper does the following: you can think of it as making 3 * n_head separate linear projections, one W_i^Q, W_i^K, W_i^V per head.
This repo instead does both steps at once via c_attn: 1) it computes q, k, v in a single matmul, and 2) it does so for every head, slicing the result into heads afterwards.
The paper does not slice up the embeddings; instead, each head has a linear layer that maps the full embedding down to a smaller size (d_model / n_head), runs attention on those smaller projections, concatenates the heads, and mixes them back together via W^O.
FWIW, you can see Karpathy walking through this part here: https://youtu.be/kCc8FmEb1nY?feature=shared&t=4919
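To see concretely why the fused-and-sliced version matches the paper's separate per-head matrices, here is a small check (the dimensions are made up, the bias is turned off just to keep the comparison simple, and only q is verified; k and v work the same way, with c_proj playing the part of W^O):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, T, C, n_head = 2, 4, 8, 2      # batch, sequence length, embedding size, heads (made-up sizes)
hs = C // n_head                   # per-head size, i.e. the paper's d_k = d_v = d_model / h

x = torch.randn(B, T, C)

# minGPT style: one fused projection for q, k, v across all heads, then slice into heads
c_attn = nn.Linear(C, 3 * C, bias=False)
q, k, v = c_attn(x).split(C, dim=2)
q = q.view(B, T, n_head, hs).transpose(1, 2)   # (B, n_head, T, hs)

# paper style: a separate W_i^Q per head; each one is just the matching slice of c_attn's weight
W = c_attn.weight                               # shape (3C, C); rows 0..C-1 produce q
for i in range(n_head):
    W_q_i = W[i * hs:(i + 1) * hs, :]           # the paper's W_i^Q, an (hs, C) matrix
    q_i = x @ W_q_i.T                           # the paper's Q W_i^Q, shape (B, T, hs)
    assert torch.allclose(q_i, q[:, i], atol=1e-6)

print("fused-and-sliced q matches the per-head projections")
```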
I found this excerpt from the paper clarifying:
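For reference, the multi-head attention definition from Section 3.2.2 of the paper (presumably the passage being referred to) is:

$$
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O,
\qquad \mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\; K W_i^K,\; V W_i^V)
$$

with $W_i^Q, W_i^K \in \mathbb{R}^{d_\mathrm{model} \times d_k}$, $W_i^V \in \mathbb{R}^{d_\mathrm{model} \times d_v}$, and $W^O \in \mathbb{R}^{h d_v \times d_\mathrm{model}}$. With $d_k = d_v = d_\mathrm{model} / h$, the total cost is similar to single-head attention at full dimensionality.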
It seems that the output of this block is simply reshaped from the multiple heads. In the original "Attention Is All You Need" paper, there is another linear layer, W^O, applied after the heads are concatenated. May I ask whether this is intentional or an error? Thank you