whchan05 opened this issue 1 year ago
This does have W^O; it's here: https://github.com/karpathy/minGPT/blob/37baab71b9abea1b76ab957409a1cc2fbfba8a26/mingpt/model.py#L42
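For reference, here is a condensed, self-contained sketch of that block's structure (names follow the repo, but dropout and the config plumbing are left out, and the sizes in the usage line at the bottom are made up). The point is that `c_attn` produces q, k, v for every head in one matmul, and `c_proj` at the end plays the role of the paper's W^O:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttentionSketch(nn.Module):
    """Condensed sketch of a minGPT-style attention block (simplified, not the repo's exact code)."""
    def __init__(self, n_embd, n_head, block_size):
        super().__init__()
        assert n_embd % n_head == 0
        # fused query/key/value projection for all heads at once (the repo calls this c_attn)
        self.c_attn = nn.Linear(n_embd, 3 * n_embd)
        # output projection applied after the heads are concatenated -- this is the paper's W^O
        self.c_proj = nn.Linear(n_embd, n_embd)
        self.n_head = n_head
        self.n_embd = n_embd
        # causal mask: each position may only attend to itself and earlier positions
        mask = torch.tril(torch.ones(block_size, block_size))
        self.register_buffer("mask", mask.view(1, 1, block_size, block_size))

    def forward(self, x):
        B, T, C = x.size()
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        # slice the C channels into n_head heads of size C // n_head
        hs = C // self.n_head
        q = q.view(B, T, self.n_head, hs).transpose(1, 2)  # (B, nh, T, hs)
        k = k.view(B, T, self.n_head, hs).transpose(1, 2)
        v = v.view(B, T, self.n_head, hs).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(hs)
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        att = F.softmax(att, dim=-1)
        y = att @ v                                        # (B, nh, T, hs)
        y = y.transpose(1, 2).contiguous().view(B, T, C)   # "Concat(head_1, ..., head_h)"
        return self.c_proj(y)                              # apply W^O

x = torch.randn(2, 5, 32)
print(CausalSelfAttentionSketch(n_embd=32, n_head=4, block_size=16)(x).shape)  # torch.Size([2, 5, 32])
```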
I had a similar question. It turns out that this is equivalent to what the Transformer paper does, but it's a bit tricky to see why.
The paper does the following: you can think of it as making 3 * n_head separate linear projections, one W_i^Q, W_i^K, W_i^V per head.
This repo instead does both steps at once via c_attn: 1) it computes q, k, v in a single matmul, and 2) it does so for every head, slicing the result into heads afterwards.
The paper does not slice up the embeddings; instead, each head has a linear layer that maps the full embedding down to a smaller size (d_model / n_head), runs attention on those smaller projections, concatenates the heads, and mixes them back together via W^O.
FWIW, you can see Karpathy walking through this part here: https://youtu.be/kCc8FmEb1nY?feature=shared&t=4919
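To see concretely why the fused-and-sliced version matches the paper's separate per-head matrices, here is a small check (the dimensions are made up, the bias is turned off just to keep the comparison simple, and only q is verified; k and v work the same way, with c_proj playing the part of W^O):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, T, C, n_head = 2, 4, 8, 2      # batch, sequence length, embedding size, heads (made-up sizes)
hs = C // n_head                   # per-head size, i.e. the paper's d_k = d_v = d_model / h

x = torch.randn(B, T, C)

# minGPT style: one fused projection for q, k, v across all heads, then slice into heads
c_attn = nn.Linear(C, 3 * C, bias=False)
q, k, v = c_attn(x).split(C, dim=2)
q = q.view(B, T, n_head, hs).transpose(1, 2)   # (B, n_head, T, hs)

# paper style: a separate W_i^Q per head; each one is just the matching slice of c_attn's weight
W = c_attn.weight                               # shape (3C, C); rows 0..C-1 produce q
for i in range(n_head):
    W_q_i = W[i * hs:(i + 1) * hs, :]           # the paper's W_i^Q, an (hs, C) matrix
    q_i = x @ W_q_i.T                           # the paper's Q W_i^Q, shape (B, T, hs)
    assert torch.allclose(q_i, q[:, i], atol=1e-6)

print("fused-and-sliced q matches the per-head projections")
```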
I found this excerpt from the paper clarifying:
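For reference, the multi-head attention definition from Section 3.2.2 of the paper (presumably the passage being referred to) is:

$$
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O,
\qquad \mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\; K W_i^K,\; V W_i^V)
$$

with $W_i^Q, W_i^K \in \mathbb{R}^{d_\mathrm{model} \times d_k}$, $W_i^V \in \mathbb{R}^{d_\mathrm{model} \times d_v}$, and $W^O \in \mathbb{R}^{h d_v \times d_\mathrm{model}}$. With $d_k = d_v = d_\mathrm{model} / h$, the total cost is similar to single-head attention at full dimensionality.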
It seems that the output of this block is simply reshaped from the multiple heads. In the original "Attention Is All You Need" paper, there is another linear layer, W^O, applied after the heads are concatenated. May I ask whether this is intentional or an error? Thank you