karpathy / minGPT

A minimal PyTorch re-implementation of the OpenAI GPT (Generative Pretrained Transformer) training
MIT License

Output of CausalSelfAttention #118

Open whchan05 opened 1 year ago

whchan05 commented 1 year ago

It seems that the output of this block is simply reshaped from the multiple heads. In the original "Attention Is All You Need" paper, there is another linear layer, W^O, applied after the heads are concatenated. May I ask whether omitting it is intentional or an error? Thank you

theicfire commented 6 months ago

This does have W^O, it's here: https://github.com/karpathy/minGPT/blob/37baab71b9abea1b76ab957409a1cc2fbfba8a26/mingpt/model.py#L42
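
For reference, here is a stripped-down sketch of the block in question (paraphrasing the linked code; dropout and the causal mask are omitted, so this is not the exact implementation):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttentionSketch(nn.Module):
    """Simplified sketch of minGPT's CausalSelfAttention (no dropout, no causal mask)."""
    def __init__(self, n_embd, n_head):
        super().__init__()
        assert n_embd % n_head == 0
        # fused projection that produces q, k, v for all heads in one matmul
        self.c_attn = nn.Linear(n_embd, 3 * n_embd)
        # output projection: this plays the role of W^O from the paper
        self.c_proj = nn.Linear(n_embd, n_embd)
        self.n_head = n_head
        self.n_embd = n_embd

    def forward(self, x):
        B, T, C = x.size()
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        # slice the channel dimension into heads: (B, T, C) -> (B, n_head, T, C // n_head)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
        att = F.softmax(att, dim=-1)
        y = att @ v                                        # (B, n_head, T, head_dim)
        y = y.transpose(1, 2).contiguous().view(B, T, C)   # concatenate the heads back together
        return self.c_proj(y)                              # apply W^O
```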

But I had a similar question. It turns out that this is equivalent to the formulation in the Transformer paper, but it's a bit tricky to see why.

The paper does the following: you can think of it as performing 3 * n_head separate linear projections, one W_i^Q, W_i^K, and W_i^V per head, each mapping the full embedding down to a smaller per-head size.

This repo instead does two things at once, all via c_attn: 1) it computes k, q, v in a single matrix multiply, and 2) it does this for every head simultaneously, then slices the result into heads.

So the paper does not slice up the embeddings; each head has its own linear layer that maps the full embedding to a smaller size, attention runs on those per-head projections, everything is concatenated, and the result is mapped back to the model dimension via W^O. The sketch below shows why slicing one fused projection into heads is equivalent to having separate per-head projections.
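
Here is a small demonstration of that equivalence for the q projection (toy sizes, bias disabled for clarity; the same argument applies to k and v):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

B, T, n_embd, n_head = 2, 4, 8, 2       # toy sizes, just for the demo
head_dim = n_embd // n_head

x = torch.randn(B, T, n_embd)

# --- minGPT-style: one fused projection for q, k, v across all heads ---
c_attn = nn.Linear(n_embd, 3 * n_embd, bias=False)
q, k, v = c_attn(x).split(n_embd, dim=2)
q = q.view(B, T, n_head, head_dim)      # slice channels into heads: (B, T, nh, hd)

# --- paper-style: a separate W_i^Q per head, each mapping n_embd -> head_dim ---
# build the per-head matrices by slicing the fused weight, so both views share parameters
W_q = c_attn.weight[:n_embd]            # (n_embd, n_embd): the q-part of the fused weight
q_heads = []
for i in range(n_head):
    W_i_q = W_q[i * head_dim:(i + 1) * head_dim]   # (head_dim, n_embd), i.e. one W_i^Q
    q_heads.append(x @ W_i_q.T)                    # (B, T, head_dim)
q_paper = torch.stack(q_heads, dim=2)              # (B, T, nh, hd)

print(torch.allclose(q, q_paper))  # True: fused-then-sliced == per-head projections
```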

Fwiw, you can see Karpathy talking about this part here: https://youtu.be/kCc8FmEb1nY?feature=shared&t=4919

I found this excerpt from the paper clarifying:

[image: excerpt from "Attention Is All You Need" on multi-head attention]
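
For anyone reading this without the image, the excerpt is presumably the multi-head attention definition, which in the paper reads:

```latex
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O
\quad \text{where} \quad
\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\; K W_i^K,\; V W_i^V)
```

with projection matrices $W_i^Q \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$, and $W^O \in \mathbb{R}^{h d_v \times d_{\text{model}}}$.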