karpathy / nanoGPT

The simplest, fastest repository for training/finetuning medium-sized GPTs.

Solution to Exercise 1 from Youtube Lecture (Batching the heads) - Why does it work? #536

Closed Andrew-Luo1 closed 4 months ago

Andrew-Luo1 commented 4 months ago

In the YouTube lecture for this repo (https://www.youtube.com/watch?v=kCc8FmEb1nY&list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ&index=7), exercise 1 is: "EX1: The n-dimensional tensor mastery challenge: Combine the Head and MultiHeadAttention into one class that processes all the heads in parallel, treating the heads as another batch dimension (answer is in nanoGPT)."

I believe the solution in this repo is in model.py:

1. In `__init__`, you specify one big linear layer: `self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=config.bias)`.
2. In `forward`, you apply this layer and split its output into the three projections (K, V, Q) and their multiple heads (a quick sketch of this follows the snippet):

```python
q, k, v  = self.c_attn(x).split(self.n_embd, dim=2)
k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
```
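
As a minimal check (my own sketch, not code from the repo; bias is dropped and the sizes are toy values), the reshape above really does treat the heads as a batch dimension: slicing head h out of the batched k tensor gives exactly x projected by that head's own rows of c_attn.weight, with no other head's parameters involved.

```python
import torch
import torch.nn as nn

B, T, n_embd, n_head = 2, 5, 8, 2            # toy sizes, chosen for illustration
hs = n_embd // n_head                        # head size
x = torch.randn(B, T, n_embd)

c_attn = nn.Linear(n_embd, 3 * n_embd, bias=False)
q, k, v = c_attn(x).split(n_embd, dim=2)
k = k.view(B, T, n_head, hs).transpose(1, 2)     # (B, nh, T, hs)

# Rows n_embd + h*hs : n_embd + (h+1)*hs of c_attn.weight are head h's
# key projection; slicing that head out of the batched tensor recovers it.
h = 1
W_k_h = c_attn.weight[n_embd + h * hs : n_embd + (h + 1) * hs]   # (hs, n_embd)
print(torch.allclose(k[:, h], x @ W_k_h.T, atol=1e-6))           # True
```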

I was wondering whether (and if so, why) this is equivalent to the setup in https://github.com/karpathy/ng-video-lecture/blob/master/gpt.py, where we (a) evaluate the K, Q, V networks separately and (b) evaluate the heads separately, then concatenate their outputs. Since self.c_attn is fully connected, it seems like all the different networks and heads talk to one another in the forward pass here.

Is there some intuition for what happens when all these separate networks inter-communicate through weights in the forward pass?
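
For reference, the per-head setup in that gpt.py looks roughly like the sketch below (abridged from memory, with dropout, the causal mask, and the output projection omitted, so treat it as an approximation rather than the exact lecture code): each Head owns its own key/query/value linear layers, and MultiHeadAttention evaluates them independently and concatenates the results.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Head(nn.Module):
    """One attention head with its own key/query/value projections."""
    def __init__(self, n_embd, head_size):
        super().__init__()
        self.key   = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)

    def forward(self, x):
        k, q, v = self.key(x), self.query(x), self.value(x)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5   # (B, T, T)
        wei = F.softmax(wei, dim=-1)                          # causal mask omitted here
        return wei @ v                                        # (B, T, head_size)

class MultiHeadAttention(nn.Module):
    """Run each head separately, then concatenate along the channel dim."""
    def __init__(self, n_embd, n_head):
        super().__init__()
        self.heads = nn.ModuleList([Head(n_embd, n_embd // n_head) for _ in range(n_head)])

    def forward(self, x):
        return torch.cat([h(x) for h in self.heads], dim=-1)  # (B, T, n_embd)
```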

Andrew-Luo1 commented 4 months ago

My bad, fuzzy thinking. The different subnetworks would only talk to one another in the backward pass if self.c_attn had more than one layer; with a single linear layer there is no weight sharing, so each chunk of the output depends only on its own block of rows in the weight matrix.
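
To make that concrete, here is a small numerical sketch (my own check, assuming bias-free layers and toy sizes, not code from either repo): a fused linear whose weight rows are stacked from three separate projections computes exactly the same q, k, v, and a gradient flowing from q never touches the k/v rows, so nothing is shared between them.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_embd, B, T = 8, 2, 5                      # toy sizes for illustration
x = torch.randn(B, T, n_embd)

# Three separate projections, as in the per-head lecture code (bias dropped).
Wq = nn.Linear(n_embd, n_embd, bias=False)
Wk = nn.Linear(n_embd, n_embd, bias=False)
Wv = nn.Linear(n_embd, n_embd, bias=False)

# One fused projection, as in nanoGPT's c_attn, with its weight rows
# stacked from the three separate layers above.
c_attn = nn.Linear(n_embd, 3 * n_embd, bias=False)
with torch.no_grad():
    c_attn.weight.copy_(torch.cat([Wq.weight, Wk.weight, Wv.weight], dim=0))

q, k, v = c_attn(x).split(n_embd, dim=2)
print(torch.allclose(q, Wq(x), atol=1e-6),
      torch.allclose(k, Wk(x), atol=1e-6),
      torch.allclose(v, Wv(x), atol=1e-6))      # True True True

# Each output chunk depends only on its own rows of c_attn.weight, so the
# gradients stay block-separated too: a loss on q leaves the k/v rows at zero.
q.sum().backward()
print(c_attn.weight.grad[n_embd:].abs().max())  # tensor(0.)
```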