karpathy / nanoGPT

The simplest, fastest repository for training/finetuning medium-sized GPTs.

Solution to Exercise 1 from Youtube Lecture (Batching the heads) - Why does it work? #536

Closed Andrew-Luo1 closed 4 months ago

Andrew-Luo1 commented 4 months ago

In the YouTube lecture for this repo (https://www.youtube.com/watch?v=kCc8FmEb1nY&list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ&index=7), exercise 1 is: "EX1: The n-dimensional tensor mastery challenge: Combine the Head and MultiHeadAttention into one class that processes all the heads in parallel, treating the heads as another batch dimension (answer is in nanoGPT)."

I believe the solution in this repo is in model.py:

1. In `__init__`, you specify one big linear layer: `self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=config.bias)`.
2. In `forward`, you apply this layer and split its output into the three projections (K, V, Q) and their multiple heads (a quick sketch of this follows the snippet):

```python
q, k, v  = self.c_attn(x).split(self.n_embd, dim=2)
k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
```
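
As a minimal check (my own sketch, not code from the repo; bias is dropped and the sizes are toy values), the reshape above really does treat the heads as a batch dimension: slicing head h out of the batched k tensor gives exactly x projected by that head's own rows of c_attn.weight, with no other head's parameters involved.

```python
import torch
import torch.nn as nn

B, T, n_embd, n_head = 2, 5, 8, 2            # toy sizes, chosen for illustration
hs = n_embd // n_head                        # head size
x = torch.randn(B, T, n_embd)

c_attn = nn.Linear(n_embd, 3 * n_embd, bias=False)
q, k, v = c_attn(x).split(n_embd, dim=2)
k = k.view(B, T, n_head, hs).transpose(1, 2)     # (B, nh, T, hs)

# Rows n_embd + h*hs : n_embd + (h+1)*hs of c_attn.weight are head h's
# key projection; slicing that head out of the batched tensor recovers it.
h = 1
W_k_h = c_attn.weight[n_embd + h * hs : n_embd + (h + 1) * hs]   # (hs, n_embd)
print(torch.allclose(k[:, h], x @ W_k_h.T, atol=1e-6))           # True
```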

I was wondering whether (and if so, why) this is equivalent to the setup in https://github.com/karpathy/ng-video-lecture/blob/master/gpt.py, where we (a) evaluate the K, Q, V networks separately and (b) evaluate the heads separately, then concatenate their outputs. Since self.c_attn is fully connected, it seems like all the different networks and heads talk to one another in the forward pass here.

Is there some intuition for what happens when all these separate networks inter-communicate through weights in the forward pass?
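
For reference, the per-head setup in that gpt.py looks roughly like the sketch below (abridged from memory, with dropout, the causal mask, and the output projection omitted, so treat it as an approximation rather than the exact lecture code): each Head owns its own key/query/value linear layers, and MultiHeadAttention evaluates them independently and concatenates the results.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Head(nn.Module):
    """One attention head with its own key/query/value projections."""
    def __init__(self, n_embd, head_size):
        super().__init__()
        self.key   = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)

    def forward(self, x):
        k, q, v = self.key(x), self.query(x), self.value(x)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5   # (B, T, T)
        wei = F.softmax(wei, dim=-1)                          # causal mask omitted here
        return wei @ v                                        # (B, T, head_size)

class MultiHeadAttention(nn.Module):
    """Run each head separately, then concatenate along the channel dim."""
    def __init__(self, n_embd, n_head):
        super().__init__()
        self.heads = nn.ModuleList([Head(n_embd, n_embd // n_head) for _ in range(n_head)])

    def forward(self, x):
        return torch.cat([h(x) for h in self.heads], dim=-1)  # (B, T, n_embd)
```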

Andrew-Luo1 commented 4 months ago

My bad, fuzzy thinking. The different subnetworks would only talk to one another in the backward pass if self.c_attn had more than one layer; with a single linear layer there is no weight sharing, so each chunk of the output depends only on its own block of rows in the weight matrix.
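
To make that concrete, here is a small numerical sketch (my own check, assuming bias-free layers and toy sizes, not code from either repo): a fused linear whose weight rows are stacked from three separate projections computes exactly the same q, k, v, and a gradient flowing from q never touches the k/v rows, so nothing is shared between them.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_embd, B, T = 8, 2, 5                      # toy sizes for illustration
x = torch.randn(B, T, n_embd)

# Three separate projections, as in the per-head lecture code (bias dropped).
Wq = nn.Linear(n_embd, n_embd, bias=False)
Wk = nn.Linear(n_embd, n_embd, bias=False)
Wv = nn.Linear(n_embd, n_embd, bias=False)

# One fused projection, as in nanoGPT's c_attn, with its weight rows
# stacked from the three separate layers above.
c_attn = nn.Linear(n_embd, 3 * n_embd, bias=False)
with torch.no_grad():
    c_attn.weight.copy_(torch.cat([Wq.weight, Wk.weight, Wv.weight], dim=0))

q, k, v = c_attn(x).split(n_embd, dim=2)
print(torch.allclose(q, Wq(x), atol=1e-6),
      torch.allclose(k, Wk(x), atol=1e-6),
      torch.allclose(v, Wv(x), atol=1e-6))      # True True True

# Each output chunk depends only on its own rows of c_attn.weight, so the
# gradients stay block-separated too: a loss on q leaves the k/v rows at zero.
q.sum().backward()
print(c_attn.weight.grad[n_embd:].abs().max())  # tensor(0.)
```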