In the YouTube lecture for this repo (https://www.youtube.com/watch?v=kCc8FmEb1nY&list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ&index=7), exercise 1 is: "EX1: The n-dimensional tensor mastery challenge: Combine the Head and MultiHeadAttention into one class that processes all the heads in parallel, treating the heads as another batch dimension (answer is in nanoGPT)."
I believe the solution in this repo is in model.py:
In __init__, you define a single big linear layer: self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=config.bias).
In forward, you apply this layer, split its output into the three projections (Q, K, V), and reshape each to separate out the heads:
q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
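For reference, the rest of the forward pass then treats (B, nh) as batch dimensions when computing attention. Here is a minimal sketch of that step with made-up shapes (B=2, T=8, C=32, n_head=4 are purely illustrative), roughly following the manual attention path in model.py:

```python
import math
import torch
import torch.nn.functional as F

# illustrative shapes only
B, T, C, n_head = 2, 8, 32, 4
hs = C // n_head  # head size
q = torch.randn(B, n_head, T, hs)
k = torch.randn(B, n_head, T, hs)
v = torch.randn(B, n_head, T, hs)

# attention scores: matmul broadcasts over the (B, nh) "batch" dims
att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(hs))   # (B, nh, T, T)
mask = torch.tril(torch.ones(T, T)).view(1, 1, T, T)       # causal mask
att = att.masked_fill(mask == 0, float('-inf'))
att = F.softmax(att, dim=-1)
y = att @ v                                                 # (B, nh, T, hs)
y = y.transpose(1, 2).contiguous().view(B, T, C)            # re-merge heads -> (B, T, C)
```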
I was wondering if (and if so, why) this is equivalent to the setup in https://github.com/karpathy/ng-video-lecture/blob/master/gpt.py, where we a) separately evaluate the K, Q, V networks and b) separately evaluate the heads, then concatenate them. Since self.c_attn is fully connected, it looks like all the different networks and heads are talking to one another in the forward pass.
Is there some intuition for what happens when all these separate networks inter-communicate through weights in the forward pass?
My bad, fuzzy thinking. The different subnetworks would only end up talking to one another (via the backward pass) if self.c_attn had more than one layer. In this single-layer case there is no weight sharing: each output unit of c_attn has its own row of the weight matrix, so splitting the output into Q, K, V (and reshaping into heads) is exactly equivalent to evaluating separate linear layers and concatenating.
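To convince myself, here is a small hedged check (made-up shapes, bias omitted for brevity) showing that slicing the fused weight recovers the same Q, K, V you would get from three separate projections:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_embd = 32
x = torch.randn(2, 8, n_embd)  # (B, T, C), illustrative shapes

# fused projection, as in model.py (bias dropped just to keep the check short)
c_attn = nn.Linear(n_embd, 3 * n_embd, bias=False)
q, k, v = c_attn(x).split(n_embd, dim=2)

# three "separate" projections built from slices of the fused weight,
# mirroring the per-network setup in gpt.py
w_q, w_k, w_v = c_attn.weight.split(n_embd, dim=0)  # each (n_embd, n_embd)
q2, k2, v2 = x @ w_q.T, x @ w_k.T, x @ w_v.T

print(torch.allclose(q, q2), torch.allclose(k, k2), torch.allclose(v, v2))
# -> True True True
```

The same argument applies per head: each head's slice of Q, K, V comes from its own disjoint set of output rows, so fusing everything into one matmul is just a batching optimization, not a change in the model.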