Open · LeoPerelli opened 1 year ago
Hi @karpathy and thanks for your work!

I noticed that in the definition of the self-attention matrices you use a Linear layer, which includes a bias, while I wouldn't expect one to be there since we only want the matrix. I am talking about:

```python
self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)
```

Is this wanted or is it just a small bug? Thanks!

Why do you think the transformer paper was not using a bias? It just says "linear projections" in the paper.

Hey! I interpret a linear projection as applying a matrix, while I would call a transformation of the form Ax + b an affine transformation. Anyway, I guess it's just a tiny detail of the implementation!

Oh yeah, the term "affine" is rarely used in ML papers; they'll just say "linear", so it's simply ambiguous what the authors were implying.
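For anyone comparing the two behaviors: PyTorch's `nn.Linear` includes a bias term by default, so a pure matrix projection requires passing `bias=False` explicitly. Here is a minimal sketch contrasting the two; the `n_embd` value, tensor shapes, and variable names are illustrative, not taken from the repo:

```python
import torch
import torch.nn as nn

n_embd = 768  # illustrative embedding size (GPT-2 small uses 768)

# Default nn.Linear: computes x @ W.T + b, i.e. an affine transformation
c_attn_affine = nn.Linear(n_embd, 3 * n_embd)

# bias=False: computes x @ W.T only, a pure linear projection
c_attn_linear = nn.Linear(n_embd, 3 * n_embd, bias=False)

x = torch.randn(1, 4, n_embd)  # (batch, sequence, embedding)
q, k, v = c_attn_linear(x).split(n_embd, dim=2)  # query, key, value projections
print(q.shape, k.shape, v.shape)  # each: torch.Size([1, 4, 768])
```

Worth noting that the released GPT-2 weights do include bias terms in these projections, so keeping PyTorch's default here matches that checkpoint rather than the original transformer paper's formulation.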