Closed: tylerkastner closed this issue 1 week ago
In the code below, the query-key dot product appears to be scaled by multiplying by the square root of the head size: https://github.com/karpathy/ng-video-lecture/blob/52201428ed7b46804849dea0b3ccf0de9df1a5c3/gpt.py#L83 Shouldn't we be dividing instead? As seen in the original paper: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
The code raises the head size to the power of -0.5, so the multiplication is effectively a division by its square root.
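For anyone else who trips over this, here is a minimal sketch in plain Python that checks the two forms agree. The names `head_size` and `score` and their values are made up for illustration and are not taken from `gpt.py`.

```python
import math

head_size = 16  # hypothetical head size, chosen only for illustration
score = 8.0     # hypothetical raw query-key dot product

# ** binds tighter than *, so this is score * (head_size ** -0.5)
scaled_by_power = score * head_size ** -0.5

# The form written in the paper: divide by sqrt(d_k)
scaled_by_division = score / math.sqrt(head_size)

# Both expressions give the same scaled score
assert math.isclose(scaled_by_power, scaled_by_division)
```

Multiplying by `x ** -0.5` and dividing by `sqrt(x)` are the same operation, so either spelling matches the paper's formula.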
Of course! I missed that minus. Thanks for spotting that 🤦