karpathy / ng-video-lecture


Shouldn't we be dividing when normalizing QK^T, not multiplying? #46

Closed tylerkastner closed 1 week ago

tylerkastner commented 2 months ago

In the code below, the query-key dot product appears to be normalized by multiplying by the square root of the head size: https://github.com/karpathy/ng-video-lecture/blob/52201428ed7b46804849dea0b3ccf0de9df1a5c3/gpt.py#L83 Shouldn't we be dividing instead, as in the original paper?

[Screenshot: attention formula from the original paper]

rishabhjain1198 commented 1 week ago

The code raises the head size to the power of -0.5, so multiplying by that factor is effectively dividing by the square root of the head size.
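
To make the equivalence concrete, here is a minimal sketch in plain Python. The function names and sample values are illustrative assumptions, not code from the repository; only the `-0.5` exponent comes from this thread. Because `**` binds more tightly than `*`, an expression like `score * head_size ** -0.5` is evaluated as `score * (head_size ** -0.5)`, which is the same as `score / sqrt(head_size)`.

```python
import math


def scale_by_multiplying(score: float, head_size: int) -> float:
    """Scale a raw attention score by multiplying by head_size ** -0.5."""
    # ** has higher precedence than *, so this is score * (head_size ** -0.5).
    return score * head_size ** -0.5


def scale_by_dividing(score: float, head_size: int) -> float:
    """Scale a raw attention score by dividing by sqrt(head_size)."""
    return score / math.sqrt(head_size)


if __name__ == "__main__":
    # Hypothetical head sizes and score, chosen only for illustration.
    for head_size in (16, 32, 64):
        a = scale_by_multiplying(8.0, head_size)
        b = scale_by_dividing(8.0, head_size)
        print(head_size, a, b, math.isclose(a, b))
```

Both functions shrink the score by the same factor, so the two ways of writing the scaling differ only in notation.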

tylerkastner commented 1 week ago

Of course! I missed that minus. Thanks for spotting that 🤦