karpathy / ng-video-lecture


About gpt.py line 134-135 #21

Open hufuzhipeng opened 1 year ago

hufuzhipeng commented 1 year ago

According to the Transformer paper, it seems that we could change

x = x + self.sa(self.ln1(x))
x = x + self.ffwd(self.ln2(x))

to

x = self.ln1(x + self.sa(x))
x = self.ln2(x + self.ffwd(x))

although the results are similar.

fasterinnerlooper commented 8 months ago

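Yes. He explains why he does this in the video: https://youtu.be/kCc8FmEb1nY?si=VFtUYR-MjtrjR-Lw&t=5722 As he puts it, there has been a "reshuffling" of the structure: the original paper applies the layer norm after the residual addition (post-norm), while gpt.py applies it before each sublayer (pre-norm), which has since become the more common arrangement.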
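
For anyone comparing the two orderings discussed in this thread, here is a minimal pure-Python sketch of the difference. It is not the code from gpt.py: `layer_norm` omits the learned scale and shift, and `toy_sublayer` is a hypothetical stand-in for `self.sa` or `self.ffwd`, chosen only so the example runs without PyTorch.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def toy_sublayer(x):
    """Hypothetical stand-in for self.sa or self.ffwd: any vector-to-vector map."""
    return [2.0 * v + 1.0 for v in x]

def pre_norm_block(x, sublayer):
    # Pre-norm ordering, as in gpt.py: x = x + sublayer(ln(x))
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def post_norm_block(x, sublayer):
    # Post-norm ordering, as in the original paper: x = ln(x + sublayer(x))
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

x = [1.0, 2.0, 4.0, 8.0]
pre = pre_norm_block(x, toy_sublayer)    # residual stream keeps x unnormalized
post = post_norm_block(x, toy_sublayer)  # output is renormalized at every block

print("pre-norm :", pre)
print("post-norm:", post)
```

In the pre-norm version the input `x` passes through the residual connection untouched and only the sublayer sees normalized values. In the post-norm version the sum itself is normalized, so the output of every block has roughly zero mean and unit variance. This is also why pre-norm models usually apply one more layer norm after the last block.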