karpathy / nanoGPT

The simplest, fastest repository for training/finetuning medium-sized GPTs.
MIT License

Second Dropout layer not present in nn.MultiheadAttention implementation but Karpathy has it in his? #399

Open ltbd78 opened 11 months ago

ltbd78 commented 11 months ago

What's the reasoning behind the extra dropout layer after projection?

Karpathy's implementation has 2 dropout layers:

  1. attn_dropout
  2. resid_dropout

Karpathy's 2nd dropout layer

https://github.com/karpathy/nanoGPT/blob/eba36e84649f3c6d840a93092cb779a260544d08/model.py#L40
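For context, here is a condensed sketch (not the exact nanoGPT code; the flash-attention path, biases, etc. are omitted) of where the two dropouts sit. The names `attn_dropout` and `resid_dropout` match the repo, the rest is simplified for illustration:

```python
import math
import torch
import torch.nn as nn
from torch.nn import functional as F

class CausalSelfAttentionSketch(nn.Module):
    """Condensed sketch of nanoGPT's attention module; the two dropout
    layers keep the names used in model.py."""

    def __init__(self, n_embd, n_head, dropout, block_size):
        super().__init__()
        assert n_embd % n_head == 0
        self.c_attn = nn.Linear(n_embd, 3 * n_embd)   # fused q, k, v projection
        self.c_proj = nn.Linear(n_embd, n_embd)       # output projection
        self.attn_dropout = nn.Dropout(dropout)       # dropout #1: on the attention weights
        self.resid_dropout = nn.Dropout(dropout)      # dropout #2: after the output projection
        self.n_head = n_head
        mask = torch.tril(torch.ones(block_size, block_size))
        self.register_buffer("mask", mask.view(1, 1, block_size, block_size))

    def forward(self, x):
        B, T, C = x.size()
        q, k, v = self.c_attn(x).split(C, dim=2)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        att = F.softmax(att, dim=-1)
        att = self.attn_dropout(att)                  # dropout #1
        y = (att @ v).transpose(1, 2).contiguous().view(B, T, C)
        y = self.resid_dropout(self.c_proj(y))        # dropout #2, the one in question (model.py#L40)
        return y
```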

Torch's implementation only has 1 dropout layer:

  1. attn_dropout

In Torch's MultiheadAttention forward we find only that one dropout layer (applied to the attention weights), not the second one.

https://github.com/pytorch/pytorch/blob/3cbe7a53a9a1cea2ef2a042f1ab6f7758f7e4d74/torch/csrc/api/include/torch/nn/functional/activation.h#L910
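As a sanity check, the `dropout` argument of `nn.MultiheadAttention` only drops attention weights. If you wanted the same behaviour as Karpathy's module while using the built-in one, you would presumably have to add the second dropout yourself; a rough sketch of what I mean:

```python
import torch
import torch.nn as nn

embed_dim, num_heads, p, T = 64, 4, 0.1, 10
mha = nn.MultiheadAttention(embed_dim, num_heads, dropout=p, batch_first=True)
resid_dropout = nn.Dropout(p)  # the "second" dropout, added manually outside the module

x = torch.randn(2, T, embed_dim)                                 # (batch, seq, embed)
causal_mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
y, _ = mha(x, x, x, attn_mask=causal_mask, need_weights=False)   # internal dropout = attention weights only
y = resid_dropout(y)                                             # roughly where resid_dropout would go
```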

ReinforcedKnowledge commented 9 months ago

I think it's because this is a multi-head attention (MHA) module, so that second dropout, as its name resid_dropout indicates, is the dropout applied to the output of the whole MHA sub-layer, right before it is added back to the residual stream.

This is also supported by the fact that he doesn't apply any dropout inside his Block module; see https://github.com/karpathy/nanoGPT/blob/eba36e84649f3c6d840a93092cb779a260544d08/model.py#L104. A rough sketch of that wiring is below.
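Here is a rough, self-contained sketch of the Block wiring (the sub-layers are placeholders, not the real CausalSelfAttention/MLP classes): no Dropout appears in the Block itself because each sub-layer already ends with its own dropout.

```python
import torch.nn as nn

class BlockSketch(nn.Module):
    """Rough sketch of nanoGPT's Block: the Block adds no dropout of its own
    because each sub-layer (attention and MLP) already applies one at the end."""

    def __init__(self, n_embd, dropout):
        super().__init__()
        self.ln_1 = nn.LayerNorm(n_embd)
        # stand-in for CausalSelfAttention, which ends with resid_dropout
        self.attn = nn.Sequential(nn.Linear(n_embd, n_embd), nn.Dropout(dropout))
        self.ln_2 = nn.LayerNorm(n_embd)
        # stand-in for MLP, which also ends with a Dropout in model.py
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd), nn.GELU(),
            nn.Linear(4 * n_embd, n_embd), nn.Dropout(dropout),
        )

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))   # sub-layer output is already dropout-regularized
        x = x + self.mlp(self.ln_2(x))    # same here
        return x
```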