Open ltbd78 opened 11 months ago
I think it's because it's a full Multi-Head Attention sub-layer: that second dropout is, as the name resid_dropout indicates, the dropout applied after the MHA sub-layer, right before its output is added back into the residual stream.
This is also supported by the fact that he doesn't use any dropout inside his Block
module itself; see https://github.com/karpathy/nanoGPT/blob/eba36e84649f3c6d840a93092cb779a260544d08/model.py#L104
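To illustrate that point, here's a rough, simplified sketch of the wiring (illustrative only, not the actual nanoGPT code; names like `n_embd` just follow its conventions): each sub-layer ends in its own dropout, so the residual additions in `Block` carry none.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Simplified pre-norm transformer block (illustrative, not nanoGPT's code)."""
    def __init__(self, n_embd, dropout=0.1):
        super().__init__()
        self.ln_1 = nn.LayerNorm(n_embd)
        # stand-in for the attention sub-layer: it ends in its own dropout,
        # which is the role resid_dropout plays in nanoGPT
        self.attn = nn.Sequential(nn.Linear(n_embd, n_embd), nn.Dropout(dropout))
        self.ln_2 = nn.LayerNorm(n_embd)
        # the MLP sub-layer follows the same pattern: dropout is its last op
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))  # no dropout on the residual add itself...
        x = x + self.mlp(self.ln_2(x))   # ...because each sub-layer already ends in one
        return x

print(Block(64)(torch.randn(2, 8, 64)).shape)  # torch.Size([2, 8, 64])
```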
What's the reasoning behind the extra dropout layer after the output projection?
Karpathy's implementation has 2 dropout layers:
- `attn_dropout`
- `resid_dropout`
Karpathy's 2nd dropout layer:
https://github.com/karpathy/nanoGPT/blob/eba36e84649f3c6d840a93092cb779a260544d08/model.py#L40
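For reference, here is a stripped-down, single-head sketch of where the two dropouts sit (the real class is multi-head with a fused `c_attn` projection; `CausalSelfAttentionSketch` is a made-up name used only to show the placement):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttentionSketch(nn.Module):
    """Single-head causal attention showing where the two dropouts sit."""
    def __init__(self, n_embd, block_size, dropout=0.1):
        super().__init__()
        self.q = nn.Linear(n_embd, n_embd)
        self.k = nn.Linear(n_embd, n_embd)
        self.v = nn.Linear(n_embd, n_embd)
        self.c_proj = nn.Linear(n_embd, n_embd)   # output projection
        self.attn_dropout = nn.Dropout(dropout)   # dropout #1: on the attention weights
        self.resid_dropout = nn.Dropout(dropout)  # dropout #2: after the projection
        self.register_buffer("mask", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.size()
        q, k, v = self.q(x), self.k(x), self.v(x)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(C)
        att = att.masked_fill(self.mask[:T, :T] == 0, float("-inf"))
        att = F.softmax(att, dim=-1)
        att = self.attn_dropout(att)               # same spot as torch's dropout_p
        y = att @ v
        return self.resid_dropout(self.c_proj(y))  # the "extra" dropout being asked about

print(CausalSelfAttentionSketch(64, block_size=16)(torch.randn(2, 8, 64)).shape)
```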
Torch's implementation only has 1 dropout layer:
- `attn_dropout`
In Torch's MultiheadAttention forward, we only find one dropout layer (on the attention weights), but not the second:
https://github.com/pytorch/pytorch/blob/3cbe7a53a9a1cea2ef2a042f1ab6f7758f7e4d74/torch/csrc/api/include/torch/nn/functional/activation.h#L910
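So, if I understand correctly, reproducing Karpathy's layout on top of `torch.nn.MultiheadAttention` means adding the residual-side dropout yourself. A rough sketch (assuming `batch_first` inputs; the variable names are just illustrative):

```python
import torch
import torch.nn as nn

n_embd, n_head, p = 64, 4, 0.1

# torch's built-in module: `dropout=p` is applied to the attention weights only
# (the attn_dropout role); nothing is applied after the output projection
mha = nn.MultiheadAttention(n_embd, n_head, dropout=p, batch_first=True)

# to mirror Karpathy's layout, the second dropout has to be added manually
resid_dropout = nn.Dropout(p)

x = torch.randn(2, 8, n_embd)
y, _ = mha(x, x, x, need_weights=False)
y = resid_dropout(y)  # the resid_dropout that MultiheadAttention does not provide
print(y.shape)        # torch.Size([2, 8, 64])
```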