Hritikbansal closed this issue 2 years ago
Got Paul's reply over email:
In all layers, the gradient w.r.t. a given attention head is computed at each position and then summed, so we divide by the total number of positions (= tokens). However, in the last layer of BERT for classification, we only use the output at the position of the [CLS] token. This means that the gradients w.r.t. the heads at every other position are 0 (note: this is not the case for the MLM objective). Consequently, the norm of the gradient is comparatively much lower for heads in this layer, so for it we normalize only by the number of active positions, which is one per sentence (subset_size over the whole dataset).
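The two normalizations above can be sketched in a few lines. This is a hypothetical illustration, not the repo's code: the variable names (`head_importance`, `tot_tokens`, `subset_size`) mirror the idea, and the accumulated gradient norms are random placeholders.

```python
import numpy as np

n_layers, n_heads = 12, 12
subset_size = 8     # number of sentences in the evaluation subset (assumed)
tot_tokens = 120    # total number of tokens across those sentences (assumed)

# Stand-in for gradient norms accumulated per head over the whole dataset.
rng = np.random.default_rng(0)
head_importance = rng.random((n_layers, n_heads))

# Every layer except the last sees a non-zero gradient at every position,
# so its accumulated norms are averaged over all tokens...
head_importance[:-1] /= tot_tokens

# ...but in the last layer only the [CLS] position feeds the classification
# loss, so only one position per sentence is active: divide by subset_size.
head_importance[-1] /= subset_size
```

With this per-layer normalization, the last layer's importance scores stay on a comparable scale to the others instead of being deflated by a factor of (tokens per sentence).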
Hi,
I am trying to understand why we need a different normalization factor for the last layer of BERT compared to all the other layers:
https://github.com/pmichel31415/pytorch-pretrained-BERT/blob/18a86a7035cf8a48d16c101a66e439bf6ab342f1/examples/classifier_eval.py#L246 vs https://github.com/pmichel31415/pytorch-pretrained-BERT/blob/18a86a7035cf8a48d16c101a66e439bf6ab342f1/examples/classifier_eval.py#L247
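The zero-gradient behaviour Paul describes can be checked directly with a small PyTorch snippet (a minimal sketch, not the repo's code): if the loss reads only position 0 of the last layer's output, backprop leaves every other position's gradient at exactly zero.

```python
import torch

seq_len, hidden = 4, 3

# Stand-in for the last layer's output: one row per position.
last_layer_out = torch.randn(seq_len, hidden, requires_grad=True)

# Like BERT for classification, the "head" here reads only position 0
# (the [CLS] token); the sum() is just a toy scalar loss.
loss = last_layer_out[0].sum()
loss.backward()

# Rows 1..3 of the gradient are all zeros; only the [CLS] row is non-zero.
print(last_layer_out.grad)
```

For the MLM objective the loss touches many positions, so this zeroing does not occur and the uniform per-token normalization is appropriate.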