pmichel31415 / are-16-heads-really-better-than-1

Code for the paper "Are Sixteen Heads Really Better than One?"

Why do we need a different normalization for the last layer of BERT compared to all other layers during importance score calculation? #11

Closed Hritikbansal closed 2 years ago

Hritikbansal commented 2 years ago

Hi,

I am trying to understand why we need different normalization factors for the last layer of BERT compared to all the other layers.

https://github.com/pmichel31415/pytorch-pretrained-BERT/blob/18a86a7035cf8a48d16c101a66e439bf6ab342f1/examples/classifier_eval.py#L246 vs https://github.com/pmichel31415/pytorch-pretrained-BERT/blob/18a86a7035cf8a48d16c101a66e439bf6ab342f1/examples/classifier_eval.py#L247

Hritikbansal commented 2 years ago

Got Paul's reply over email:

In all layers, the gradients wrt. a given attention head are computed for each position and then summed. Therefore, we divide by the total number of positions (= tokens). However, in the last layer of BERT for classification, we only use the output at the position corresponding to the [CLS] token. This means that the gradients wrt. the heads at any other position will be 0 (note: this is not the case for the MLM objective). Consequently, the norm of the gradient will be comparatively much lower for heads in this layer. Therefore we normalize this layer only by the number of active positions, which is one per sentence (subset_size for the whole dataset).
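
To make the two normalizations concrete, here is a minimal sketch of the idea. The function name, tensor layout, and variable names (`head_importance`, `tot_tokens`, `n_sentences`) are assumptions for illustration, not the repo's exact code at the lines linked above:

```python
import torch

def normalize_head_importance(head_importance: torch.Tensor,
                              tot_tokens: float,
                              n_sentences: float) -> torch.Tensor:
    """Sketch of per-layer normalization of accumulated head-importance scores.

    head_importance: [n_layers, n_heads] tensor holding, for every head,
        the importance scores summed over all positions in the dataset.
    tot_tokens: total number of (non-padding) token positions seen.
    n_sentences: number of examples, i.e. one active [CLS] position each.
    """
    normalized = head_importance.clone()
    # All layers except the last: gradients flow through every token
    # position, so divide by the total number of tokens.
    normalized[:-1] /= tot_tokens
    # Last layer (the classification head only reads [CLS]): gradients at
    # every other position are zero, so divide by the number of active
    # positions, which is one per sentence (subset_size over the dataset).
    normalized[-1] /= n_sentences
    return normalized
```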