Hritikbansal closed this issue 2 years ago
Got Paul's reply over email:
In all layers, the gradient w.r.t. a given attention head is computed at each position and then summed, so we divide by the total number of positions (= tokens). However, in the last layer of BERT for classification, we only use the output at the position of the [CLS] token. This means that the gradients w.r.t. the heads at every other position are 0 (note: this is not the case for the MLM objective). Consequently, the norm of the gradient is comparatively much lower for heads in this layer, so for it we normalize only by the number of active positions, which is one per sentence (subset_size over the whole dataset).
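The two normalizations above can be sketched in a few lines. This is a hypothetical illustration, not the repo's code: the variable names (`head_importance`, `tot_tokens`, `subset_size`) mirror the idea, and the accumulated gradient norms are random placeholders.

```python
import numpy as np

n_layers, n_heads = 12, 12
subset_size = 8     # number of sentences in the evaluation subset (assumed)
tot_tokens = 120    # total number of tokens across those sentences (assumed)

# Stand-in for gradient norms accumulated per head over the whole dataset.
rng = np.random.default_rng(0)
head_importance = rng.random((n_layers, n_heads))

# Every layer except the last sees a non-zero gradient at every position,
# so its accumulated norms are averaged over all tokens...
head_importance[:-1] /= tot_tokens

# ...but in the last layer only the [CLS] position feeds the classification
# loss, so only one position per sentence is active: divide by subset_size.
head_importance[-1] /= subset_size
```

With this per-layer normalization, the last layer's importance scores stay on a comparable scale to the others instead of being deflated by a factor of (tokens per sentence).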
Hi,
I am trying to understand why we need a different normalization factor for the last layer of BERT compared to all the other layers:
https://github.com/pmichel31415/pytorch-pretrained-BERT/blob/18a86a7035cf8a48d16c101a66e439bf6ab342f1/examples/classifier_eval.py#L246 vs https://github.com/pmichel31415/pytorch-pretrained-BERT/blob/18a86a7035cf8a48d16c101a66e439bf6ab342f1/examples/classifier_eval.py#L247
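The zero-gradient behaviour Paul describes can be checked directly with a small PyTorch snippet (a minimal sketch, not the repo's code): if the loss reads only position 0 of the last layer's output, backprop leaves every other position's gradient at exactly zero.

```python
import torch

seq_len, hidden = 4, 3

# Stand-in for the last layer's output: one row per position.
last_layer_out = torch.randn(seq_len, hidden, requires_grad=True)

# Like BERT for classification, the "head" here reads only position 0
# (the [CLS] token); the sum() is just a toy scalar loss.
loss = last_layer_out[0].sum()
loss.backward()

# Rows 1..3 of the gradient are all zeros; only the [CLS] row is non-zero.
print(last_layer_out.grad)
```

For the MLM objective the loss touches many positions, so this zeroing does not occur and the uniform per-token normalization is appropriate.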