Hi,
first of all, thanks for your interesting work!
I'm trying to replicate the distillation concepts presented in the paper on a vanilla Transformer architecture, so my question is not strictly related to BERT / TinyBERT but is more on the Transformer side; still, I hope you can give me some hints.
In particular, I'm interested in distilling the attention matrices inside the self-attention layers. I see from the paper that you suggest distilling the unnormalized matrices before the softmax, since this leads to faster convergence. Given that, my question is: do I need to fit the matrices before or after applying the attention mask? I don't know much about BERT or its training protocol, so I honestly don't know whether the attention masking procedure is still used as in the original Transformer, and I didn't know where to look inside the code :)
Thanks in advance for your time. Alex

Oops, never mind: while debugging my code I realized this was a dumb question! Since the masked values are replaced by -np.inf, fitting the masked matrix doesn't make sense :)
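For anyone landing here later, the point above can be made concrete with a minimal NumPy sketch. It is only an illustration, not the paper's actual training code: the shapes, the causal mask, and the plain MSE loss are all assumptions on my part. It shows that an elementwise loss between teacher and student pre-softmax score matrices is finite, while the same loss computed after the mask has written `-np.inf` into the matrices collapses to NaN, because `(-inf) - (-inf)` is NaN.

```python
import numpy as np

def attention_scores(q, k):
    # Unnormalized attention logits: Q K^T / sqrt(d) (pre-softmax)
    return q @ k.T / np.sqrt(q.shape[-1])

rng = np.random.default_rng(0)
seq_len, d = 4, 8

# Toy stand-ins for teacher and student projections (illustrative only)
q_s, k_s = rng.normal(size=(seq_len, d)), rng.normal(size=(seq_len, d))
q_t, k_t = rng.normal(size=(seq_len, d)), rng.normal(size=(seq_len, d))

scores_student = attention_scores(q_s, k_s)
scores_teacher = attention_scores(q_t, k_t)

# Fitting the *unmasked* pre-softmax scores: a finite, well-defined loss
mse_unmasked = np.mean((scores_student - scores_teacher) ** 2)

# Applying a causal mask first sets future positions to -inf, so an
# elementwise loss over the full matrix degenerates: (-inf) - (-inf) = nan
causal_mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
masked_student = np.where(causal_mask, -np.inf, scores_student)
masked_teacher = np.where(causal_mask, -np.inf, scores_teacher)
mse_masked = np.mean((masked_student - masked_teacher) ** 2)

print(np.isfinite(mse_unmasked))  # True
print(np.isnan(mse_masked))       # True
```

So the practical takeaway matches the resolution above: compute the distillation loss on the scores before the mask is applied (or restrict the loss to the unmasked positions only).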