huawei-noah / Pretrained-Language-Model

Pretrained language model and its related optimization techniques developed by Huawei Noah's Ark Lab.

Question about the attention based distillation #90

Closed · alexmonti19 closed this issue 4 years ago

alexmonti19 commented 4 years ago

Hi, first of all thanks for your interesting work!

I'm trying to replicate the distillation concepts presented in the paper on a vanilla Transformer architecture, so my question is not strictly related to BERT / TinyBERT but is more on the Transformer side; still, I hope you can give me some hints.

In particular, I'm interested in distilling the attention matrices inside the self-attention layers. I see from the paper that you suggest distilling the unnormalized matrices before the softmax, since this leads to faster convergence. Given that, my question is: should I match the matrices before or after applying the attention mask? I don't know much about BERT or its training protocol, so I honestly don't know whether the attention masking procedure is still used as in the original Transformer, and I didn't know where to look inside the code :)

Thanks in advance for your time.

Alex
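
For reference, here is a minimal PyTorch sketch of vanilla scaled dot-product attention (names and shapes are illustrative, not taken from this repo), marking the three tensors the question distinguishes: the raw scores, the scores after the attention mask, and the post-softmax probabilities.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k, v: (batch, heads, seq, d_k); mask: broadcastable to (batch, heads, seq, seq),
    with 0 at positions that should not be attended to."""
    d_k = q.size(-1)
    # Unnormalized scores, before the attention mask is applied.
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    # Scores after the mask: disallowed positions are set to -inf,
    # so the softmax assigns them zero probability.
    masked_scores = scores if mask is None else scores.masked_fill(mask == 0, float('-inf'))
    # Normalized attention probabilities (post-softmax).
    probs = F.softmax(masked_scores, dim=-1)
    return probs @ v, scores, masked_scores, probs
```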

alexmonti19 commented 4 years ago

Oops, never mind; while debugging my code I realized this was a dumb question! Since the masked positions are filled with -np.inf before the softmax, matching the masked matrices doesn't make sense :)

Alex
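
For anyone landing here later, below is a minimal PyTorch sketch of the kind of attention-based distillation loss discussed above; the function and tensor names are hypothetical and this is not necessarily the repo's exact implementation. It matches the unnormalized, pre-mask scores of teacher and student with an MSE loss, optionally zeroing padded key positions so padding does not contribute.

```python
import torch
import torch.nn as nn

mse = nn.MSELoss()

def attention_distillation_loss(student_scores, teacher_scores, attention_mask=None):
    """student_scores / teacher_scores: (batch, heads, seq, seq) raw attention
    scores (Q K^T / sqrt(d_k)), taken before the attention mask and the softmax.
    attention_mask: optional (batch, 1, 1, seq) tensor with 1 for real tokens
    and 0 for padding."""
    if attention_mask is not None:
        # Zero the scores at padded key positions for both models, so the
        # loss only compares attention over real tokens.
        mask = attention_mask.to(student_scores.dtype)
        student_scores = student_scores * mask
        teacher_scores = teacher_scores * mask
    return mse(student_scores, teacher_scores)
```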