Hi,
first of all, thanks for your interesting work!
I'm trying to replicate the distillation concepts presented in the paper on a vanilla Transformer architecture, so my question is not strictly related to BERT / TinyBERT but is more on the Transformer side; still, I hope you can give me some hints.
In particular, I'm interested in distilling the attention matrices inside the self-attention layers. I see from the paper that you suggest distilling the unnormalized matrices before the softmax, since this leads to faster convergence. Given that, my question is: do I need to fit the matrices before or after applying the attention mask? I don't know much about BERT or its training protocol, so I honestly don't know whether the attention masking procedure is still used as in the original Transformer, and I didn't know where to look inside the code :)
Thanks in advance for your time. Alex

Oops, never mind: while debugging my code I realized this was a dumb question! Since the masked values are replaced by -np.inf, fitting the masked matrix doesn't make sense :)
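For anyone landing here later, the point above can be made concrete with a minimal NumPy sketch. It is only an illustration, not the paper's actual training code: the shapes, the causal mask, and the plain MSE loss are all assumptions on my part. It shows that an elementwise loss between teacher and student pre-softmax score matrices is finite, while the same loss computed after the mask has written `-np.inf` into the matrices collapses to NaN, because `(-inf) - (-inf)` is NaN.

```python
import numpy as np

def attention_scores(q, k):
    # Unnormalized attention logits: Q K^T / sqrt(d) (pre-softmax)
    return q @ k.T / np.sqrt(q.shape[-1])

rng = np.random.default_rng(0)
seq_len, d = 4, 8

# Toy stand-ins for teacher and student projections (illustrative only)
q_s, k_s = rng.normal(size=(seq_len, d)), rng.normal(size=(seq_len, d))
q_t, k_t = rng.normal(size=(seq_len, d)), rng.normal(size=(seq_len, d))

scores_student = attention_scores(q_s, k_s)
scores_teacher = attention_scores(q_t, k_t)

# Fitting the *unmasked* pre-softmax scores: a finite, well-defined loss
mse_unmasked = np.mean((scores_student - scores_teacher) ** 2)

# Applying a causal mask first sets future positions to -inf, so an
# elementwise loss over the full matrix degenerates: (-inf) - (-inf) = nan
causal_mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
masked_student = np.where(causal_mask, -np.inf, scores_student)
masked_teacher = np.where(causal_mask, -np.inf, scores_teacher)
mse_masked = np.mean((masked_student - masked_teacher) ** 2)

print(np.isfinite(mse_unmasked))  # True
print(np.isnan(mse_masked))       # True
```

So the practical takeaway matches the resolution above: compute the distillation loss on the scores before the mask is applied (or restrict the loss to the unmasked positions only).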