Thank you for sharing this repo.
I have a question about the loss used to train TinyBERT. Unlike DistilBERT, MobileBERT, and other distillation-based BERT variants, TinyBERT's training objective doesn't include a student loss against the actual (ground-truth) labels, only against the teacher's outputs, not even during task-specific distillation. Have you tried including such a hard-label term in the loss function?
Thanks in advance,