Closed MarsJacobs closed 3 years ago
Hi, the results of the top row in Table 6 come only from the KD loss (both intermediate layers and logits), just as the paper and the code show, and this produced the best results in our experiments. '-Trm' means we train the ternary model with only the logit distillation loss; it is there just to show the effectiveness of KD, compared with '-Trm-logits'. Thanks for your attention!
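For reference, the overall objective described above can be sketched as follows. This is a minimal pure-Python illustration of the idea, not the repo's actual code; all function and variable names here are my own:

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of logits
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def mse(a, b):
    # mean squared error between two equal-length vectors
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def kd_loss(student_logits, teacher_logits, student_hidden, teacher_hidden):
    # intermediate-layer loss: MSE between matching student/teacher layers
    layer_loss = sum(mse(s, t) for s, t in zip(student_hidden, teacher_hidden))
    # logit loss: soft cross-entropy against the teacher's distribution
    t_probs = softmax(teacher_logits)
    s_log_probs = [math.log(p) for p in softmax(student_logits)]
    logit_loss = -sum(p * lp for p, lp in zip(t_probs, s_log_probs))
    # note: no ground-truth label term anywhere in this objective
    return layer_loss + logit_loss
```

When student and teacher match exactly, the layer term is zero and the logit term reduces to the entropy of the teacher's output distribution.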
Thanks for your kind reply! Would it be okay if I ask one more question regarding this?
Is there any specific reason for using only the KD loss? (Is it just because the KD loss alone performs better than the KD loss plus the GT label loss?)
Using only the KD loss just follows our experience from TinyBERT, and in our experiments adding the GT label loss did not improve the final result. But I'm not saying that adding the GT label loss is useless; there might be something to explore that we didn't.
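For anyone who does want to experiment with adding a GT label term, one straightforward way is a weighted mix of the two objectives. This is a hypothetical sketch, not something from the paper or the repo; `alpha` and the names below are my own:

```python
import math

def label_cross_entropy(logits, label):
    # standard cross-entropy against a hard ground-truth label index,
    # computed via the log-sum-exp trick for numerical stability
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]

def combined_loss(kd_term, logits, label, alpha=0.9):
    # hypothetical mix: alpha * distillation loss + (1 - alpha) * GT label CE;
    # alpha=1.0 recovers the KD-only objective actually used in the repo
    return alpha * kd_term + (1 - alpha) * label_cross_entropy(logits, label)
```

Whether such a mix helps would have to be checked empirically per task, as the authors note above.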
Thanks for your reply!
Hi, thanks for this great source code. It really helps me a lot!
While studying TernaryBERT with the paper and this source code, I have a question about the KD training loss. In the paper's Algorithm 1, it says that when computing the gradient, only the distillation loss is used, not the distillation loss plus the GT label cross-entropy loss.
Also, in the source code, only the KD loss is used for the backward pass: https://github.com/huawei-noah/Pretrained-Language-Model/blob/54ca698e4f907f32a108de371a42b76f92e7686d/TernaryBERT/quant_task_glue.py#L363-L392
Does TernaryBERT use only the KD loss, and not the ground-truth labels, as its training objective?
In the paper's ablation study, the bottom-row performance (-Trm-Logits) means it uses the GT label loss. Would it then be possible to say that the TernaryBERT top row uses all three losses (Trm, logits, and GT label)?
I'm a little confused about which losses I should use while reproducing TernaryBERT's performance. It would be very helpful if you could answer my question.
Thanks in advance!