huawei-noah / Pretrained-Language-Model

Pretrained language model and its related optimization techniques developed by Huawei Noah's Ark Lab.

Does TernaryBERT only use the KD loss (teacher-student loss) during training? #145

Closed · MarsJacobs closed this 3 years ago

MarsJacobs commented 3 years ago

Hi, thanks for this great source code. It really helps me a lot!

While studying TernaryBERT through the paper and this source code, I have a question about the KD training loss. In the paper's Algorithm 1, the gradient is computed from the distillation loss only, not from the distillation loss plus the ground-truth (GT) label cross-entropy loss.

[Screenshot: Algorithm 1 from the TernaryBERT paper]

The source code likewise backpropagates only the KD loss: https://github.com/huawei-noah/Pretrained-Language-Model/blob/54ca698e4f907f32a108de371a42b76f92e7686d/TernaryBERT/quant_task_glue.py#L363-L392
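My reading is that the linked region boils down to something like this minimal PyTorch sketch (the names are illustrative and details such as attention masking are omitted; this is my paraphrase, not the repo's exact code):

```python
import torch.nn.functional as F

def soft_cross_entropy(student_logits, teacher_logits):
    # Logit distillation: cross-entropy of the student against the
    # teacher's soft targets.
    student_log_probs = F.log_softmax(student_logits, dim=-1)
    teacher_probs = F.softmax(teacher_logits, dim=-1)
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()

def kd_loss(student_logits, student_atts, student_reps,
            teacher_logits, teacher_atts, teacher_reps):
    # Distillation on the classifier logits.
    cls_loss = soft_cross_entropy(student_logits, teacher_logits)
    # Intermediate-layer distillation: MSE on per-layer attention scores
    # and hidden states (including the embedding-layer output).
    att_loss = sum(F.mse_loss(s, t) for s, t in zip(student_atts, teacher_atts))
    rep_loss = sum(F.mse_loss(s, t) for s, t in zip(student_reps, teacher_reps))
    # Note: no cross-entropy against the ground-truth labels anywhere.
    return cls_loss + att_loss + rep_loss
```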

Does TernaryBERT really use only the KD loss, without the ground-truth labels, as its training objective?

[Screenshot: Table 6 (ablation study) from the TernaryBERT paper]

In the paper's ablation study, the bottom row (-Trm-logits) means the model is trained with the GT label loss. Would it then be correct to say that the top row (TernaryBERT) uses all three losses (Trm, logits, and GT label)?

I'm a little confused about which losses I should use when reproducing TernaryBERT's performance. It would be very helpful if you could answer my question.

Thanks in advance!

itsucks commented 3 years ago

Hi, the top-row results in Table 6 come only from the KD loss (over both the intermediate layers and the logits), just as the paper and the code show, and this produced the best results in our experiments. '-Trm' means we train the ternary model with only the logit distillation loss; compared with '-Trm-logits', it is meant to show the effectiveness of KD. Thanks for your attention!
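To put the three rows in symbols (my shorthand, not notation copied verbatim from the paper), with $\mathcal{L}_{trm}$ the intermediate-layer distillation loss, $\mathcal{L}_{pred}$ the logit distillation loss, and $\mathcal{L}_{ce}$ the cross-entropy against the ground-truth labels:

```latex
\begin{aligned}
\text{TernaryBERT (top row):} \quad & \mathcal{L} = \mathcal{L}_{trm} + \mathcal{L}_{pred} \\
\text{-Trm:}                  \quad & \mathcal{L} = \mathcal{L}_{pred} \\
\text{-Trm-logits:}           \quad & \mathcal{L} = \mathcal{L}_{ce}
\end{aligned}
```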

MarsJacobs commented 3 years ago

Thanks for your kind reply! Would it be okay if I asked one more question about this?

Is there any specific reason for using only the KD loss? (Is it just that the KD loss alone performs better than the KD loss plus the GT label loss?)

itsucks commented 3 years ago

Using only the KD loss just follows our experience with TinyBERT, and in our experiments adding the GT label loss did not improve the final result. That's not to say the GT label loss is useless; there may be something there to explore that we didn't.
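If you want to try it yourself, the combination would look something like this hypothetical sketch (`lam` is an illustrative weighting hyperparameter, not a setting from the repo):

```python
import torch.nn.functional as F

def combined_loss(kd_loss_value, student_logits, labels, lam=1.0):
    # Hypothetical objective: KD loss plus weighted ground-truth cross-entropy.
    # `lam` is an illustrative hyperparameter, not something from the repo.
    gt_loss = F.cross_entropy(student_logits, labels)
    return kd_loss_value + lam * gt_loss
```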

MarsJacobs commented 3 years ago

Thanks for your reply!