imirzadeh / Teacher-Assistant-Knowledge-Distillation

Using Teacher Assistants to Improve Knowledge Distillation: https://arxiv.org/pdf/1902.03393.pdf
MIT License

Choice of Loss Function #4

Closed adrianloy closed 5 years ago

adrianloy commented 5 years ago

Hi and thanks for sharing your code. Why did you choose to use the Kullback-Leibler divergence loss? I think in the original Hinton paper it's only used when training from an ensemble, to weight the different teacher distributions. When training from only one model, cross entropy is usually used between the high-temperature softmax outputs. Did you run any experiments with CE loss? Or is there a specific reason to use KLDiv loss instead?

aminshabani commented 5 years ago

Hi, actually it is the same here. The cross entropy equals the teacher's entropy plus the KL divergence, and since the teacher's entropy is constant with respect to the student, it does not affect the gradients, so you can use either loss.
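A minimal sketch (not from the repo) that checks this numerically: the gradients of the KL loss and of the soft-target cross entropy with respect to the student logits coincide. The temperature value and the random logits are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

T = 4.0  # illustrative temperature, not the repo's setting
teacher_logits = torch.randn(8, 10)
student_logits = torch.randn(8, 10, requires_grad=True)

teacher_probs = F.softmax(teacher_logits / T, dim=1)
student_log_probs = F.log_softmax(student_logits / T, dim=1)

# KL divergence between teacher and student soft distributions
kl = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
grad_kl, = torch.autograd.grad(kl, student_logits, retain_graph=True)

# Soft-target cross entropy: -sum_k p_teacher(k) * log p_student(k)
ce = -(teacher_probs * student_log_probs).sum(dim=1).mean()
grad_ce, = torch.autograd.grad(ce, student_logits)

# The two losses differ only by the (constant) teacher entropy,
# so their gradients w.r.t. the student logits match.
print(torch.allclose(grad_kl, grad_ce, atol=1e-6))  # True
```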

adrianloy commented 5 years ago

Ah right, I noticed that the scalar value of the loss is different but didn't think about the gradients being the same. Thanks.