Closed: adrianloy closed this issue 5 years ago
Hi, and thanks for sharing your code. Why did you choose the Kullback-Leibler divergence loss? I think in the original Hinton paper it is only used when training from an ensemble, to weight the different teacher distributions. When training from a single teacher, the cross-entropy between the high-temperature softmax outputs is usually used. Did you run any experiments with the CE loss, or is there a specific reason to use the KL-divergence loss instead?

Hi. Actually, it is the same here: since the entropy of the teacher is constant and does not affect the gradients, and the cross-entropy equals the sum of the teacher's entropy and the KL loss, you can use either.

Ah right, I noticed that the scalar value of the loss is different but didn't think about the gradients being the same. Thanks.
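To illustrate the point above, here is a minimal sketch (assuming a PyTorch-style setup with temperature-scaled logits; the tensors and the temperature value are illustrative, not taken from this repository) showing that the soft-target cross-entropy and the KL-divergence loss differ only by the teacher's entropy, so their gradients with respect to the student logits are identical:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T = 4.0  # distillation temperature (illustrative value)

teacher_logits = torch.randn(8, 10)                       # stand-in teacher outputs
student_logits = torch.randn(8, 10, requires_grad=True)   # stand-in student outputs

p = F.softmax(teacher_logits / T, dim=1)          # teacher soft targets
log_q = F.log_softmax(student_logits / T, dim=1)  # student log-probabilities

# Cross-entropy with soft targets: H(p, q) = -sum_k p_k * log q_k
ce = -(p * log_q).sum(dim=1).mean()

# KL divergence: KL(p || q) = H(p, q) - H(p), where H(p) is the teacher entropy
kl = F.kl_div(log_q, p, reduction="batchmean")

# The two losses differ only by H(p), which is constant w.r.t. the student,
# so the gradients on the student logits are the same.
grad_ce = torch.autograd.grad(ce, student_logits, retain_graph=True)[0]
grad_kl = torch.autograd.grad(kl, student_logits)[0]

print(torch.allclose(grad_ce, grad_kl, atol=1e-6))  # prints: True
```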