Closed ArEnSc closed 8 years ago
In Hinton's slides, z is a score vector, and to get a distribution over labels you take y = softmax(z). In my blog post, y is the distribution vector, not the score vector.
So, to change the temperature of z, you simply divide by T - no arguments there. However, to change the temperature when you already have y, you have to take it to the power of 1/T and then re-normalize.
It follows from the fact that exp(z/T) = exp(z)^(1/T). Makes sense?
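A quick numeric check of this equivalence (a sketch using NumPy; the particular values of z and T are made up for illustration):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])  # hypothetical score vector
T = 2.0                        # temperature

# Route 1: divide the scores by T, then softmax
y_from_scores = softmax(z / T)

# Route 2: start from the distribution y = softmax(z),
# raise it to the power 1/T, and re-normalize
y = softmax(z)
p = y ** (1.0 / T)
y_from_dist = p / p.sum()

print(np.allclose(y_from_scores, y_from_dist))  # the two routes agree
```

The constant normalizer picked up by exp(z)^(1/T) cancels in the re-normalization, which is why the two routes give the same distribution.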
Yes it does, thank you! I apologise, I was reading this paper quite late last night. I have a related question: is it possible to distil knowledge from a single model, or does this technique only work for ensembles of models? Does the transfer work in the case of one model being compressed into a distilled model?
I have not worked on this myself, but I have heard many accounts of people taking a single complicated network (not necessarily ensemble) and using the student-teacher paradigm to compress it into a smaller one.
Are there any papers on this topic?
Thanks!
http://deepdish.io/2014/10/28/hintons-dark-knowledge/
I was wondering if this was a mistake. Hinton, in his lecture slides, says that raising the temperature gives
y_k / T
whereas you have it as
y_k^(1/T)
which would be a square root for T = 2? Can you explain? This occurs in the denominator as well.