uchicago-cs / deepdish

Flexible HDF5 saving/loading and other data science tools from the University of Chicago
http://deepdish.io
BSD 3-Clause "New" or "Revised" License

A Mistake? #17

Closed ArEnSc closed 8 years ago

ArEnSc commented 8 years ago

http://deepdish.io/2014/10/28/hintons-dark-knowledge/

I was wondering if this is a mistake. In his lecture slides, Hinton says that raising the temperature means computing

y_k / T

whereas you have it as

y_k^(1/T)

which is a square root when T = 2. Can you explain? This applies to the denominator as well.

ArEnSc commented 8 years ago

http://www.ttic.edu/dl/dark14.pdf slide 6

ArEnSc commented 8 years ago

http://arxiv.org/pdf/1503.02531v1.pdf page 2

gustavla commented 8 years ago

In Hinton's slides, z is a score vector, and to get a distribution over labels you take y = softmax(z). In my blog post, y is the distribution vector, not the score vector.

So, to change the temperature of z, you simply divide by T - no arguments there. However, to change the temperature when you already have y, you have to take it to the power of 1/T and then re-normalize.

This follows from exp(z_k/T) = exp(z_k)^(1/T): since y_k = exp(z_k) / Z with Z = sum_j exp(z_j), raising y_k to the power 1/T gives exp(z_k/T) / Z^(1/T), and the Z^(1/T) factor is the same for every k, so it drops out when you re-normalize. Makes sense?
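
Here is a minimal NumPy sketch (not from the post; the vector z and T = 2 are just illustrative values) checking numerically that the two routes give the same distribution:

```python
import numpy as np

def softmax(z):
    # Shift by the max for numerical stability; the constant cancels in the ratio.
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])  # arbitrary score vector
T = 2.0                        # temperature

# Route 1: temper the scores, then take the softmax.
p_from_scores = softmax(z / T)

# Route 2: take the softmax first, then temper the distribution and re-normalize.
y = softmax(z)
p_from_dist = y ** (1.0 / T)
p_from_dist /= p_from_dist.sum()

print(np.allclose(p_from_scores, p_from_dist))  # True
```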

ArEnSc commented 8 years ago

Yes it does, thank you! I apologise, I was reading the paper quite late last night. I have a related question: is it possible to distil knowledge from a single model, or does this technique only work for ensembles of models? Does the function transfer work in the case of one model to a distilled model?

gustavla commented 8 years ago

I have not worked on this myself, but I have heard many accounts of people taking a single complicated network (not necessarily an ensemble) and using the student-teacher paradigm to compress it into a smaller one.

ArEnSc commented 8 years ago

Are there any papers on this topic?

ArEnSc commented 8 years ago

Thanks!