Open tomato18463 opened 1 year ago
Sorry for my late reply. Our method reuses the blocks in the teacher's detection head, which generally have the same channel number as the student's thanks to the FPN. Adding a conv before reusing the teacher's blocks also works, and showed no performance drop in my early experiments.
Hi,
Thanks for the paper and code. I get the idea of feeding the student's backbone features to the teacher's prediction head. My question is: does this require the student's backbone to have the same number of output channels as the teacher's (which rarely seems to be the case for networks of different sizes)? Also, how does the method perform if the student's and teacher's backbones have different numbers of output channels, so that the channels have to be aligned in some way, e.g. by adding a conv layer? Do you have any empirical results on this? Thank you for your help!
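For what it's worth, here is a minimal NumPy sketch of the channel-alignment idea being discussed: a 1x1 convolution is just a per-pixel linear map over channels, so a student feature map can be projected to the teacher head's expected channel count before being fed in. The channel counts (96 student, 256 teacher) and the random adapter weights are purely hypothetical, not taken from the paper.

```python
import numpy as np

def conv1x1(feat, weight):
    """Apply a 1x1 convolution (no bias) to a feature map.

    feat:   (C_in, H, W) feature map
    weight: (C_out, C_in) projection matrix
    Returns (C_out, H, W).
    """
    c_in, h, w = feat.shape
    # A 1x1 conv is a matrix multiply over the channel dimension,
    # applied independently at every spatial location.
    return (weight @ feat.reshape(c_in, -1)).reshape(weight.shape[0], h, w)

# Hypothetical example: student FPN outputs 96 channels,
# teacher's head expects 256.
student_feat = np.random.randn(96, 32, 32)
adapter = np.random.randn(256, 96) * 0.01  # learnable in practice
aligned = conv1x1(student_feat, adapter)
print(aligned.shape)  # (256, 32, 32) -- now matches the teacher's head
```

In a real training setup this adapter would be a learnable layer optimized jointly with the student, which matches the "adding a conv before reusing teacher's blocks" variant mentioned above.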