Basically, knowledge distillation aims to obtain a smaller student model from a (typically larger) teacher model by matching the information hidden in the teacher. This information can be: final soft predictions, intermediate features, attention maps, or relations between samples. See this complete review.
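For reference, a minimal sketch of the vanilla (Hinton-style) KD loss that matches final soft predictions; the temperature value is illustrative, not taken from any of the papers discussed here:

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """Match softened teacher/student predictions with KL divergence."""
    # Soften both distributions with the same temperature.
    log_p_student = F.log_softmax(student_logits / temperature, dim=1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2
```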
Conventional knowledge distillation (KD) mimics the teacher's single prediction on each image. SSKD additionally mimics the teacher's predictions on a self-supervised contrastive task built from transformed versions of the image (as sketched below).
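A rough sketch of what "mimicking the teacher on contrastive pairs" looks like. This is not the paper's exact formulation: the feature names, the cosine-similarity head, and the temperature are my assumptions.

```python
import torch.nn.functional as F

def sskd_style_loss(student_feat, student_feat_aug,
                    teacher_feat, teacher_feat_aug, temperature=0.5):
    """Student mimics the teacher's soft predictions on a self-supervised
    contrastive task: matching each augmented image to its original.
    Inputs are (B, D) feature batches from the two networks."""
    def contrastive_logits(feat, feat_aug):
        # Cosine-similarity matrix (B x B) between augmented and original views,
        # treated as logits over "which original matches this augmentation".
        feat = F.normalize(feat, dim=1)
        feat_aug = F.normalize(feat_aug, dim=1)
        return feat_aug @ feat.t() / temperature

    t_logits = contrastive_logits(teacher_feat, teacher_feat_aug)
    s_logits = contrastive_logits(student_feat, student_feat_aug)
    # KL between the student's and the (frozen) teacher's contrastive predictions.
    return F.kl_div(F.log_softmax(s_logits, dim=1),
                    F.softmax(t_logits.detach(), dim=1),
                    reduction="batchmean")
```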
I haven't verified which method in this blogpost is the real SoTA… Let's just appreciate their ideas LOL.
When learning the uncertainty variance, they found that the variance becomes small at the beginning of PAD training and then stays stable. They therefore added an additional "warm-up" experiment to the table above and found that it performs slightly better than the baselines but worse than the weights learned by PAD. Finally, combining PAD with the warm-up training schedule achieves the best results.
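A minimal sketch of the uncertainty-weighted distillation term that PAD's per-sample variance idea corresponds to; the linear head, feature dimension, and loss form (Gaussian negative log-likelihood over feature differences) are my assumptions, not the paper's exact implementation. Under this sketch, the "warm-up" scheduling above would roughly amount to distilling with uniform weights for a few epochs before letting the variance head take over.

```python
import torch
import torch.nn as nn

class UncertaintyWeightedDistill(nn.Module):
    """Sketch of adaptive (PAD-style) distillation: a small head predicts a
    per-sample log-variance, and samples with high predicted uncertainty are
    down-weighted in the distillation loss."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.log_var_head = nn.Linear(feat_dim, 1)  # predicts log sigma^2 per sample

    def forward(self, student_feat, teacher_feat):
        log_var = self.log_var_head(student_feat).squeeze(1)         # (B,)
        mse = (student_feat - teacher_feat.detach()).pow(2).mean(1)  # (B,)
        # Gaussian-likelihood weighting: larger variance shrinks a sample's
        # distillation term, while the log_var term penalizes trivial solutions.
        loss = 0.5 * (mse * torch.exp(-log_var) + log_var)
        return loss.mean()
```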
Metadata: Knowledge Distillation Meets Self-Supervision