
Knowledge Distillation Meets Self-Supervision & Prime-Aware Adaptive Distillation #75

howardyclo opened this issue 3 years ago

howardyclo commented 3 years ago

Metadata: Knowledge Distillation Meets Self-Supervision

howardyclo commented 3 years ago

Prior Approaches on Knowledge Distillation

Basically, knowledge distillation aims to obtain a smaller student model from a (typically larger) teacher model by matching information hidden in the teacher. The information can be final soft predictions, intermediate features, attention maps, or relations between samples. See this complete review.
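
For context, a minimal sketch of the vanilla soft-prediction matching (Hinton-style KD) in PyTorch; `T` and `alpha` here are illustrative hyperparameters, not values from either paper:

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Hinton-style KD: match softened teacher predictions + usual CE on labels."""
    # Soft targets from the teacher (no gradient flows into the teacher).
    soft_targets = F.softmax(teacher_logits.detach() / T, dim=1)
    log_student = F.log_softmax(student_logits / T, dim=1)
    # KL between softened distributions; T^2 keeps gradient magnitudes comparable.
    distill = F.kl_div(log_student, soft_targets, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * distill + (1.0 - alpha) * ce
```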

Highlights

Conventional knowledge distillation (KD) has the student mimic the teacher's single prediction on an image. SSKD additionally has the student mimic the teacher's predictions on self-supervised contrastive pairs (a sketch of this extra signal follows below).
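
My rough sketch of that extra signal (not the authors' code): build transformed copies of each image, get features for the originals and the copies from both networks, and let the student match the teacher's softmax over the pairwise similarities. The feature names and temperature `T` below are placeholders I made up for illustration.

```python
import torch
import torch.nn.functional as F

def sskd_style_loss(feat_s, feat_s_aug, feat_t, feat_t_aug, T=0.5):
    """Match the teacher's distribution over (transformed, original) contrastive pairs.

    feat_*     : features of the original images,      shape (N, D)
    feat_*_aug : features of their transformed copies, shape (N, D)
    Simplified sketch of the SSKD idea, not the official implementation.
    """
    def pair_logits(anchor, other):
        # Cosine-similarity matrix: row i = transformed image i vs. all originals.
        anchor = F.normalize(anchor, dim=1)
        other = F.normalize(other, dim=1)
        return anchor @ other.t() / T  # (N, N)

    with torch.no_grad():
        p_teacher = F.softmax(pair_logits(feat_t_aug, feat_t), dim=1)
    log_p_student = F.log_softmax(pair_logits(feat_s_aug, feat_s), dim=1)
    # Student mimics the teacher's soft assignment over contrastive pairs.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")
```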

Methods

howardyclo commented 3 years ago

Metadata: Prime-Aware Adaptive Distillation

howardyclo commented 3 years ago

Highlights

I haven't verified which method in this blog post is the real SoTA… Let's just appreciate their ideas LOL.

Methods

Findings

When learning the uncertainty variance, they found that the variance becomes small at the beginning of PAD training and then stays stable. Therefore they added an additional "warm-up" experiment to the above table and found that it performs slightly better than the baselines but worse than the weights learned by PAD. Finally, combining PAD with the warm-up training schedule achieves even better results.
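
The adaptive weighting is in the spirit of heteroscedastic-uncertainty regression losses; here is my simplified sketch, assuming a per-sample log-variance `log_var` predicted by some small head (the head itself is omitted, and the names are placeholders):

```python
import torch

def pad_style_loss(feat_s, feat_t, log_var):
    """Uncertainty-weighted feature distillation (simplified PAD-style objective).

    feat_s, feat_t : student / teacher features, shape (N, D)
    log_var        : predicted per-sample log-variance, shape (N,)
    Prime (low-uncertainty) samples get larger weights; the log-variance term
    keeps the model from predicting huge uncertainty everywhere.
    """
    sq_err = (feat_s - feat_t.detach()).pow(2).mean(dim=1)  # (N,) per-sample error
    precision = torch.exp(-log_var)                          # 1 / sigma^2
    return (precision * sq_err + log_var).mean()
```

The warm-up scheduling mentioned above presumably amounts to training with fixed uniform weights for the first epochs before letting `log_var` take over.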

howardyclo commented 3 years ago

Further Readings