OutstanderWang opened this issue 3 years ago
We use the same random crop + flip augmentation for the teacher and the student, so they see the same image (we didn't find using different augmentations helpful, but that could be due to a lack of hyperparameter tuning). The student model has the same backbone as the teacher but without the projection head, so there is a small difference in architecture. Since the teacher provides soft labels, it is not easy for the student to replicate the teacher exactly, so there may be a regularization effect.
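Roughly, the setup looks like the sketch below (PyTorch-style, not the actual repo code; `teacher`, `student`, `augment`, and `optimizer` are hypothetical placeholders): one augmented view is shared by both models, and the student is trained against the teacher's temperature-softened output.

```python
import torch
import torch.nn.functional as F

def distill_step(x, teacher, student, augment, optimizer, tau=1.0):
    """One distillation step on an unlabeled batch x.

    teacher/student: classification networks (student has no projection head).
    augment: callable applying random crop + flip.
    tau: temperature, shared by teacher and student.
    """
    x_aug = augment(x)                      # single augmented view, fed to both models

    with torch.no_grad():
        t_logits = teacher(x_aug)           # teacher is frozen during distillation
    s_logits = student(x_aug)

    # Cross-entropy against the teacher's temperature-softened distribution
    # (soft labels), which the student generally cannot fit exactly.
    t_probs = F.softmax(t_logits / tau, dim=-1)
    loss = -(t_probs * F.log_softmax(s_logits / tau, dim=-1)).sum(dim=-1).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```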
In the paper, the authors claim that distillation with unlabeled examples improves fine-tuned models in two ways, as shown in Figure 6: (1) when the student model has a smaller architecture than the teacher model, it improves model efficiency by transferring task-specific knowledge to the student; (2) even when the student model has the same architecture as the teacher model (excluding the projection head after the ResNet encoder), self-distillation can still meaningfully improve semi-supervised learning performance.
It is well known that distillation can compress a model (the first point). But why can knowledge distillation improve performance? I would think that, at best, the student model can only predict as well as the teacher, since the student is trained to fit the teacher's output.
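As far as I can tell, the student minimizes a cross-entropy against the teacher's temperature-softened outputs on unlabeled data, something like:

$$\mathcal{L}^{\text{distill}} = -\sum_{x_i \in \mathcal{D}} \sum_{y} P^{T}(y \mid x_i; \tau)\, \log P^{S}(y \mid x_i; \tau),
\quad\text{where}\quad
P(y \mid x_i; \tau) = \frac{\exp\!\big(f^{\text{task}}(x_i)[y]/\tau\big)}{\sum_{y'} \exp\!\big(f^{\text{task}}(x_i)[y']/\tau\big)}.$$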
Do the teacher and student models augment the input x in different ways, like Noisy Student? I didn't see that in the paper or the Colab code. Furthermore, the temperature of the student and teacher models is the same. @chentingpc