OutstanderWang opened this issue 3 years ago
We use the same random crop + flip augmentation for the teacher and the student, so they see the same image (we didn't find using different augmentations helpful, but that could be due to a lack of hyperparameter tuning). The student model has the same backbone as the teacher but without the projection head, so there is a small difference in architecture. Since the teacher provides soft labels, it is not easy for the student to replicate the teacher exactly, so there may be a regularization effect.
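Roughly, the setup looks like the sketch below (PyTorch-style, not the actual repo code; `teacher`, `student`, `augment`, and `optimizer` are hypothetical placeholders): one augmented view is shared by both models, and the student is trained against the teacher's temperature-softened output.

```python
import torch
import torch.nn.functional as F

def distill_step(x, teacher, student, augment, optimizer, tau=1.0):
    """One distillation step on an unlabeled batch x.

    teacher/student: classification networks (student has no projection head).
    augment: callable applying random crop + flip.
    tau: temperature, shared by teacher and student.
    """
    x_aug = augment(x)                      # single augmented view, fed to both models

    with torch.no_grad():
        t_logits = teacher(x_aug)           # teacher is frozen during distillation
    s_logits = student(x_aug)

    # Cross-entropy against the teacher's temperature-softened distribution
    # (soft labels), which the student generally cannot fit exactly.
    t_probs = F.softmax(t_logits / tau, dim=-1)
    loss = -(t_probs * F.log_softmax(s_logits / tau, dim=-1)).sum(dim=-1).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```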
In the paper, the authors claim that distillation with unlabeled examples improves fine-tuned models in two ways, as shown in Figure 6: (1) when the student model has a smaller architecture than the teacher model, it improves model efficiency by transferring task-specific knowledge to the student; (2) even when the student model has the same architecture as the teacher model (excluding the projection head after the ResNet encoder), self-distillation can still meaningfully improve semi-supervised learning performance.
It is well known that distillation can compress a model (the first point). But why can knowledge distillation improve performance? I would think that, at best, the student model can only predict as well as the teacher, since the student is trained to fit the teacher's output.
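As far as I can tell, the student minimizes a cross-entropy against the teacher's temperature-softened outputs on unlabeled data, something like:

$$\mathcal{L}^{\text{distill}} = -\sum_{x_i \in \mathcal{D}} \sum_{y} P^{T}(y \mid x_i; \tau)\, \log P^{S}(y \mid x_i; \tau),
\quad\text{where}\quad
P(y \mid x_i; \tau) = \frac{\exp\!\big(f^{\text{task}}(x_i)[y]/\tau\big)}{\sum_{y'} \exp\!\big(f^{\text{task}}(x_i)[y']/\tau\big)}.$$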
Do the teacher and student models augment the input x in different ways, like Noisy Student? I didn't see that in the paper or the Colab code. Furthermore, the temperature of the student and teacher models is the same. @chentingpc