Closed dizzy144221 closed 3 months ago
Thanks for your interest in our work.
The transforms used for both validation and testing are:
[transforms.Resize(256, interpolation=3),
transforms.CenterCrop(224),
transforms.ToTensor(),
transforms.Normalize(mean=IMAGENET_DEFAULT_MEAN, std=IMAGENET_DEFAULT_STD)]
During training, the augmentations were applied to the entire training set; the original un-augmented images were not included.
Best regards.
I get it.
And I have one last question that I hope you can help with. Why is it that when the ViT student tries to mimic the poorer results of a CNN, it actually ends up performing better? Why might the ViT be able to improve upon the inductive biases of the teacher instead of also deteriorating? This seems quite counterintuitive.
Thanks for your help again.
Hi @dizzy144221,
Thanks for your question. The CNN teacher performs poorly on out-of-distribution data, but its performance on the in-distribution test data is still correct. Hence, even though the function learned by the CNN is distilled into the ViT partly through incorrect predictions, the distilled ViT performs well on the in-distribution test data and shows improved performance.
Additionally, we experimentally observe that the teacher's predictions for out-of-distribution images have higher entropy, i.e. they are more informative soft labels to distill from the CNN (Figure 3 of our paper).
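A minimal sketch (not from the paper's code) of the entropy intuition: a flat, uncertain softmax output has higher Shannon entropy than a confident, peaked one, so it carries more relative information about inter-class similarity for distillation. The probability vectors below are illustrative, not taken from any model:

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)

confident = [0.97, 0.01, 0.01, 0.01]  # peaked, in-distribution-style prediction
uncertain = [0.40, 0.30, 0.20, 0.10]  # flat, out-of-distribution-style prediction

# The flatter prediction has higher entropy and richer soft-label structure.
assert entropy(uncertain) > entropy(confident)
```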
This is also the major contribution of our work. Please feel free to comment in case you need any further clarification.
Regards.
Thank you for your great work on this project. I have a couple of questions regarding the data augmentation strategy:
Thank you in advance for your time.