val-iisc / DeiT-LT

[CVPR 2024] Code for our Paper "DeiT-LT: Distillation Strikes Back for Vision Transformer training on Long-Tailed Datasets"
https://rangwani-harsh.github.io/DeiT-LT/
MIT License

Question about data augmentation #1

Closed dizzy144221 closed 3 months ago

dizzy144221 commented 4 months ago

Thank you for your great work on this project. I have a couple of questions regarding the data augmentation strategy:

  1. During the training process, did you apply any data augmentation techniques to the validation set? And during inference, did you apply data augmentation to the test set?
  2. During the training phase, did you only apply data augmentation to the entire training set, or did you also use the original, unaugmented data for training as well?

Thank you in advance for your time.

pradipto111 commented 4 months ago

Thanks for your interest in our work.

  1. The augmentations used for both validation and testing are:

     ```python
     [transforms.Resize(256, interpolation=3),
      transforms.CenterCrop(224),
      transforms.ToTensor(),
      transforms.Normalize(mean=IMAGENET_DEFAULT_MEAN, std=IMAGENET_DEFAULT_STD)]
     ```
  2. During training, the augmentations were applied to the entire training set; the original, un-augmented images were not included.

Best regards.

dizzy144221 commented 4 months ago

I get it.

And I have one last question that I hope you can help with. Why is it that when the ViT student tries to mimic the poorer results of a CNN, it actually ends up performing better? Why might the ViT be able to improve upon the inductive biases of the teacher instead of also deteriorating? This seems quite counterintuitive.

Thanks for your help again.

pradipto111 commented 4 months ago

Hi @dizzy144221,

Thanks for your question. The CNN teacher performs poorly on out-of-distribution data, but it performs well on the in-distribution test data. Hence, even though the function learned by the CNN is distilled into the ViT partly through incorrect predictions, the distilled ViT still performs well on the in-distribution test data and shows improved performance.

Additionally, we experimentally observe that teacher predictions for out-of-distribution images have higher entropy, i.e. they are more informative to distill from the CNN (Figure 3 of our paper).
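To make the entropy point concrete, here is a minimal sketch (not from the paper's codebase) of how one would measure the Shannon entropy of a teacher's softmax predictions; the logit values are hypothetical, contrasting a confident in-distribution-like prediction with a diffuse OOD-like one:

```python
import torch
import torch.nn.functional as F

def prediction_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Shannon entropy of the softmax distribution, per sample (in nats)."""
    probs = F.softmax(logits, dim=-1)
    return -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)

# Hypothetical teacher logits over 4 classes:
confident = torch.tensor([[8.0, 0.1, 0.1, 0.1]])  # near one-hot (in-distribution-like)
diffuse = torch.tensor([[1.0, 0.9, 0.8, 0.7]])    # near uniform (OOD-like)

print(prediction_entropy(confident).item())  # low entropy
print(prediction_entropy(diffuse).item())    # high entropy, close to ln(4)
```

A near-uniform distribution carries per-class information about how the teacher relates the classes, which is what makes higher-entropy targets richer distillation signals than hard one-hot labels.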

This is a major contribution of our work as well. Please feel free to comment in case you need any further clarification.

Regards.