Base network: ViT [2]
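The ViT base network first splits each input image into fixed-size patches that are flattened into tokens [2]. A minimal framework-agnostic sketch of that patchify step (the `patch_size` value and function name are illustrative, not from this repo):

```python
import numpy as np

def patchify(image, patch_size=4):
    """Split an image (H, W, C) into flattened non-overlapping patches,
    as in ViT [2]. patch_size=4 is illustrative; it divides both
    32x32 CIFAR10 and 28x28 MNIST inputs evenly."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    # Carve the image into a grid of patch_size x patch_size blocks.
    patches = image.reshape(h // patch_size, patch_size,
                            w // patch_size, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    # Flatten each patch into one token vector.
    return patches.reshape(-1, patch_size * patch_size * c)

img = np.random.rand(32, 32, 3)
tokens = patchify(img)  # (64, 48): an 8x8 grid of patches, 4*4*3 values each
```

Each token is then linearly projected and fed to the transformer encoder.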
We use the MNIST and CIFAR10 datasets to show that the framework works well in practice.
Any model can be plugged into this framework through the self-supervised sequence module [1].
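The self-supervised sequence module follows the data2vec teacher-student setup, in which the teacher's weights are an exponential moving average (EMA) of the student's [1]. A minimal framework-agnostic sketch of that update (the `tau` value and helper name are illustrative, not from this repo):

```python
import numpy as np

def ema_update(teacher_params, student_params, tau=0.999):
    """data2vec-style teacher update [1]: each teacher weight tensor
    tracks an EMA of the corresponding student tensor.
    tau is the EMA decay rate (illustrative value)."""
    return [tau * t + (1.0 - tau) * s
            for t, s in zip(teacher_params, student_params)]

# Toy example: one "layer" of weights for teacher and student.
teacher = [np.zeros(4)]
student = [np.ones(4)]
teacher = ema_update(teacher, student, tau=0.9)
# Each teacher entry has moved 10% of the way toward the student.
```

The student is trained to predict the teacher's latent targets for masked inputs, and after every optimizer step the teacher is refreshed with `ema_update`.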
MNIST | Accuracy | Student network latent space (PCA, t-SNE) | Student attention map |
---|---|---|---|
CIFAR10 | Accuracy | Student network latent space (PCA, t-SNE) | Student attention map |
---|---|---|---|
tensorflow == 2.10
[1] Baevski, Alexei, et al. "Data2vec: A general framework for self-supervised learning in speech, vision and language." arXiv preprint arXiv:2202.03555 (2022).
[2] Dosovitskiy, Alexey, et al. "An image is worth 16x16 words: Transformers for image recognition at scale." arXiv preprint arXiv:2010.11929 (2020).