Closed: cashincashout closed this issue 3 years ago
All the numbers reported in the paper are obtained WITHOUT mixup & cutmix.
In the early stage of this project, I ran several ablation studies exploring DeiT-style augmentation (including mixup & cutmix) for self-supervised learning, but did not see a performance improvement. The mixup & cutmix augmentation code remains in the code release so that interested people can explore it further.
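For anyone who wants to experiment with that augmentation path, here is a minimal, label-free sketch of what mixup & cutmix applied to a batch of views could look like in the self-supervised setting. It is an illustration only, not the exact code in this repo; the function name `mixup_cutmix_views` and the default hyperparameters are assumptions.

```python
import torch

def mixup_cutmix_views(x, mixup_alpha=0.8, cutmix_alpha=1.0, switch_prob=0.5):
    """Illustrative mixup/cutmix on a batch of views (no labels, SSL setting).

    x: (B, C, H, W) batch of augmented views.
    Returns the mixed batch; the mixing coefficient is sampled per call.
    """
    B, _, H, W = x.shape
    perm = torch.randperm(B, device=x.device)  # partner images for mixing
    if torch.rand(1).item() < switch_prob:
        # CutMix: paste a random rectangle from the permuted batch.
        lam = torch.distributions.Beta(cutmix_alpha, cutmix_alpha).sample().item()
        cut_h, cut_w = int(H * (1 - lam) ** 0.5), int(W * (1 - lam) ** 0.5)
        cy, cx = torch.randint(H, (1,)).item(), torch.randint(W, (1,)).item()
        y1, y2 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, H)
        x1, x2 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, W)
        x = x.clone()
        x[:, :, y1:y2, x1:x2] = x[perm, :, y1:y2, x1:x2]
    else:
        # Mixup: convex combination of the batch with its permutation.
        lam = torch.distributions.Beta(mixup_alpha, mixup_alpha).sample().item()
        x = lam * x + (1 - lam) * x[perm]
    return x
```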
For your request, "vanilla DINO with Swin-T/Swin-B as the backbone, i.e., EsViT with only the view-level task and without mixup & cutmix", please see the newly added table in README.md:
| arch | params | tasks | linear | k-nn | download | logs (train) | logs (linear) | logs (knn) |
|---|---|---|---|---|---|---|---|---|
| ResNet-50 | 23M | V | 75.0% | 69.1% | full ckpt | train | linear | knn |
| EsViT (Swin-T, W=7) | 28M | V | 77.0% | 74.2% | full ckpt | train | linear | knn |
| EsViT (Swin-S, W=7) | 49M | V | 79.2% | 76.9% | full ckpt | train | linear | knn |
| EsViT (Swin-B, W=7) | 87M | V | 79.6% | 77.7% | full ckpt | train | linear | knn |
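For context on the "tasks = V" column: it denotes training with the view-level (DINO-style) objective only, without EsViT's region-level task. Below is a minimal sketch of that view-level cross-entropy between teacher and student outputs, with centering and the multi-crop loop simplified; the function name `view_level_loss` and the temperature values are illustrative assumptions, not the repository's exact implementation.

```python
import torch
import torch.nn.functional as F

def view_level_loss(student_out, teacher_out, center,
                    student_temp=0.1, teacher_temp=0.04):
    """DINO-style view-level objective (simplified sketch).

    student_out, teacher_out: (B, K) projection-head outputs for two views.
    center: (1, K) running center that keeps the teacher from collapsing.
    """
    # Sharpened, centered teacher distribution (no gradient through the teacher).
    t = F.softmax((teacher_out - center) / teacher_temp, dim=-1).detach()
    # Cross-entropy of the teacher distribution against the student log-softmax.
    s = F.log_softmax(student_out / student_temp, dim=-1)
    return -(t * s).sum(dim=-1).mean()

# Usage sketch: loss for one (teacher view, student view) pair.
# In a real training loop, center would be updated with an EMA of teacher outputs.
B, K = 4, 1024
loss = view_level_loss(torch.randn(B, K), torch.randn(B, K), torch.zeros(1, K))
```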
Hi @ChunyuanLI, I noticed that mixup and cutmix are used during pre-training, which is not the case in DINO. I'm wondering about the performance gain brought by applying mixup & cutmix. Have you run any related experiments pre-trained without mixup? I'm especially interested in vanilla DINO with Swin-T/Swin-B as the backbone, i.e., EsViT with only the view-level task and without mixup & cutmix. It would be nice if you could share those results.