We never trained ViTs from scratch on CIFAR-10. The smallest dataset we used for pre-training was imagenet2012, and we usually recommend pre-training on larger datasets (as opposed to pre-training on smaller datasets with more augmentation). For an empirical study of dataset size, data augmentation, model regularization, and compute, please refer to How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers.
@andsteing Is there any specific reason why you never trained it from scratch on CIFAR-10? I would like to know if my custom extension of ViT is sound, and I thought that CIFAR-10 would be a good testing ground, but I can't seem to reach a decent training accuracy/loss, even though I can overfit a single batch, albeit a bit slowly.
CIFAR-10 is a very small dataset. If you want to train something from scratch on that dataset, I would recommend using a ConvNet. The reason is that ConvNets by design have properties that make them a natural fit for image processing (e.g. translation equivariance arises simply from the way convolutions are applied). Vision Transformers, on the other hand, need to learn these priors from the data, which works well with large data but fails with small data.
See, for example, Figure 3 from An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, which shows that ConvNets ("BiT") perform better with medium-sized datasets ("ImageNet"), but ViTs start to perform better with larger data ("JFT-300M").
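To make that convolutional prior concrete, here is a small, self-contained JAX sketch (toy shapes and values, illustrative only): shifting the input shifts a convolution's output by the same amount, so translation equivariance comes for free from the operation itself, whereas a ViT has to learn such structure from data.

```python
import jax
import jax.numpy as jnp

# Toy input and kernel (NHWC / HWIO layouts); values are arbitrary.
key_img, key_ker = jax.random.split(jax.random.PRNGKey(0))
image = jax.random.normal(key_img, (1, 16, 16, 1))
kernel = jax.random.normal(key_ker, (3, 3, 1, 1))

def conv(x):
    return jax.lax.conv_general_dilated(
        x, kernel, window_strides=(1, 1), padding='SAME',
        dimension_numbers=('NHWC', 'HWIO', 'NHWC'))

# Shift the input two pixels to the right, then compare
# "convolve then shift" with "shift then convolve".
shifted_input = jnp.roll(image, shift=2, axis=2)
a = jnp.roll(conv(image), shift=2, axis=2)
b = conv(shifted_input)

# Away from the borders (where zero padding and wrap-around differ),
# the two results are identical.
print(float(jnp.max(jnp.abs(a - b)[0, 4:-4, 4:-4, 0])))  # -> 0.0
```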
Yes, of course, but I was wondering how one should debug the model to make sure that an extension is not doing something totally off. My guess is that when ViT was developed, the first proof-of-concept experiments were not done on ImageNet (but I might be wrong haha).
Then I would recommend simply using a trusted implementation and running it on a small dataset to get some rough estimate of the expected performance. You could then use those numbers to verify another implementation.
Note though that some bugs might only show up on larger data and longer training.
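A rough sketch of what that comparison could look like in practice: a single evaluation helper that computes CIFAR-10 test top-1 accuracy for any model, so the numbers from the trusted implementation and from the implementation under test are measured the same way. The `apply_fn(params, images) -> logits` interface and the minimal preprocessing (no resize, no augmentation) are assumptions for illustration, not this repo's actual API.

```python
import jax.numpy as jnp
import tensorflow_datasets as tfds

def cifar10_top1(apply_fn, params, batch_size=256):
    """Top-1 accuracy on the CIFAR-10 test split for any (apply_fn, params) pair."""
    ds = tfds.load('cifar10', split='test', batch_size=batch_size, as_supervised=True)
    correct, total = 0, 0
    for images, labels in tfds.as_numpy(ds):
        x = jnp.asarray(images, jnp.float32) / 255.0   # minimal preprocessing, no resize
        logits = apply_fn(params, x)                   # assumed signature: (params, images) -> logits
        correct += int(jnp.sum(jnp.argmax(logits, axis=-1) == jnp.asarray(labels)))
        total += int(labels.shape[0])
    return correct / total

# e.g. measure a trusted implementation and a custom one with the same metric:
# acc_ref = cifar10_top1(reference_model.apply, reference_params)   # hypothetical names
# acc_new = cifar10_top1(my_model.apply, my_params)
```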
Agreed, so basically I tried that with the vision_transformer implementation in this repo and trained it on CIFAR-10, aiming for performance comparable to what is reported in these two repos:
Unfortunately, I can't seem to reach that level of performance with the Flax ViT. I only get to about 52%, and it is apparently not due to overfitting, since the training and test losses are roughly the same.
I tried debugging by checking that the model can overfit a single batch, and it very much can.
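For reference, that single-batch overfitting check can be as small as the following Flax/Optax sketch. The tiny stand-in model, fake batch, and hyperparameters are only there to make the snippet self-contained; it is not the actual script mentioned below.

```python
import jax
import jax.numpy as jnp
import optax
import flax.linen as nn

class TinyNet(nn.Module):
    # Stand-in model; swap in the ViT (or extension) being debugged.
    @nn.compact
    def __call__(self, x):
        x = nn.relu(nn.Conv(16, (3, 3))(x))
        x = x.reshape((x.shape[0], -1))
        return nn.Dense(10)(x)

model = TinyNet()
key_x, key_y, key_init = jax.random.split(jax.random.PRNGKey(0), 3)
images = jax.random.normal(key_x, (32, 32, 32, 3))        # one fixed CIFAR-sized batch
labels = jax.random.randint(key_y, (32,), 0, 10)

params = model.init(key_init, images)
tx = optax.adam(1e-3)
opt_state = tx.init(params)

@jax.jit
def step(params, opt_state):
    def loss_fn(p):
        logits = model.apply(p, images)
        return optax.softmax_cross_entropy_with_integer_labels(logits, labels).mean()
    loss, grads = jax.value_and_grad(loss_fn)(params)
    updates, opt_state = tx.update(grads, opt_state)
    return optax.apply_updates(params, updates), opt_state, loss

for _ in range(500):
    params, opt_state, loss = step(params, opt_state)

# The loss on this one fixed batch should approach ~0; if it does not,
# the model or training step has a bug independent of dataset size.
print(float(loss))
```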
I will investigate further because I want this to work, but in the meantime, if you also want to look into it, I can send you a script that reproduces my current issue.
Excuse me, what is the top-1 accuracy on CIFAR-10 without dropout and resizing, when training from scratch? For example, for Mixer-B/16.