lucidrains / vit-pytorch

Implementation of Vision Transformer, a simple way to achieve SOTA in vision classification with only a single transformer encoder, in Pytorch
MIT License

Training results on my own dataset: a big gap between the train and validation sets #32

Open Erichen911 opened 3 years ago

Erichen911 commented 3 years ago

Hey guys. First of all, this is a great job, and thanks to the authors. Now my question: recently I used this code on my own dataset, a simple binary-classification problem. Performance on the training dataset is good, but much worse on the validation dataset. The loss curve is shown below:

[loss curve image]

My model is model = ViT(dim=128, image_size=224, patch_size=32, num_classes=2, depth=12, heads=8, mlp_dim=512, channels=3) (a runnable version follows below).
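For reference, the same configuration as a self-contained snippet; the dummy forward pass at the end is added only as a shape check and is not part of the original post:

```python
import torch
from vit_pytorch import ViT

# configuration from the post above
model = ViT(
    image_size = 224,
    patch_size = 32,
    num_classes = 2,
    dim = 128,
    depth = 12,
    heads = 8,
    mlp_dim = 512,
    channels = 3,
)

img = torch.randn(1, 3, 224, 224)   # dummy input, for a shape check only
preds = model(img)                  # -> (1, 2) logits for the two classes
```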

The training dataset has 1200+ images, and the validation dataset has 300+ images.

Can someone give me some suggestions on how to solve this problem?

I think there are several possibilities. Maybe I need a pretrained model? Or did I train the transformer model the wrong way?

lucidrains commented 3 years ago

@Erichen911 1200 is not enough! Off by 3 orders of magnitude at least!

lucidrains commented 3 years ago

@Erichen911 I would recommend getting a huge number of images, preferably a million at least, and then doing self-supervised learning with BYOL before fine-tuning on your tiny training set
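A sketch of that pretraining loop, assuming lucidrains' companion byol-pytorch package; the 'to_latent' hidden-layer name (the pooled-latent layer in this repo's ViT) and the random-image sampler standing in for the large unlabelled corpus are illustrative assumptions:

```python
import torch
from vit_pytorch import ViT
from byol_pytorch import BYOL

vit = ViT(
    image_size = 224, patch_size = 32, num_classes = 2,
    dim = 128, depth = 12, heads = 8, mlp_dim = 512,
)

learner = BYOL(
    vit,
    image_size = 224,
    hidden_layer = 'to_latent',  # representation taken just before the MLP head
)

opt = torch.optim.Adam(learner.parameters(), lr = 3e-4)

def sample_unlabelled_images():
    # stand-in for a loader over the large unlabelled image corpus
    return torch.randn(8, 3, 224, 224)

for _ in range(100):
    images = sample_unlabelled_images()
    loss = learner(images)
    opt.zero_grad()
    loss.backward()
    opt.step()
    learner.update_moving_average()  # update the target encoder's exponential moving average
```

After pretraining, the wrapped ViT can be fine-tuned on the small labelled set as usual.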

Otherwise, just use Ross' pretrained model!
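Assuming "Ross" refers to Ross Wightman's timm library, loading one of his pretrained ViT checkpoints with a fresh two-class head could look like this; the checkpoint name is an illustrative choice among many:

```python
import timm

# pretrained ViT from timm, with the classifier head re-initialized for 2 classes
model = timm.create_model('vit_base_patch16_224', pretrained = True, num_classes = 2)
```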

henbucuoshanghai commented 3 years ago

@Erichen911 Can you share your code? Thanks.

khawar-islam commented 3 years ago

Has anyone trained Swin Transformers with different image sizes? My image size is 112x112, and it never works at this size.
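The thread never resolves this, but one likely culprit is Swin's hierarchical downsampling. A sketch of the resolution arithmetic, assuming the standard Swin-T layout (patch size 4, window size 7, three 2x patch-merging steps): every stage resolution must stay integral and divide into the attention windows, which 224 satisfies but 112 does not.

```python
# stage resolutions for a standard Swin config:
# patch size 4, then three 2x patch-merging (downsampling) steps
def swin_stage_resolutions(image_size, patch_size = 4, num_merges = 3):
    res = image_size / patch_size
    sizes = [res]
    for _ in range(num_merges):
        res = res / 2
        sizes.append(res)
    return sizes

print(swin_stage_resolutions(224))  # [56.0, 28.0, 14.0, 7.0] -> integral, all fit 7x7 windows
print(swin_stage_resolutions(112))  # [28.0, 14.0, 7.0, 3.5]  -> the last merge is no longer integral
```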

abhigoku10 commented 3 years ago

@lucidrains @Erichen911 Can you share the train.py you are using for custom data, or any reference?
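The thread never got an answer, but a minimal train.py sketch for a custom binary-classification dataset in torchvision's ImageFolder layout might look like the following; the paths, batch size, learning rate, and epoch count are all placeholder assumptions, not the posters' actual script:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from vit_pytorch import ViT

# hypothetical paths: one subfolder per class under each split directory
tfm = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
train_ds = datasets.ImageFolder('data/train', transform = tfm)
valid_ds = datasets.ImageFolder('data/valid', transform = tfm)
train_dl = DataLoader(train_ds, batch_size = 32, shuffle = True)
valid_dl = DataLoader(valid_ds, batch_size = 32)

model = ViT(
    image_size = 224, patch_size = 32, num_classes = 2,
    dim = 128, depth = 12, heads = 8, mlp_dim = 512,
)
opt = torch.optim.Adam(model.parameters(), lr = 3e-4)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(10):
    model.train()
    for images, labels in train_dl:
        loss = loss_fn(model(images), labels)
        opt.zero_grad()
        loss.backward()
        opt.step()

    # track validation accuracy to watch the train/valid gap discussed above
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for images, labels in valid_dl:
            correct += (model(images).argmax(dim = -1) == labels).sum().item()
            total += labels.numel()
    print(f'epoch {epoch}: valid acc {correct / total:.3f}')
```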