jeonsworld / ViT-pytorch

PyTorch reimplementation of the Vision Transformer (An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale)
MIT License

Loss can't drop #12

Closed QiushiYang closed 3 years ago

QiushiYang commented 3 years ago

Thank you so much for sharing your code. I tried to employ ViT as the encoder, followed by a common decoder, to build a segmentation network. I trained it from scratch but found that the loss doesn't drop from the beginning of training, and the results stay near 0. Is there any trick for training ViT correctly? Is it very important to load a pre-trained model and fine-tune? Here is my configuration:

- patch_size = 16
- hidden_size = 16*16*3 (= 768)
- mlp_dim = 3072
- dropout_rate = 0.1
- num_heads = 12
- num_layers = 12
- lr = 3e-4
- opt = Adam
- weight_decay = 0.0
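For reference, here is a minimal sketch of the kind of setup described above, assuming an encoder that returns per-patch tokens of shape (B, 1+N, hidden_size). The `SegViT` wrapper, its naive decoder, and the dummy encoder are illustrative assumptions, not the code from this issue: the patch tokens are reshaped back into a 2D grid and upsampled to per-pixel logits.

```python
import torch
import torch.nn as nn

class SegViT(nn.Module):
    """Hypothetical wrapper: ViT encoder + naive convolutional decoder."""
    def __init__(self, encoder, hidden_size=768, num_classes=21,
                 img_size=224, patch_size=16):
        super().__init__()
        self.encoder = encoder                  # assumed to return (B, 1+N, C) tokens
        self.grid = img_size // patch_size      # 224 / 16 = 14 patches per side
        self.decoder = nn.Sequential(           # 1x1 conv to class logits + upsample
            nn.Conv2d(hidden_size, num_classes, kernel_size=1),
            nn.Upsample(scale_factor=patch_size, mode="bilinear", align_corners=False),
        )

    def forward(self, x):
        tokens = self.encoder(x)                # (B, 1+N, C); token 0 is [CLS]
        patches = tokens[:, 1:, :]              # keep only the N patch tokens
        B, N, C = patches.shape
        fmap = patches.transpose(1, 2).reshape(B, C, self.grid, self.grid)
        return self.decoder(fmap)               # (B, num_classes, H, W)

# Usage with a dummy encoder standing in for the actual ViT:
dummy = lambda x: torch.randn(x.shape[0], 1 + 14 * 14, 768)
model = SegViT(dummy)
out = model(torch.randn(2, 3, 224, 224))        # -> (2, 21, 224, 224)
```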

jeonsworld commented 3 years ago

The pre-training and fine-tuning hyperparameters have different settings. Also, if you are training from scratch, you need to choose hyperparameters that suit that regime. The hyperparameters for both pre-training and fine-tuning can be found in the paper.
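For context, the paper's fine-tuning recipe uses SGD with momentum 0.9, a warmup-then-decay learning-rate schedule, and gradient clipping at global norm 1, while pre-training uses Adam. Below is a minimal PyTorch sketch of that fine-tuning setup; the step counts, base learning rate, and the tiny stand-in model are placeholder assumptions, not values from this thread:

```python
import math
import torch
from torch import nn
from torch.optim.lr_scheduler import LambdaLR

model = nn.Linear(768, 10)            # stand-in for the ViT; the recipe is what matters
optimizer = torch.optim.SGD(model.parameters(), lr=3e-2,
                            momentum=0.9, weight_decay=0.0)

warmup_steps, total_steps = 100, 1000  # placeholder values

def lr_lambda(step):
    if step < warmup_steps:           # linear warmup from 0 to the base lr
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay to 0

scheduler = LambdaLR(optimizer, lr_lambda)

for step in range(total_steps):
    x = torch.randn(8, 768)                                   # dummy batch
    y = torch.randint(0, 10, (8,))
    loss = nn.functional.cross_entropy(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip at global norm 1
    optimizer.step()
    scheduler.step()
```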

QiushiYang commented 3 years ago

Thanks a lot for your reply. It turned out to be a bug in my code, and I have fixed the problem. Many thanks.