Closed briancheung closed 2 years ago
Hi @briancheung , let me try to answer your questions:
1) "Does anyone know how these weights were generated? Were they trained from scratch only on ImageNet-1K, or was the model pre-trained on ImageNet-21K?" The weights were trained by the torchvision maintainers. They are trained on ImageNet-1K only (no ImageNet-21K pre-training), with the following command parameters:
# train.py here refers to https://github.com/pytorch/vision/blob/main/references/classification/train.py
torchrun --nproc_per_node=8 train.py \
    --model vit_b_32 --epochs 300 --batch-size 512 --opt adamw --lr 0.003 --wd 0.3 \
    --lr-scheduler cosineannealinglr --lr-warmup-method linear --lr-warmup-epochs 30 \
    --lr-warmup-decay 0.033 --amp --label-smoothing 0.11 --mixup-alpha 0.2 --auto-augment imagenet \
    --clip-grad-norm 1 --ra-sampler --cutmix-alpha 1.0 --model-ema
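For intuition about the learning-rate flags above (`--lr 0.003`, `--lr-warmup-epochs 30`, `--lr-warmup-decay 0.033`, `--lr-scheduler cosineannealinglr` over `--epochs 300`), here is a rough sketch of the resulting per-epoch schedule: a linear warmup from `0.003 * 0.033` up to `0.003`, then cosine annealing toward zero. This is an illustration only, not torchvision's exact scheduler implementation (which steps per iteration and interacts with EMA and AMP):

```python
import math

BASE_LR = 0.003          # --lr
WARMUP_EPOCHS = 30       # --lr-warmup-epochs
WARMUP_DECAY = 0.033     # --lr-warmup-decay (starting factor of the base LR)
TOTAL_EPOCHS = 300       # --epochs

def lr_at(epoch: float) -> float:
    """Approximate LR at a given epoch: linear warmup, then cosine annealing to 0."""
    if epoch < WARMUP_EPOCHS:
        frac = epoch / WARMUP_EPOCHS
        # linearly interpolate the factor from WARMUP_DECAY to 1.0
        return BASE_LR * (WARMUP_DECAY + (1.0 - WARMUP_DECAY) * frac)
    # cosine annealing over the remaining epochs
    t = (epoch - WARMUP_EPOCHS) / (TOTAL_EPOCHS - WARMUP_EPOCHS)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * t))
```

Note also that with `--nproc_per_node=8` and `--batch-size 512` (per process), the effective global batch size is 8 × 512 = 4096.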
2) "Here are the corresponding numbers presented in the original Vision Transformer paper; the ViT-B/32 accuracy of 75.912 is not in either the ImageNet-1K or the ImageNet-21K columns." The accuracy of 75.912 was obtained by evaluating the weights on the ImageNet-1K validation set.
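Concretely, that number is top-1 accuracy over the 50,000-image ImageNet-1K validation split: the percentage of images whose highest-scoring logit matches the ground-truth label. A minimal stdlib-only sketch of the metric (torchvision's reference evaluation additionally handles batching, resizing/cropping, and top-5):

```python
def top1_accuracy(logits, labels):
    """Percentage of rows whose argmax equals the ground-truth label.

    logits: list of per-class score lists, one row per image.
    labels: list of integer class indices.
    """
    correct = sum(
        1
        for row, y in zip(logits, labels)
        if max(range(len(row)), key=row.__getitem__) == y
    )
    return 100.0 * correct / len(labels)
```

For example, `top1_accuracy([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]], [1, 0, 0])` scores two of three rows correct.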
Thank you, that clarifies things!
https://github.com/pytorch/vision/blob/62740807c18e68bb0acd85895dca527f9a655bd5/torchvision/models/vision_transformer.py#L377
Does anyone know how these weights were generated? Were they trained from scratch only on ImageNet 1k, or was the model pre-trained on ImageNet 21k? Looking at the original Vision Transformer paper (https://arxiv.org/abs/2010.11929), I'm not quite sure where the accuracy numbers in these lines are coming from:
Here are the corresponding numbers presented in the original Vision Transformer paper; the ViT-B/32 accuracy of 75.912 is not in either the ImageNet 1k or the ImageNet 21k columns:
cc @datumbox