pytorch / vision

Datasets, Transforms and Models specific to Computer Vision
https://pytorch.org/vision
BSD 3-Clause "New" or "Revised" License

Training recipe for these weights #5945

Closed · briancheung closed this issue 2 years ago

briancheung commented 2 years ago

https://github.com/pytorch/vision/blob/62740807c18e68bb0acd85895dca527f9a655bd5/torchvision/models/vision_transformer.py#L377

Does anyone know how these weights were generated? Were they trained from scratch only on ImageNet-1k, or were they pre-trained on ImageNet-21k? Looking at the original Vision Transformer paper (https://arxiv.org/abs/2010.11929), I'm not quite sure where the accuracy numbers in these lines come from:

class ViT_B_32_Weights(WeightsEnum):
    IMAGENET1K_V1 = Weights(
        url="https://download.pytorch.org/models/vit_b_32-d86f8d99.pth",
        transforms=partial(ImageClassification, crop_size=224),
        meta={
            **_COMMON_META,
            "num_params": 88224232,
            "min_size": (224, 224),
            "recipe": "https://github.com/pytorch/vision/tree/main/references/classification#vit_b_32",
            "metrics": {
                "acc@1": 75.912,
                "acc@5": 92.466,
            },
        },
    )
    DEFAULT = IMAGENET1K_V1

Here are the corresponding numbers presented in the original Vision Transformer paper; the ViT-B/32 accuracy of 75.912 appears in neither the ImageNet-1k nor the ImageNet-21k column:

[Image: results table from the ViT paper, listing top-1 accuracies per model with separate columns for ImageNet-1k and ImageNet-21k pre-training]

cc @datumbox

YosuaMichael commented 2 years ago

Hi @briancheung, let me try to answer your questions:

1) Does anyone know how these weights were generated? Were they trained from scratch only on ImageNet-1k or pre-trained on ImageNet-21k?

The weights were trained by the torchvision maintainers. They were trained from scratch on ImageNet-1K only (no ImageNet-21k pre-training), with the following command:

# train.py here refers to https://github.com/pytorch/vision/blob/main/references/classification/train.py
torchrun --nproc_per_node=8 train.py \
    --model vit_b_32 --epochs 300 --batch-size 512 --opt adamw --lr 0.003 --wd 0.3 \
    --lr-scheduler cosineannealinglr --lr-warmup-method linear --lr-warmup-epochs 30 \
    --lr-warmup-decay 0.033 --amp --label-smoothing 0.11 --mixup-alpha 0.2 --auto-augment imagenet \
    --clip-grad-norm 1 --ra-sampler --cutmix-alpha 1.0 --model-ema
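
For what it's worth, the recipe link and the reported accuracies are also attached to the weights enum, so they can be inspected at runtime. A minimal sketch, assuming torchvision >= 0.13; note that the meta key names have shifted across releases (newer versions renamed "metrics" to "_metrics"):

# Hedged sketch: read the recipe URL and reported metrics off the enum.
from torchvision.models import ViT_B_32_Weights

w = ViT_B_32_Weights.IMAGENET1K_V1
print(w.meta["recipe"])      # link to the training recipe shown above
print(w.meta["num_params"])  # 88224232
# Key name depends on the torchvision version:
print(w.meta.get("metrics") or w.meta.get("_metrics"))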

2) Here are the corresponding numbers presented in the original Vision Transformer paper; the ViT-B/32 accuracy of 75.912 is not in either the ImageNet-1k or the ImageNet-21k columns.

The 75.912 accuracy was obtained by evaluating the weights on the ImageNet-1K validation set. Since the weights come from torchvision's own recipe above rather than the paper's training setup, the number is not expected to match any column in the paper's table.
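
If you want to reproduce that number yourself, here is a minimal evaluation sketch. It assumes torchvision >= 0.13 and a local copy of the ImageNet-1K validation split; the path below is a placeholder:

import torch
from torch.utils.data import DataLoader
from torchvision.datasets import ImageFolder
from torchvision.models import vit_b_32, ViT_B_32_Weights

weights = ViT_B_32_Weights.IMAGENET1K_V1
model = vit_b_32(weights=weights).eval()
preprocess = weights.transforms()  # the eval transforms the metrics were reported with

# Hypothetical path; expects the standard wnid-folder layout of the val split.
val = ImageFolder("/path/to/imagenet/val", transform=preprocess)
loader = DataLoader(val, batch_size=64, num_workers=8)

top1 = top5 = total = 0
with torch.inference_mode():
    for images, targets in loader:
        _, pred5 = model(images).topk(5, dim=1)   # top-5 predicted classes
        correct = pred5.eq(targets.unsqueeze(1))  # (batch, 5) boolean matches
        top1 += correct[:, 0].sum().item()
        top5 += correct.any(dim=1).sum().item()
        total += targets.size(0)

print(f"acc@1={100 * top1 / total:.3f}  acc@5={100 * top5 / total:.3f}")

The reference train.py linked above also supports a --test-only flag for the same purpose.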

briancheung commented 2 years ago

Thank you, that clarifies things!