pytorch / vision

Datasets, Transforms and Models specific to Computer Vision
https://pytorch.org/vision
BSD 3-Clause "New" or "Revised" License

Training recipe for these weights #5945

Closed · briancheung closed this issue 2 years ago

briancheung commented 2 years ago

https://github.com/pytorch/vision/blob/62740807c18e68bb0acd85895dca527f9a655bd5/torchvision/models/vision_transformer.py#L377

Does anyone know how these weights were generated? Were they trained from scratch only on ImageNet-1k, or were they pre-trained on ImageNet-21k? Looking at the original Vision Transformer paper (https://arxiv.org/abs/2010.11929), I'm not quite sure where the accuracy numbers in these lines come from:

class ViT_B_32_Weights(WeightsEnum):
    IMAGENET1K_V1 = Weights(
        url="https://download.pytorch.org/models/vit_b_32-d86f8d99.pth",
        transforms=partial(ImageClassification, crop_size=224),
        meta={
            **_COMMON_META,
            "num_params": 88224232,
            "min_size": (224, 224),
            "recipe": "https://github.com/pytorch/vision/tree/main/references/classification#vit_b_32",
            "metrics": {
                "acc@1": 75.912,
                "acc@5": 92.466,
            },
        },
    )
    DEFAULT = IMAGENET1K_V1

Here are the corresponding numbers presented in the original Vision Transformer paper; the ViT-B/32 accuracy of 75.912 appears in neither the ImageNet-1k nor the ImageNet-21k column:

[Image: results table from the ViT paper, listing top-1 accuracies per model with separate columns for ImageNet-1k and ImageNet-21k pre-training]

cc @datumbox

YosuaMichael commented 2 years ago

Hi @briancheung, let me try to answer your questions:

1) Does anyone know how these weights were generated? Were they trained from scratch only on ImageNet-1k or pre-trained on ImageNet-21k?

The weights were trained by the torchvision maintainers. They were trained from scratch on ImageNet-1K only (no ImageNet-21k pre-training), with the following command:

# train.py here refers to https://github.com/pytorch/vision/blob/main/references/classification/train.py
torchrun --nproc_per_node=8 train.py \
    --model vit_b_32 --epochs 300 --batch-size 512 --opt adamw --lr 0.003 --wd 0.3 \
    --lr-scheduler cosineannealinglr --lr-warmup-method linear --lr-warmup-epochs 30 \
    --lr-warmup-decay 0.033 --amp --label-smoothing 0.11 --mixup-alpha 0.2 --auto-augment imagenet \
    --clip-grad-norm 1 --ra-sampler --cutmix-alpha 1.0 --model-ema
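
For what it's worth, the recipe link and the reported accuracies are also attached to the weights enum, so they can be inspected at runtime. A minimal sketch, assuming torchvision >= 0.13; note that the meta key names have shifted across releases (newer versions renamed "metrics" to "_metrics"):

# Hedged sketch: read the recipe URL and reported metrics off the enum.
from torchvision.models import ViT_B_32_Weights

w = ViT_B_32_Weights.IMAGENET1K_V1
print(w.meta["recipe"])      # link to the training recipe shown above
print(w.meta["num_params"])  # 88224232
# Key name depends on the torchvision version:
print(w.meta.get("metrics") or w.meta.get("_metrics"))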

2) Here are the corresponding numbers presented in the original Vision Transformer paper; the ViT-B/32 accuracy of 75.912 is not in either the ImageNet-1k or the ImageNet-21k columns.

The 75.912 accuracy was obtained by evaluating the weights on the ImageNet-1K validation set. Since the weights come from torchvision's own recipe above rather than the paper's training setup, the number is not expected to match any column in the paper's table.
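
If you want to reproduce that number yourself, here is a minimal evaluation sketch. It assumes torchvision >= 0.13 and a local copy of the ImageNet-1K validation split; the path below is a placeholder:

import torch
from torch.utils.data import DataLoader
from torchvision.datasets import ImageFolder
from torchvision.models import vit_b_32, ViT_B_32_Weights

weights = ViT_B_32_Weights.IMAGENET1K_V1
model = vit_b_32(weights=weights).eval()
preprocess = weights.transforms()  # the eval transforms the metrics were reported with

# Hypothetical path; expects the standard wnid-folder layout of the val split.
val = ImageFolder("/path/to/imagenet/val", transform=preprocess)
loader = DataLoader(val, batch_size=64, num_workers=8)

top1 = top5 = total = 0
with torch.inference_mode():
    for images, targets in loader:
        _, pred5 = model(images).topk(5, dim=1)   # top-5 predicted classes
        correct = pred5.eq(targets.unsqueeze(1))  # (batch, 5) boolean matches
        top1 += correct[:, 0].sum().item()
        top5 += correct.any(dim=1).sum().item()
        total += targets.size(0)

print(f"acc@1={100 * top1 / total:.3f}  acc@5={100 * top5 / total:.3f}")

The reference train.py linked above also supports a --test-only flag for the same purpose.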

briancheung commented 2 years ago

Thank you, that clarifies things!